OpenMOSS / Language-Model-SAEs

For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research.

[Proposal] Optimize dataset loading and activation store #9

Closed Hzfinfdu closed 3 weeks ago

Hzfinfdu commented 3 weeks ago

The current activation store implementation has some drawbacks. We may need to add some new features to the streaming activation store and make some optimizations. Below I list some details.

  1. **Text Dataset Collate Config**

We need to support SAE training on both pretraining and SFT data, unlike Anthropic's Scaling Monosemanticity, in which only pretraining data is used to train SAEs on a supervised finetuned model.

IMO pretraining data should be packed, and SFT data should be sorted by length and batched with post-padding. Activations in the residual stream at padding-token positions should be ignored in SAE training. I believe this better fits the real-world distribution.

We need to add an option to the configuration to control this (see the collate sketch after this list).

  2. **Shuffle**

When training SAEs with data from multiple distributions, shuffling should be an option to add diversity of information within a batch. This can be implemented by filling the activation buffer from randomly chosen sources (a buffer sketch is given below).
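
For the first point, here is a minimal sketch of what a collate config could look like. The names (`TextDatasetConfig`, `collate_batch`, the `collate_mode` field) are illustrative assumptions, not the repository's actual API; it only shows the packed-vs-padded distinction and how a mask could mark padding positions whose activations the SAE trainer should skip.

```python
# Hypothetical collate config sketch; names are illustrative, not the repo's API.
from dataclasses import dataclass
from typing import List, Literal, Tuple

import torch


@dataclass
class TextDatasetConfig:
    # "packed": concatenate documents into fixed-length blocks (pretraining data).
    # "padded": sort by length and right-pad each batch (SFT data).
    collate_mode: Literal["packed", "padded"] = "packed"
    context_size: int = 2048
    pad_token_id: int = 0


def collate_batch(token_seqs: List[List[int]], cfg: TextDatasetConfig) -> Tuple[torch.Tensor, torch.Tensor]:
    """Return (tokens, mask); mask is False at positions whose activations
    should be ignored during SAE training (e.g. padding)."""
    if cfg.collate_mode == "packed":
        # Concatenate all sequences, then split into context_size blocks.
        flat = [t for seq in token_seqs for t in seq]
        n_blocks = len(flat) // cfg.context_size
        tokens = torch.tensor(
            flat[: n_blocks * cfg.context_size], dtype=torch.long
        ).view(n_blocks, cfg.context_size)
        mask = torch.ones_like(tokens, dtype=torch.bool)
    else:
        # Sort by length so each batch needs minimal padding, then right-pad.
        token_seqs = sorted(token_seqs, key=len)
        max_len = min(max(len(s) for s in token_seqs), cfg.context_size)
        tokens = torch.full((len(token_seqs), max_len), cfg.pad_token_id, dtype=torch.long)
        mask = torch.zeros_like(tokens, dtype=torch.bool)
        for i, seq in enumerate(token_seqs):
            seq = seq[: max_len]
            tokens[i, : len(seq)] = torch.tensor(seq, dtype=torch.long)
            mask[i, : len(seq)] = True
    return tokens, mask
```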
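
For the second point, a rough sketch of a buffer that mixes multiple activation sources. `ShuffledActivationBuffer` and its interface are assumptions for illustration, not the existing activation store; the idea is just to refill from a randomly chosen source iterator and shuffle rows across sources before emitting a batch.

```python
# Hypothetical shuffled activation buffer; class and method names are illustrative.
import random
from typing import Iterator, List

import torch


class ShuffledActivationBuffer:
    """Fills a buffer with activation chunks drawn from randomly chosen
    source iterators, then yields row-shuffled batches."""

    def __init__(self, sources: List[Iterator[torch.Tensor]], buffer_size: int, batch_size: int):
        self.sources = sources          # each yields (n, d_model) activation chunks
        self.buffer_size = buffer_size  # number of activation rows to keep buffered
        self.batch_size = batch_size
        self._buffer: List[torch.Tensor] = []

    def _refill(self) -> None:
        rows = sum(chunk.shape[0] for chunk in self._buffer)
        while rows < self.buffer_size and self.sources:
            src = random.choice(self.sources)   # pick a random distribution
            try:
                chunk = next(src)
            except StopIteration:
                self.sources.remove(src)        # drop exhausted sources
                continue
            self._buffer.append(chunk)
            rows += chunk.shape[0]

    def next_batch(self) -> torch.Tensor:
        self._refill()
        if not self._buffer:
            raise StopIteration("all activation sources are exhausted")
        acts = torch.cat(self._buffer, dim=0)
        perm = torch.randperm(acts.shape[0])    # shuffle rows across sources
        acts = acts[perm]
        batch, rest = acts[: self.batch_size], acts[self.batch_size :]
        self._buffer = [rest] if rest.shape[0] > 0 else []
        return batch
```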