PhilipQuirke / quanta_maths

Tool used to verify the accuracy of transformer models
Apache License 2.0

Extend: Consider integrating Sparse Autoencoder (SAE) features into library #21

Open amirabdullah19852020 opened 7 months ago

amirabdullah19852020 commented 7 months ago

Right now, the library treats MLPs as atomic blocks, which is fine. However, with an SAE we can decompose them into a smaller number of interpretable features. The SAE approach is somewhat sensitive to hyperparameters, but it would still be an interesting integration.
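To make the proposal concrete, here is a minimal, dependency-free sketch of the SAE idea: encode an MLP activation vector into a set of feature activations made sparse by a ReLU, then linearly reconstruct the original activations from the few features that fire. This is not code from quanta_maths; all names (`SparseAutoencoder`, `d_mlp`, `d_features`) are hypothetical, and a real SAE would be trained with a reconstruction-plus-sparsity loss rather than used with random weights.

```python
import math
import random

def relu(x):
    # ReLU zeroes negative pre-activations, which is what makes the
    # feature vector sparse and non-negative.
    return [max(0.0, v) for v in x]

def matvec(W, x):
    # Plain matrix-vector product over nested lists.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

class SparseAutoencoder:
    """Hypothetical minimal SAE: encode MLP activations into sparse
    features, then linearly decode back to a reconstruction."""

    def __init__(self, d_mlp, d_features, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(d_mlp)
        # Randomly initialised weights stand in for trained ones.
        self.W_enc = [[rng.uniform(-scale, scale) for _ in range(d_mlp)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.W_dec = [[rng.uniform(-scale, scale) for _ in range(d_features)]
                      for _ in range(d_mlp)]
        self.b_dec = [0.0] * d_mlp

    def encode(self, acts):
        pre = [p + b for p, b in zip(matvec(self.W_enc, acts), self.b_enc)]
        return relu(pre)

    def decode(self, feats):
        return [p + b for p, b in zip(matvec(self.W_dec, feats), self.b_dec)]

# Example: decompose a 4-dim MLP activation into 8 candidate features.
acts = [0.5, -1.2, 0.3, 0.8]
sae = SparseAutoencoder(d_mlp=4, d_features=8)
feats = sae.encode(acts)
recon = sae.decode(feats)
# Only the features with non-zero activation are "active" for this input;
# in a trained SAE these are the interpretable units.
active = [i for i, f in enumerate(feats) if f > 0.0]
```

The point of the integration would be that, instead of attributing behaviour to a whole MLP block, analysis could attribute it to the handful of indices in `active`.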

PhilipQuirke commented 7 months ago

Agreed. The ACDC library also has node-connection information at the attention-head and MLP-neuron level - another reason to move to finer granularity (over time).