neelnanda-io commented 6 days ago

Added support for Gemma-2 models.

Support for attention and output logit soft-capping (ie attn_scores -> 50 * tanh(attn_scores / 50), basically squashing it into the interval [-50, 50])
- Gotcha: Output logits with softcap are not invariant to adding a constant.
Support for before and after LN
Minor refactors, eg now MLP activation functions are defined as a dict in utils.py rather than a stack of if statements in both mlp.py and gated_mlp.py

Key differences between Gemma-1 and Gemma-2:

Gemma-2 has an RMS Norm at the start and end of each layer. You cannot fold the RMS Norm weights of the ones at the end.
Attn_scores and final logits are softcapped (with constants 50 and 30 respectively)
Every other attn layer (1, 3, 5, etc) uses local, sliding window attention, with window 4096 (the max ctx is 8192, so I don't know why they bother tbh)
They use grouped query attention, with group size 2 (ie every adjacent pair of heads use the same value and key)

Type of change

Please delete options that are not relevant.

[X] New feature (non-breaking change which adds functionality)
[X] This change requires a documentation update

Checklist:

[X] I have commented my code, particularly in hard-to-understand areas
[X] I have made corresponding changes to the documentation
Note: I have not manually updated the model_properties_table - is this automatic or do I need to?
[X] My changes generate no new warnings
[ ] I have added tests that prove my fix is effective or that my feature works
Given that the smallest model is 9B, tests don't really seem practical. I have manually verified that the logits agree with HuggingFace for the 9B on the string "The capital of France is", at max deviation 8.2016e-05 for float32, though notably worse for low precision: 0.5 for bfloat16, 0.0664 for float16. To me this seems annoying but acceptable. I have not tested that things work well on really long strings or generations.
[ ] New and existing unit tests pass locally with my changes
[X] I have not rewritten tests relating to key interfaces which would affect backward compatibility

neelnanda-io commented 6 days ago

Looks like abstract_attention.py assumed that n_heads * d_head == d_model, which is normally true but not true for Gemma-2 27B, I fixed that in the latest commit (this should do nothing for any model where that relation is true)

neelnanda-io commented 6 days ago

Hmm, there's at least a 0.2 difference in logits for the 27B in float32, which is concerning... And quite surprising tbh, I don't see any architectural difference between the 9B and 27B, which suggests this would only be cascading errors? I did get the HF logits on CPU and TL logits across 2 GPUs though, which might cause some additional divergence?

ArthurConmy commented 6 days ago

Hmm, there's at least a 0.2 difference in logits for the 27B in float32, which is concerning... And quite surprising tbh, I don't see any architectural difference between the 9B and 27B, which suggests this would only be cascading errors? I did get the HF logits on CPU and TL logits across 2 GPUs though, which might cause some additional divergence?

I think you should move on and put a loud warning when loading 27B model. We have lots of evidence that TransformerLens' slight differences from HuggingFace have cascading errors since we see much worse numerical errors in models with lots of layers e.g. here and here. @bryce13950 has mentioned trying to improve the numerical errors, so I doubt you made the error in implementation. The only question to me is whether the 27B model should even be merged.

neelnanda-io commented 5 days ago

Fair. I think 27B should be merged, but printing a warning sounds good. I was mostly surprised at such a big jump between 9B and 27B of the error, when even cascading errors shouldn't explain that IMO - it's like 42 vs 46 layers, just a fair bit wider and with about twice as many neurons per MLP layer

bryce13950 commented 4 days ago

I am going to try and fold in some accuracy improvements, specifically in MLPs, into this when I put it up. I wouldn’t worry about adding warnings or anything for the time being. I have a list of models to try once that is done. This is second on the list now.

bryce13950 commented 4 days ago

We crossed paths a little bit with this. I have done quite a bit with MLPs recently, and we did a few things very similar. I have been thinking about how to put the two changes together, and I think I am going to wrap up my branch first, given that it affects accuracy with existing models. I am basically redoing the entire set of components, and I would bet that what I am doing is going to increase accuracy here.

bryce13950 commented 4 days ago

Just one comment from me on the code so far. I am going to wrap up my work, and come back to this afterwards to test it, and load up the code locally to play around a bit.

JThh commented 1 day ago

Hey @neelnanda-io thanks for making this pr to support Gemma2 series model. I am aware that this pr has not been ready, but I use what it has and am training a SAE on the resid_post position. The run is being recorded here. It might also be good auxiliary reference for further code adjustments to this pr.

neelnanda-io commented 1 day ago

Cool! Let me know if you run into any issues

On Thu, 4 Jul 2024, 2:53 pm Jiatong (Julius) Han, @.***> wrote:

Hey @neelnanda-io https://github.com/neelnanda-io thanks for making this pr to support Gemma2 series model. I am aware that this pr has not been ready, but I use what it has and am training a SAE on the resid_post position. The run is being recorded here https://wandb.ai/jiatongg/sae_semantic_entropy/runs/e88i5gcc?nw=nwuserjiatongg. It might also be good auxiliary reference for further code adjustments to this pr.

— Reply to this email directly, view it on GitHub https://github.com/TransformerLensOrg/TransformerLens/pull/650#issuecomment-2209056337, or unsubscribe https://github.com/notifications/unsubscribe-auth/ASRPNKL7RALWZGEOVUCVN3LZKVHW3AVCNFSM6AAAAABKDSOZC6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMBZGA2TMMZTG4 . You are receiving this because you were mentioned.Message ID: @.***>

bryce13950 commented 8 hours ago

MLP outputs are now perfect on these models

bryce13950 commented 8 hours ago

TransformerLensOrg / TransformerLens

Added support for Gemma-2 #650

Type of change

Checklist: