ArthurConmy opened this issue 1 year ago
Unclear what the solution should be.

There are plausibly three different parameter counts that are helpful:

- Parameters in training
- Parameters ignoring embeddings
- Parameters used now (e.g. folding LayerNorm deletes some parameters)

I would appreciate people stating which parameter counts are most helpful to them.
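As a rough sketch of how the three counts could differ in practice (assumptions: the `fold_ln` flag and the `embed.W_E` / `pos_embed.W_pos` / `unembed.W_U` attribute names follow the current TransformerLens code, but this is illustrative, not an official API for reporting parameter counts):

```python
# Rough sketch, not an official TransformerLens feature: compare the three counts
# for a small model. Attribute names are assumptions based on the current codebase.
from transformer_lens import HookedTransformer

raw = HookedTransformer.from_pretrained("gpt2", fold_ln=False)    # weights as trained
folded = HookedTransformer.from_pretrained("gpt2", fold_ln=True)  # LayerNorm folded away

def total_params(model):
    # HookedTransformer is an nn.Module, so this is the plain PyTorch total
    return sum(p.numel() for p in model.parameters())

def embedding_params(model):
    # token embedding, positional embedding, and unembedding matrices
    return (model.embed.W_E.numel()
            + model.pos_embed.W_pos.numel()
            + model.unembed.W_U.numel())

print("parameters in training:    ", total_params(raw))
print("parameters sans embeddings:", total_params(raw) - embedding_params(raw))
print("parameters used now:       ", total_params(folded))
```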
IMO this should be just total parameters, for simplicity and alignment with the Pythia suite. Who cares about LayerNorm.
Describe the bug
The n_params counts calculated here are wrong. For example, LLaMA uses SwiGLU, so the 2x factor in the linked code is wrong. Further, this just ignores bias parameters, I think?

Code example
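For concreteness, a back-of-the-envelope sketch of the per-layer MLP weight count, using assumed LLaMA-7B-like dimensions (illustrative only, not taken from the linked code), showing how a hard-coded 2x factor undercounts a gated SwiGLU MLP and how biases get skipped:

```python
# Weight-only MLP parameter counts for one layer; dimensions are assumed
# (roughly LLaMA-7B-sized) and purely illustrative.
d_model, d_mlp = 4096, 11008

gelu_mlp = 2 * d_model * d_mlp    # W_in + W_out (what a 2x factor assumes)
swiglu_mlp = 3 * d_model * d_mlp  # W_in + W_gate + W_out

print(gelu_mlp)    # 90177536
print(swiglu_mlp)  # 135266304 -> 50% more per layer than the 2x estimate

# GPT-2-style MLPs also carry biases that a pure weight count skips:
bias_params = d_mlp + d_model     # b_in + b_out
print(bias_params)                # 15104
```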
System Info: N/A
Additional context: N/A