OSU-NLP-Group / GrokkedTransformer

Code for NeurIPS'24 paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization'
MIT License

Re-implementing this work in a smaller setting #5

Closed hbin0701 closed 1 month ago

hbin0701 commented 1 month ago

Hi. Thank you for sharing such awesome work and releasing the code. I really enjoyed reading the paper! :)

Since replicating this experiment requires a lot of compute time, I'm wondering whether it is replicable in a smaller setting, with fewer entities and relations. So far, I haven't been able to find one :(

If you discovered any smaller-sized setting in which grokking occurs (or at least in-distribution generalization) while running your experiments, I would really appreciate it if you could share it!

s-kostyaev commented 1 month ago

Maybe this can help: https://github.com/ironjr/grokfast

hbin0701 commented 1 month ago

Thanks for the reply :)

I have one more question. According to composition.ipynb, it seems the authors chose "test_inferred_facts" (which is "test_inferred_ood") as the eval_dataset. I was wondering whether it would be more intuitive to use "test_inferred_iid" as the eval_dataset, so that we can actually observe the grokking phenomenon during training. Or was it set that way to confirm that generalization does not happen on "test_inferred_ood"?

(With this setting, I was able to observe the grokking phenomenon as desired :))

Thanks in advance!

Boshi-Wang commented 1 month ago

Thanks for the interest!

Yes, you can scale down the number of entities/relations/attributes/etc. when generating the atomic and inferred facts; this can be done by changing the numbers in {composition/comparison}.ipynb. We only experimented with larger data sizes in the paper (see, e.g., Section 3.2), and the observations are consistent; I'm rather sure the same observations should hold at smaller scales (not "too" small, of course).
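To illustrate what scaling down means here, this is a minimal sketch of a two-hop composition dataset at a small size. The function name and the exact numbers are mine, not the repo's notebook: the key idea is just that each relation is a random map over entities, atomic facts are (head, relation) → tail, and inferred facts chain two atomic lookups.

```python
import random

def make_composition_data(num_entities=200, num_relations=20, seed=0):
    """Hypothetical small-scale sketch of the composition data setup
    (not the repo's actual generation code)."""
    rng = random.Random(seed)
    entities = list(range(num_entities))
    relations = list(range(num_relations))
    # Each relation acts as a random function from entities to entities:
    # atomic fact (h, r) -> t.
    atomic = {(h, r): rng.choice(entities)
              for h in entities for r in relations}
    # Two-hop inferred fact (h, r1, r2) -> t, where t = atomic[(atomic[(h, r1)], r2)].
    inferred = [(h, r1, r2, atomic[(atomic[(h, r1)], r2)])
                for h in entities for r1 in relations for r2 in relations]
    return atomic, inferred

atomic, inferred = make_composition_data()
print(len(atomic), len(inferred))  # → 4000 80000
```

From there you would tokenize the facts, split the inferred facts into train / test_inferred_iid / test_inferred_ood, and shrink `num_entities`/`num_relations` further to taste (keeping in mind the "not too small" caveat above).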

Regarding the eval: yes, we chose the OOD performance as the metric there (btw, on the comparison task the model does generalize OOD), but feel free to change it (or evaluate on both splits) according to your purpose.
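A minimal sketch of tracking both splits at once, so you can watch grokking on the IID split while also checking OOD generalization. `LookupModel`, the split contents, and the metric names here are illustrative stand-ins, not the repo's actual code:

```python
class LookupModel:
    """Toy stand-in for a trained model: it has memorized a table of facts."""
    def __init__(self, known_facts):
        self.table = {(h, r1, r2): t for h, r1, r2, t in known_facts}

    def predict(self, h, r1, r2):
        return self.table.get((h, r1, r2), -1)  # -1 = "don't know"

def accuracy(model, split):
    """Fraction of (h, r1, r2, t) facts the model predicts correctly."""
    correct = sum(model.predict(h, r1, r2) == t for h, r1, r2, t in split)
    return correct / len(split)

# Two tiny illustrative splits: this toy model has "learned" the IID facts
# but none of the OOD ones.
test_inferred_iid = [(0, 1, 2, 7), (3, 1, 2, 9)]
test_inferred_ood = [(5, 1, 2, 4), (6, 1, 2, 8)]
model = LookupModel(test_inferred_iid)

metrics = {"iid_acc": accuracy(model, test_inferred_iid),
           "ood_acc": accuracy(model, test_inferred_ood)}
print(metrics)  # → {'iid_acc': 1.0, 'ood_acc': 0.0}
```

Logging both numbers per eval step gives you the full picture: IID accuracy jumping long after training accuracy saturates is the grokking signal, while the OOD curve tells you whether that generalization extends beyond the in-distribution facts.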