AlignmentResearch / tuned-lens

Tools for understanding how transformer predictions are built layer-by-layer
https://tuned-lens.readthedocs.io/en/latest/
MIT License

Issue with TunedLens paper experiment replication #124

Closed Nicole-Nobili closed 10 months ago

Nicole-Nobili commented 10 months ago

Hello and thank you for the great work! We are trying to replicate this exact plot from the TunedLens paper, but we weren't able to:

[Figure: tuned_lens_aitchison]

We used this library's lenses together with the Aitchison similarity code from the white_box folder, which was present in earlier commits, but we may be missing something.

Do you by any chance have a script that would help us replicate this result?

levmckinney commented 10 months ago

I believe we removed the Aitchison similarity code in this commit.

If you want to replicate this figure, you have two options. First, you can check out the commit just before that one and try to run the replication script found in intervention.py. Second, you can try to bring the code forward so that it works with the most recent version of the code base.

To do the latter, you would need to do a bit of refactoring. Start with the original code for generating this figure, which you can find here: https://github.com/AlignmentResearch/tuned-lens/blob/1ece574e73e3df9e150ceaa4cd129b24a4f935d2/tuned_lens/causal/intervention.py#L148. You would then need to handle changes such as the renaming of Decoder -> Unembed (#55) and the removal of the aitchison_similarity function, which you would have to bring forward as well.
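For anyone bringing the similarity function forward, here is a minimal plain-Python sketch of the underlying idea: the cosine similarity between the centered log-ratio (CLR) transforms of two log-probability vectors. This is an assumption about what is needed, not the removed function itself. The names `clr` and `aitchison_similarity` below, the `eps` parameter, and the unweighted form are illustrative choices, and the original implementation at the linked commit may differ (for example in weighting or tensor handling), so check it against that code before relying on it.

```python
import math
from typing import List, Sequence


def clr(log_p: Sequence[float]) -> List[float]:
    """Centered log-ratio transform: subtract the mean log-probability."""
    mean = sum(log_p) / len(log_p)
    return [x - mean for x in log_p]


def aitchison_similarity(
    log_p: Sequence[float], log_q: Sequence[float], eps: float = 1e-8
) -> float:
    """Cosine similarity of the CLR-transformed inputs (unweighted sketch).

    Hypothetical stand-in, not the function removed from tuned-lens.
    """
    if len(log_p) != len(log_q):
        raise ValueError("log_p and log_q must have the same length")
    u, v = clr(log_p), clr(log_q)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp the denominator so a uniform distribution (zero CLR norm)
    # yields 0.0 instead of a division by zero.
    return dot / max(norm_u * norm_v, eps)


if __name__ == "__main__":
    log_p = [math.log(x) for x in (0.7, 0.2, 0.1)]
    log_q = [math.log(x) for x in (0.1, 0.2, 0.7)]
    print(aitchison_similarity(log_p, log_p))
    print(aitchison_similarity(log_p, log_q))
```

Because the CLR transform subtracts the mean, the result is unchanged if a constant is added to either input, so unnormalized logits and normalized log-probabilities give the same similarity in this sketch.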

We removed this code mainly because we didn't think it was worth the time to maintain as we converted the code base from a research focus to more of a general-purpose library. If you do go the route of bringing some of this code forward, PRs are welcome.

Nicole-Nobili commented 10 months ago

Thank you for your quick and complete reply, that's exactly what we were looking for! We'll look through the code and open a PR if we decide to bring this code forward.