Your contributing doc link in the readme is broken :) so I'm making this suggestion here instead of as a pull request. You might be interested in the PAIR team's work (pair.withgoogle.com and https://github.com/pair-code); in particular, we do a lot of work on interpretability, including:
Interactive Explorable visualizations (https://pair.withgoogle.com/explorables/) explaining important and interesting ML phenomena; of particular relevance to LLMs are:
Code/Tools: The Learning Interpretability Tool (LIT, https://pair-code.github.io/lit/), a popular tool, especially within Google, for applying interpretability techniques to ML models (most often language models, though it works with many kinds of models and data); there's a rough usage sketch after the paper list below.
Some recent papers on interpretability of language models by PAIR:
"Interpretability Illusions in the Generalization of Simplified Models" – Dan Friedman, Andrew Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun [arXiv]
(EMNLP 2024) "Self-Influence Guided Data Reweighting for Language Model Pre-training", M Thakkar, T Bolukbasi, S Ganapathy, S Vashishth, S Chandar, P Talukdar [arxiv]
(EMNLP 2024). "Data Similarity is Not Enough to Explain Language Model Performance" - Greg Yauney, Emily Reif, David Mimno [acl]
(NeurIPS 2023) "Post Hoc Explanations of Language Models Can Improve Language Models" [arxiv] - Satyapriya Krishna, Jiaqi Ma, Dylan Slack, Asma Ghandeharioun, Sameer Singh, Himabindu Lakkaraju
NeurIPS 2023 Spotlight. "Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models" [arXiv, Tweet Summary] - Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun
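To make the LIT pointer above a bit more concrete, here is a minimal sketch of how a model and dataset get wired into the tool. It follows the general subclassing pattern from the lit_nlp documentation, but the ToyDataset/ToyModel classes and the length-based "score" are placeholders I'm assuming purely for illustration, and exact method names can differ between lit_nlp versions.

```python
# Minimal sketch of serving a model in LIT (the Learning Interpretability Tool).
# ToyDataset/ToyModel are hypothetical stand-ins; exact method names may vary
# slightly across lit_nlp versions.
from lit_nlp import dev_server
from lit_nlp import server_flags
from lit_nlp.api import dataset as lit_dataset
from lit_nlp.api import model as lit_model
from lit_nlp.api import types as lit_types


class ToyDataset(lit_dataset.Dataset):
  """A few text examples to browse in the LIT UI."""

  def __init__(self):
    self._examples = [{"text": "I loved this movie."},
                      {"text": "The plot was hard to follow."}]

  def spec(self):
    return {"text": lit_types.TextSegment()}


class ToyModel(lit_model.Model):
  """Stand-in model that scores inputs by length; swap in a real LM here."""

  def input_spec(self):
    return {"text": lit_types.TextSegment()}

  def output_spec(self):
    return {"score": lit_types.Scalar()}

  def predict_minibatch(self, inputs):
    # LIT calls this with a batch of examples matching input_spec().
    return [{"score": float(len(ex["text"]))} for ex in inputs]


if __name__ == "__main__":
  server = dev_server.Server(
      models={"toy": ToyModel()},
      datasets={"toy": ToyDataset()},
      **server_flags.get_flags())
  server.serve()  # Serves the LIT UI, by default on localhost:5432.
```

As far as I can tell, this mirrors how the bundled demos in the lit_nlp repo wrap real language models: the dataset and model wrappers declare their specs, and the UI's interpretability features are driven off those specs.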
Apologies for that complete oversight, and thank you for the suggestions :). Feel free to add any of the mentioned resources based on the guidelines. I fixed the files, and they should be available now.
And there's a lot more here: https://pair.withgoogle.com/research/