[Replication] Interpretability Illusion

quinn-dougherty commented 3 years ago

Background

Replicate and visualize "An Interpretability Illusion for BERT" (https://arxiv.org/abs/2104.07143)

What to Replicate?

Modifications

Related Papers/Frameworks

josemlopez commented 2 years ago

Hi, I'll be reading and trying to replicate this paper.

About the preparation

I assume the idea is to use GPT-NeoX instead of BERT. I'll start with https://huggingface.co/EleutherAI/gpt-neo-1.3B, which is very straightforward to use. Please tell me if there is interest in using other models. I'll use similar datasets for the inputs: the Quora Question Pairs dataset, a question-answering dataset from Wikipedia, and the Toronto BookCorpus dataset.
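As a rough starting point, reading one hidden unit's activations from that checkpoint could look like the sketch below. It assumes the Hugging Face `transformers` and `torch` packages are installed. The helper name `neuron_activations`, the default layer and neuron indices, and the choice to read the layer's hidden state (rather than, say, an MLP activation) are placeholders of mine, not something taken from the paper.

```python
def neuron_activations(text, model_name="EleutherAI/gpt-neo-1.3B", layer=-1, neuron=0):
    """Return (token, activation) pairs for one hidden unit on one input text.

    Hypothetical helper: `layer` and `neuron` are arbitrary defaults.
    """
    # Imported lazily so this file can be loaded without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # One text at a time, so no padding token is needed.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # hidden_states is a tuple of (batch, seq_len, hidden_size) tensors:
    # the embedding output at index 0, then one entry per layer.
    hidden = outputs.hidden_states[layer][0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return list(zip(tokens, hidden[:, neuron].tolist()))
```

This reloads the model on every call, which is fine for a first look but wasteful for a full dataset; a real script would load the tokenizer and model once and reuse them.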

About the experiments

The replication can't be exact, both because the models differ and because of how the experiment works: the authors found a strong activation pattern between some inputs and one very specific neuron, then tried to reproduce that pattern with other datasets and checked whether the inputs shared a "similar" concept. We won't have the same neuron, the same inputs, or the same pattern, but I will try to find a generalisation of the paper's thesis.

Please don't hesitate to tell me if I've got something wrong, or if there is interest in testing specific inputs or conditions.
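The comparison step could be sketched roughly like this: for a fixed neuron, collect one score per sentence in each dataset, then look at the top-scoring sentences dataset by dataset to see whether they suggest the same concept. The function name `top_k_per_dataset` and the toy scores are made up for illustration; how a sentence-level score is derived from per-token activations is left open here.

```python
import heapq

def top_k_per_dataset(activations, k=3):
    """Pick the k highest-scoring sentences in each dataset.

    activations: dict mapping dataset name -> list of (sentence, score).
    Returns a dict mapping dataset name -> list of sentences, best first.
    """
    return {
        name: [sentence for sentence, _ in heapq.nlargest(k, rows, key=lambda r: r[1])]
        for name, rows in activations.items()
    }

# Made-up scores purely for illustration; not real model output.
toy = {
    "quora": [("How do I learn Python?", 0.9), ("Is tea healthy?", 0.1), ("What is 2+2?", 0.7)],
    "books": [("She walked to the door.", 0.8), ("He counted to ten.", 0.95)],
}
top = top_k_per_dataset(toy, k=2)
for name, sentences in top.items():
    print(name, sentences)
```

Reading the per-dataset lists side by side is then a manual judgment call: if each dataset's top sentences point to a different concept, that would be in line with what the paper reports.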

StellaAthena commented 2 years ago

@josemlopez did anything happen with this?

josemlopez commented 2 years ago

Hi Stella, thanks for checking. After a few days on this, a storm hit at my company, which I expect to leave behind next week. That said, I changed my mind after failing to replicate the experiment on my first attempt. My plan now is to 1) replicate it with BERT and, once I see that working, 2) replicate it with GPT-NeoX.

StellaAthena commented 2 years ago

@josemlopez Cool! I assigned you the issue and look forward to your results with BERT.
