EleutherAI / elk

Keeping language models honest by directly eliciting knowledge encoded in their activations.
MIT License
178 stars 33 forks source link

Fit reporters on input embeddings as a sanity check #209

Closed norabelrose closed 1 year ago

norabelrose commented 1 year ago

I noticed that on some (model, dataset) pairs the VINC and/or logistic regression AUROC is rather high after the very first layer, which seemed moderately implausible to me. It struck me that we can fit reporters to the input embeddings to sanity check our results. The idea is that if the input embedding AUROC is significantly higher than 0.5 there must be something wrong with the code or the prompt templates or both, since you can't classify a statement as true or false only by looking at its very last token and nothing else.

CLAassistant commented 1 year ago

CLA assistant check
All committers have signed the CLA.

thejaminator commented 1 year ago

lgtm!