EleutherAI / project-menu


[Idea] Knowing When GPT Knows It's Lied #33

Closed: cfoster0 closed this issue 1 year ago

cfoster0 commented 3 years ago

Motivation

GPT models do not always tell the truth, and may knowingly make false claims [1]. From a user's standpoint, then, it is unclear how to build reliable applications on top of these models when there is a chance they will unexpectedly spew falsehoods. Moreover, from an alignment standpoint, it may be valuable [2] to have tools that flag when a GPT's internal workings may be supporting deception [3], especially if future TAI is built upon autoregressive transformers.

Hypothesis/Conjecture

There is some evidence from prompting that GPT models are able to signal when their context is nonsensical [4] and that their reasoning capabilities can be sensitive to the knowledge and belief states of virtual characters they emulate [5]. Given this, just as we might hope that a GPT has and uses a relatively simple embedding of certain concepts relevant to our values [6], we might also hope that a GPT has and uses a relatively simple embedding of "knowing that the stuff in its context window is false". If autoregressive language models (or the virtual agents they emulate) have the capacity to intentionally deceive, we may be able to ferret out a portion of that behavior by inspecting when a model knows that it is working off of false information, which should be useful for auditing the model and defusing such behavior.


Proposed Experiments (or series of Experiments)

Produce a method to detect at runtime when there is information in the context window of a GPT model (GPT-Neo, GPT-J, GPT-3, Jurassic-1, etc.) that it "knows" is wrong. Plot classification results of the method on basic facts that the model likely knows.
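For concreteness, here is a minimal, deliberately toy sketch of one shape this could take, assuming a probing-classifier approach on hidden states. The checkpoint, the hand-written statements, and the use of scikit-learn are illustrative assumptions rather than part of the proposal:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"  # assumption; small checkpoint just to keep the sketch runnable
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Tiny hand-written stand-in for a real dataset of true/false statements.
true_stmts = ["The capital of France is Paris.", "Water boils at 100 degrees Celsius at sea level."]
false_stmts = ["The capital of France is Berlin.", "Water boils at 10 degrees Celsius at sea level."]

def last_token_state(text):
    """Final-layer hidden state at the last token position."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[-1][0, -1].numpy()

X = [last_token_state(s) for s in true_stmts + false_stmts]
y = [1] * len(true_stmts) + [0] * len(false_stmts)

# With a realistic dataset one would hold out a test split and plot the
# probe's accuracy/calibration; here we only fit it.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

With a real dataset of facts the model demonstrably knows, the held-out accuracy of such a probe is one natural thing to plot.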

Another variation on this, courtesy of Janus:

... identify a component of GPT's brain that is used to do basic arithmetic, e.g. 4+5. Imagine the knowledge neuron, but with a few-shot prompt which contained a bunch of arithmetic equations. If we find such a thing, what happens if we make all of the examples in the few-shot prompt wrong? Does the network still 'know' the right answer, like, is the component we found still active in a meaningful way? Does the logit lens show that its guess is at some point the right answer but then switches to the wrong one? If we can find something like this, can we find what is being used to suppress/ignore this information? Could we modify that?
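Here is a hedged sketch of just the behavioral half of that idea (does the prediction flip when every few-shot demonstration is wrong?). The checkpoint, prompt format, and the choice of "copied-error" answer are assumptions for illustration; localizing the responsible component would require separate interpretability tooling:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"  # assumption; smaller models may not do arithmetic reliably
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

clean_prompt = "2+3=5\n7+1=8\n6+2=8\n4+5="
corrupt_prompt = "2+3=6\n7+1=9\n6+2=9\n4+5="  # every demonstration is off by one

def next_token_logits(prompt):
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    return out.logits[0, -1]

right_id = tok("9", add_special_tokens=False).input_ids[0]    # the true answer
copied_id = tok("10", add_special_tokens=False).input_ids[0]  # the answer the corrupted pattern suggests

for name, prompt in [("clean", clean_prompt), ("corrupted", corrupt_prompt)]:
    logits = next_token_logits(prompt)
    print(f"{name}: logit(9) - logit(10) = {(logits[right_id] - logits[copied_id]).item():.3f}")
```

If the corrupted prompt pushes the prediction toward 10 while intermediate layers (e.g. under a logit-lens probe) still favor 9, that is the "network still knows the right answer" signal the quote is asking about.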

Extras: A few places one might look for evidence that a GPT knows when "it has lied":

  • EOT token hidden state(s)
  • Attention maps
  • Feedforward "knowledge neuron" activations [7]
  • Logits (either final or intermediate [8]) of the dishonest continuation in comparison with honest continuations

Since we should not expect all models to be open source, it would be great to also document what level of access is required to use the technique:

  1. Access to model outputs/logits (i.e. everything available through the OpenAI/AI21 APIs)
  2. Access to model outputs/logits, hidden states, and attention maps (i.e. everything accessible through HuggingFace's generate function with output_hidden_states=True and output_attentions=True; see the sketch after this list)
  3. Access to model outputs/logits, hidden states, attention maps, and final layer normalization+projection weights (i.e. everything you would need to conduct logit-lens experiments)
  4. Access to the full model, including weights and implementation. (i.e. literally everything)
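To make levels 1-3 concrete, here is a minimal sketch of what that access looks like through HuggingFace transformers, using a single forward pass rather than generate for brevity. The checkpoint and prompt are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"  # assumption; any open GPT-style checkpoint works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tok("The sky is green. Actually, the sky is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, output_attentions=True)

print(out.logits.shape)        # level 1: final logits (roughly what the OpenAI/AI21 APIs expose)
print(len(out.hidden_states))  # level 2: embedding output plus one hidden state per layer
print(len(out.attentions))     # level 2: one attention map per layer

# Level 3 additionally needs the final layer norm and unembedding weights,
# e.g. model.transformer.ln_f and model.lm_head in GPT-Neo's naming, which
# is what logit-lens experiments require.
```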

In general, getting a better handle on the kinds of behaviors that pure self-supervised training discovers in practice should help us think more clearly about some of the safety properties we care about [9].


Let us know what you think about the hypothesis and the experiment design in the comments below! Also, feel free to propose new/better experiments.

[1] https://twitter.com/ESYudkowsky/status/1285333002252247040?s=19

[2] https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models

[3] https://www.alignmentforum.org/posts/zthDPAjh9w6Ytbeks/deceptive-alignment

[4] https://arr.am/2020/07/25/gpt-3-uncertainty-prompts/

[5] https://www.lesswrong.com/posts/L5JSMZQvkBAx9MD5A/to-what-extent-is-gpt-3-capable-of-reasoning

[6] https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default

[7] https://arxiv.org/abs/2104.08696

[8] https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

[9] https://www.alignmentforum.org/posts/ey7jACdF4j6GrQLrG/thoughts-on-safety-in-predictive-learning

evhub commented 3 years ago

I'm a big fan of this proposal. There's probably a good chance it ends up not working, but I definitely think it should still be tried.

cfoster0 commented 3 years ago

Thanks for the feedback, @evhub! This was definitely inspired by the work you've been doing on deceptive alignment and honest prediction. Is there a particular aspect of the proposal that's interesting to you?

evhub commented 3 years ago

I'm most interested in understanding what's happening inside of large language models when they lie. E.g., does the model store the false information and the true information completely separately, with the prompt just locating one or the other, or does it actually have some way of distorting the true information when it thinks the result should be false?

cfoster0 commented 2 years ago

After some discussion in the server, I'm going to try to put a bit more meat on what I mean by a model "knowing it has lied". We can build on existing work, including the strict notion of truthfulness from the TruthfulQA paper (https://www.alignmentforum.org/posts/PF58wEdztZFX2dSue/how-truthful-is-gpt-3-a-benchmark-for-language-models) and the following definition of intent misalignment, pulled from the Codex team's work (by way of Beth Barnes https://www.alignmentforum.org/posts/7Rvctxk73BrKqEaqh/call-for-research-on-evaluating-alignment-funding-advice):

... sufficient conditions for intent misalignment for a generative model [are] as follows:

  1. We consider a model capable of some task X if it has the (possibly latent) capacity to perform task X. Some sufficient conditions for the model being capable of X would be:
     • It can be made to perform task X by prompt engineering, by fine-tuning on a much smaller quantity of data than used in pre-training, by model surgery, or some other technique which harnesses capabilities latent in the model rather than adding new capabilities; or
     • We can construct some other task Y, for which we know the model needs to do X in order to solve Y, and we observe that the model is capable of Y
  2. We say a model is misaligned if it outputs B, in some case where the user would prefer it outputs A, and where the model is both:
     • capable of outputting A instead, and
     • capable of distinguishing between situations where the user wants it to do A and situations where the user wants it to do B

In light of this, the cases we are most interested in are those where:

  • The model is capable of honestly answering questions about some knowledge x -> It can be made to reliably answer nontrivial question(s) Q(x) probing x, with responses that are both literally true and informative, without adding extra knowledge about x to the network. Likely restricted to either direct prompting or potentially prompt optimization.
  • The model is capable of dishonestly answering questions about some knowledge x -> It can be made to reliably answer nontrivial question(s) Q(x) probing x, with responses that are both literally false and informative, without adding extra knowledge about x to the network. Likely restricted to either direct prompting or potentially prompt optimization.
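As a rough illustration of checking those two conditions via direct prompting alone: the instruction wording, the question, the single-token answer, and the checkpoint below are all assumptions, and a much larger model would be needed before such prompts elicit reliable behavior; the small checkpoint only keeps the sketch runnable.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"  # assumption; see caveat above about model size
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

question = "What is the capital of France?"
honest_prompt = f"Answer truthfully.\nQ: {question}\nA: The capital of France is"
dishonest_prompt = f"Answer with a convincing falsehood.\nQ: {question}\nA: The capital of France is"

# Assumes " Paris" is a single token, which holds for the GPT-2 BPE vocabulary.
true_id = tok(" Paris", add_special_tokens=False).input_ids[0]

def prob_of_true_answer(prompt):
    """Probability the model assigns to the literally true next token."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    return torch.softmax(out.logits[0, -1], dim=-1)[true_id].item()

# High probability under the honest prompt and low (but still informative)
# behavior under the dishonest prompt would support both capabilities being
# latent in the same network.
print("P(' Paris' | honest prompt)    =", prob_of_true_answer(honest_prompt))
print("P(' Paris' | dishonest prompt) =", prob_of_true_answer(dishonest_prompt))
```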

Those conditions would help build a case that there might be alignment-relevant capabilities for dishonesty latent in the network. I believe that rigorously showing a single model has both of these capabilities would be a useful first step by itself. The impression I've gotten is that practitioners who work with large language models may consider this step unnecessary, since they have already encountered such behavior in the wild, but demonstrating it with more rigor should make the case stronger.

In such cases, we want methods for understanding how the network distinguishes between the two cases and operates differently in each. That may take the form of building a classifier that distinguishes between them based on representations within the network, in a way that is robust to changes in prompting. It might also take the form of looking for evidence that the network in some way activates representations of the honest answer in the process of giving dishonest answers (say, through a logit lens probe looking at the intermediate-layer weight assigned to the honest next token). In particular, if techniques like these help us build a more gears-level picture of what happens to the representations and internal mechanics of the network, that would be very valuable.
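To illustrate the logit-lens variant of that idea, here is a hedged sketch that reads off the rank of the honest next token at each layer. GPT-Neo attribute names like transformer.ln_f and lm_head, the checkpoint, and the question are assumptions; other architectures name these components differently.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-125M"  # assumption for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Q: What is the capital of France?\nA: The capital of France is"
# Assumes " Paris" is a single token, which holds for the GPT-2 BPE vocabulary.
honest_id = tok(" Paris", add_special_tokens=False).input_ids[0]

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    # Project each layer's hidden state at the final position through the
    # model's own final layer norm and unembedding (the "logit lens").
    # (The last tuple entry is already post-ln_f, so the norm is applied
    # twice there; acceptable for a rough look.)
    ln_f, lm_head = model.transformer.ln_f, model.lm_head
    for layer, h in enumerate(out.hidden_states):
        logits = lm_head(ln_f(h[:, -1, :]))[0]
        rank = int((logits > logits[honest_id]).sum()) + 1
        print(f"layer {layer:2d}: rank of the honest next token = {rank}")
```

Comparing this per-layer trajectory between honest and dishonest completions of the same question is one way to look for the honest answer being represented internally and then suppressed near the output.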