felixbinder / introspection_self_prediction

Code for experiments on self-prediction as a way to measure introspection in LLMs

Getting logprobs on property_extraction responses #22

Closed rajashree-agrawal closed 6 months ago

rajashree-agrawal commented 7 months ago

@felixbinder is there a good way to forward logprobs overrides down the pipeline, so that https://github.com/felixbinder/introspection_self_prediction_astra/pull/21 will work with sweep_object_and_meta_levels.py? The override would need to pass through

- https://github.com/felixbinder/introspection_self_prediction_astra/blob/e43cbadecccd319d7aab61953b6e7815d5dbf38d/scripts/sweep_object_and_meta_levels.py#L83
- https://github.com/felixbinder/introspection_self_prediction_astra/blob/e43cbadecccd319d7aab61953b6e7815d5dbf38d/evals/run_meta_level.py#L241
- https://github.com/felixbinder/introspection_self_prediction_astra/blob/e43cbadecccd319d7aab61953b6e7815d5dbf38d/evals/generate_few_shot.py#L116
- https://github.com/felixbinder/introspection_self_prediction_astra/blob/e43cbadecccd319d7aab61953b6e7815d5dbf38d/evals/run_property_extraction.py#L72

felixbinder commented 7 months ago

Wait, what are you trying to achieve here? You want to change the log probs that get queried for the property extraction model? You could change the config file for property extraction.
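That is, something like this in the language_model block (a sketch from memory; the exact surrounding keys are assumptions):

    # evals/conf/config_property_extraction.yaml (sketch; surrounding keys assumed)
    language_model:
      logprobs: 5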

rajashree-agrawal commented 7 months ago

Yeah, I want to pass logprobs=5 to the property extraction for the jailbreaks/harmbench eval, which I'm currently running as

python -m scripts.sweep_object_and_meta_levels \
        --study_name="jailbreaks" \
        --model_configs="gpt-3.5-turbo" \
        --task_configs="harmbench" \
        --response_property_configs="jailbroken" \
        --overrides="limit=1, strings_path=none, language_model.logprobs=5, +response_property.exclusion_rule_groups=[], +compliance_checks.exclusion_rule_groups=[]" \
        --meta_overrides="prompt=meta_level/really_minimal"

Are you suggesting that I just change https://github.com/felixbinder/introspection_self_prediction_astra/blob/e43cbadecccd319d7aab61953b6e7815d5dbf38d/evals/conf/config_property_extraction.yaml#L22 to

  logprobs: 5

globally?

felixbinder commented 7 months ago

Okay, you want logprobs for the extracted properties. I'm assuming that you want to use the extracted-property logprobs at some point down the line? Just to be clear, the logprobs you would get that way will always be from GPT-3.5-Turbo, since that is the model doing the property extraction based on the object-level behavior of whatever model is being studied. If it's important that the property extraction provides logprobs in this case, we can change the codebase to support it.

Could you explain what the processing step based on the extracted properties is? I assumed that it would just be "is/isn't jailbroken", for which we don't need logprobs?

You might be able to put the override for the logprobs into the response property config jailbroken as language_model.logprobs=5, which should work if the config files are read in the right order. If not, it will get overridden later and you will need to manually percolate it through the pipeline.
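For concreteness, the idea would be something like this in the jailbroken response property config (a sketch; the file path and surrounding keys are assumptions):

    # evals/conf/response_property/jailbroken.yaml (path and keys assumed)
    # ...existing jailbroken response-property settings...
    language_model:
      logprobs: 5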

felixbinder commented 7 months ago

> Yeah, I want to pass logprobs=5 to the property extraction for the jailbreaks/harmbench eval, which I'm currently running as
>
>     python -m scripts.sweep_object_and_meta_levels \
>             --study_name="jailbreaks" \
>             --model_configs="gpt-3.5-turbo" \
>             --task_configs="harmbench" \
>             --response_property_configs="jailbroken" \
>             --overrides="limit=1, strings_path=none, language_model.logprobs=5, +response_property.exclusion_rule_groups=[], +compliance_checks.exclusion_rule_groups=[]" \
>             --meta_overrides="prompt=meta_level/really_minimal"
>
> Are you suggesting that I just change https://github.com/felixbinder/introspection_self_prediction_astra/blob/e43cbadecccd319d7aab61953b6e7815d5dbf38d/evals/conf/config_property_extraction.yaml#L22 to
>
>       logprobs: 5
>
> globally?

Changing it globally in combination with #21 should work, but for most property extractions we don't need logprobs, so we should be frugal and only enable them where we actually need them.

rajashree-agrawal commented 7 months ago

> Could you explain what the processing step based on the extracted properties is? I assumed that it would just be "is/isn't jailbroken", for which we don't need logprobs?

Sometimes the classifier doesn't give a straightforward "Answer: yes" / "Answer: no" response, but we can still get a good classification using logprobs. I want to use https://github.com/felixbinder/introspection_self_prediction_astra/pull/28 to do this.
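The rough idea is something like this (a minimal sketch, not the actual code in that PR; the token handling is an assumption):

    # Sketch: recover a yes/no classification from the top-k logprobs of the
    # answer token when the sampled text is not a clean "Answer: yes"/"Answer: no".
    import math

    def classify_from_logprobs(top_logprobs: dict) -> bool | None:
        """top_logprobs maps candidate tokens to logprobs at the answer position."""
        p_yes = sum(math.exp(lp) for tok, lp in top_logprobs.items()
                    if tok.strip().lower() == "yes")
        p_no = sum(math.exp(lp) for tok, lp in top_logprobs.items()
                   if tok.strip().lower() == "no")
        if p_yes == 0.0 and p_no == 0.0:
            return None  # neither token in the top-k; fall back to string matching
        return p_yes > p_no

    # e.g. classify_from_logprobs({" yes": -0.11, " no": -2.30, " I": -4.6}) -> True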

rajashree-agrawal commented 7 months ago

> Just to be clear, the logprobs you would get that way will always be from GPT-3.5-Turbo, since that is the model doing the property extraction based on the object-level behavior of whatever model is being studied. If it's important that the property extraction provides logprobs in this case, we can change the codebase to support it.

Can this be changed? I want to be able to configure the model used in the classifier.

> You might be able to put the override for the logprobs into the response property config jailbroken as language_model.logprobs=5, which should work if the config files are read in the right order. If not, it will get overridden later and you will need to manually percolate it through the pipeline.

I was not able to get this working. Where/how does https://github.com/felixbinder/introspection_self_prediction_astra/blob/main/evals/run_property_extraction.py load the logged config files?

Is there any chance you could implement the config percolation so that, e.g., --overrides="+response_property.language_model.logprobs=5, +response_property.language_model.model=gpt-4-turbo" (or similar) works as expected in https://github.com/felixbinder/introspection_self_prediction_astra/tree/jailbreak-eval at https://github.com/felixbinder/introspection_self_prediction_astra/blob/93bfec5ac55ac28092dc0d2d542f7471fd6cba89/scripts/jailbreak_eval.sh#L5-L16?

felixbinder commented 7 months ago

The model that is being used is defined here:

    - language_model: gpt-3.5-turbo

An override for that would need to be passed to the subprocess call to run_property_extraction.
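Schematically, the forwarding would look something like this (a hypothetical sketch; the wrapper and its arguments are made up, only the evals.run_property_extraction module is real):

    # Hypothetical sketch of forwarding Hydra-style overrides to the
    # property-extraction subprocess; the wrapper itself is illustrative.
    import subprocess

    def run_property_extraction_with_overrides(extra_overrides):
        cmd = ["python", "-m", "evals.run_property_extraction", *extra_overrides]
        subprocess.run(cmd, check=True)

    run_property_extraction_with_overrides(
        ["language_model=gpt-4-turbo", "language_model.logprobs=5"]
    )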

Why do you want to be able to change the model that does the property extraction? The aim of property extraction is to pull out a common-sense property of the object-level response. That should not be sensitive to the model doing the extraction; rather, we want the property extraction prompt and the meta-level response property prompt to be constructed such that the behavior is clear from its description (because if it's not, then we can't really look for generalization even if the model can be fine-tuned on the task).

For jailbroken, all we need to know is whether or not the model did the bad thing that was asked of it. Or am I misunderstanding something here?

rajashree-agrawal commented 7 months ago

There are two versions of property extraction: one where we check what the model itself thinks the ground truth is (3.5 extracting for 3.5, 4 for 4), and one where we, as humans, trust the ground-truth source more (4 extracting for 3.5, 4 for 4).
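In other words, something like this mapping (a sketch; the policy names are made up):

    # Sketch of the two extraction-model policies (names illustrative):
    # "self": the object-level model judges its own responses;
    # "trusted": a stronger model we trust more does the judging.
    EXTRACTION_MODEL = {
        "self":    {"gpt-3.5-turbo": "gpt-3.5-turbo", "gpt-4": "gpt-4"},
        "trusted": {"gpt-3.5-turbo": "gpt-4",         "gpt-4": "gpt-4"},
    }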