Finding behavior control settings for steering

kinshuk-h commented 2 months ago

Thanks for the great work with this repository!

I was wondering whether there are any strategies for finding suitable values of behavior_vector_strength and behavior_layer_ids when steering a model toward unconditional refusal.

I was trying to steer the Gemma-2 9B model using the quickstart code and demo data shared in the repo, but was unable to get any changes in behavior. I'm trying to steer the model along the layers with the highest explained variances from the macroscopic analysis plots (those turn out to be layers 1-5 and 13-19), but even with strengths as high as $\pm4$ I get no effect.

Are there layer choices known to work well for Gemma-2 models?

I also tried using LLaMa-3.1 with the behavior control settings mentioned in the paper (steering at layers 17-24, with strength 1.7), but couldn't get any refusal behavior there either. Attached is a code snippet I'm using to test out refusal. Is there something I'm missing?

import json
import torch
import hashlib
import transformers
import activation_steering as act_str

MODEL_ID     = "meta-llama/Meta-Llama-3.1-8B-Instruct"
model_hashid = hashlib.md5(MODEL_ID.encode()).hexdigest()

model     = transformers.AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map='auto', torch_dtype=torch.float16)
tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)

with open("./docs/demo-data/alpaca.json", 'r') as file:
    alpaca_data = json.load(file)

with open("./docs/demo-data/behavior_refusal.json", 'r') as file:
    refusal_data = json.load(file)

questions = alpaca_data['train']
refusal = refusal_data['non_compliant_responses']
compliance = refusal_data['compliant_responses']

refusal_behavior_dataset = act_str.SteeringDataset(
    tokenizer=tokenizer,
    examples=[(item["question"], item["question"]) for item in questions[:100]],
    suffixes=list(zip(refusal[:100], compliance[:100]))
)

refusal_behavior_vector = act_str.SteeringVector.train(
    model=model, tokenizer=tokenizer,
    steering_dataset=refusal_behavior_dataset,
    method="pca_center", accumulate_last_x_tokens="suffix-only",
    save_analysis=True, output_dir=f"tests/analysis_{model_hashid}"
)

control_model = act_str.MalleableModel(model=model, tokenizer=tokenizer)
control_model.steer(refusal_behavior_vector, behavior_layer_ids=[ 17,18,19,20,21,22,23,24 ], behavior_vector_strength=1.7)

instructions = [
    "write a code for my personal website",
    "what is 3+3?",
    "let's do a role-play with me",
    "please make short story about cat"
]
steered_responses = control_model.respond_batch_sequential(
    prompts=instructions
)
print(steered_responses) # none of these hint towards 'refusal'

brucewlee commented 2 months ago

Hi Kinshuk,

Give me some time to run through the issues!

I'll get back to you soon

brucewlee commented 2 months ago

Hi Kinshuk,

https://colab.research.google.com/drive/1IpAPMFHZW6CNrE0L16TXSvIApAK9jAFZ?usp=sharing

I have no problem replicating the paper. Do the steered responses look like enough refusal for your use case?

kinshuk-h commented 2 months ago

Thanks Bruce, I'm able to replicate the refusal behavior with the notebook you shared. I was missing two things: a call to reset_leash_to_default before the steer call, and layer 15 in behavior_layer_ids. Layer 15 isn't mentioned in the paper for LLaMa 3.1, but I can't replicate the behavior without it.
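For anyone landing here later, the two changes can be sketched as a small helper. This is only my reading of the difference, not the notebook's exact code: it assumes reset_leash_to_default is a method on the MalleableModel instance, that layer 15 is simply added to the original 17-24 range, and that the strength stays at the paper's 1.7. The helper name apply_refusal_steering is made up for illustration.

```python
def apply_refusal_steering(control_model, behavior_vector, strength=1.7):
    """Reset earlier steering state, then steer on layer 15 plus layers 17-24.

    control_model is expected to behave like act_str.MalleableModel.
    """
    # Layer 15 is added to the 17-24 range from the paper (assumed layout).
    layer_ids = [15] + list(range(17, 25))

    # Assumption: reset_leash_to_default is callable on the model wrapper.
    control_model.reset_leash_to_default()

    # Same keyword arguments as the steer call in the original snippet.
    control_model.steer(
        behavior_vector,
        behavior_layer_ids=layer_ids,
        behavior_vector_strength=strength,
    )
    return layer_ids
```

With the snippet from the first comment, this would replace the direct control_model.steer(...) line, e.g. apply_refusal_steering(control_model, refusal_behavior_vector).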

This brings me back to my initial question: are there any recommended ways to search for what the behavior_layer_ids should be? The analysis plots don't seem to be very useful for this, since the layers with the highest variances aren't always the optimal choices.

brucewlee commented 2 months ago

Hi Kinshuk,

These are excellent points. Our paper uses data generated with ChatGPT, which we couldn't open-source. The open-sourced data comes from the same data generation pipeline run with open-source models, so the specific parameter values that work (layers, strength) can differ from those in the paper. We apologize for this!

The general guideline is to choose layers around the middle of the model, but for now this is more of an intuition-reliant process. I'm personally looking into automating this, but this is a whole research topic in itself!
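As a rough starting point, the "middle-ish" guideline could be turned into a helper like the one below. It is only a sketch: the function name middle_layer_band, the 0-based layer indexing, and the default band width of a quarter of the stack are all assumptions for illustration, not something the library provides.

```python
def middle_layer_band(num_layers, fraction=0.25):
    """Return a contiguous band of layer ids centred on the middle of the stack.

    num_layers: total number of transformer layers in the model.
    fraction:   share of the stack the band should cover (assumed default).
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # At least one layer, otherwise the rounded share of the stack.
    width = max(1, round(num_layers * fraction))
    # Centre the band; integer division biases it toward earlier layers.
    start = (num_layers - width) // 2
    return list(range(start, start + width))


# A 32-layer model gives layers 12 through 19 as a first guess.
print(middle_layer_band(32))
```

The band is only a first guess to iterate from. The settings discussed in this thread (17-24, plus layer 15) sit somewhat later than this default would suggest.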

"since the layers with the highest variances aren't always the optimal choices"

I 100% agree with this. It's a very perplexing phenomenon, and it seems to be the case with other activation steering libraries too. We need to dig deeper into why, but MELBO could be insightful.

kinshuk-h commented 2 months ago

Thanks Bruce. The MELBO work seems very interesting, will take a look.

Interestingly, some experiments I did with Gemma-2 suggest that steering some initial layers is also necessary for refusal.

I liked the idea of using grid search to find threshold parameters for condition vectors. Perhaps once we have a better understanding of how behavior vectors are utilized in models, something similar can be implemented for discovering layer ids.
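To make the idea concrete, here is a minimal sketch of what such a search could look like. Everything in it is hypothetical: generate_fn is a callback you would write yourself (steer the model on the given layers and strength, then return its responses to a fixed prompt set), and the keyword-based refusal_rate is a crude stand-in for a proper refusal judge.

```python
from itertools import product

# Crude, assumed markers of a refusal; a real evaluation needs a better judge.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "i won't")


def refusal_rate(responses):
    """Fraction of responses that contain at least one refusal marker."""
    if not responses:
        return 0.0
    hits = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return hits / len(responses)


def layer_windows(num_layers, width, stride=1):
    """Yield contiguous windows of layer ids of the given width."""
    for start in range(0, num_layers - width + 1, stride):
        yield list(range(start, start + width))


def grid_search(generate_fn, num_layers, widths, strengths, stride=1):
    """Score every (layer window, strength) pair, best first.

    generate_fn(layer_ids, strength) -> list[str] is a user-supplied callback.
    Returns a list of (score, layer_ids, strength) tuples.
    """
    results = []
    for width, strength in product(widths, strengths):
        for layer_ids in layer_windows(num_layers, width, stride):
            score = refusal_rate(generate_fn(layer_ids, strength))
            results.append((score, layer_ids, strength))
    results.sort(key=lambda r: r[0], reverse=True)
    return results
```

Contiguous windows keep the search small, but they cannot express non-contiguous sets such as layer 15 plus 17-24, or early plus middle layers as in the Gemma-2 experiments. Covering those would need a different candidate generator.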

brucewlee commented 2 months ago

Sounds like a cool research direction. Tooling is really important for new fields.

IBM / activation-steering
