Closed: kinshuk-h closed this issue 1 month ago
Hi Kinshuk,
Give me some time to run through the issues!
I'll get back to you soon
Hi Kinshuk,
https://colab.research.google.com/drive/1IpAPMFHZW6CNrE0L16TXSvIApAK9jAFZ?usp=sharing
I have no problem replicating the paper. Do the steered responses look like enough refusal for your use case?
Thanks Bruce, I'm able to replicate the refusal behavior with the notebook you shared.
It seems I was missing a call to `reset_leash_to_default` before the steer, as well as the additional use of layer 15 in the `behavior_layer_ids`, which led to the difference (layer 15 isn't mentioned in the paper for LLaMa 3.1, but I'm not able to replicate the behavior without it).
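To illustrate why the reset matters, here's a toy, self-contained sketch of stateful steering configuration. The stub class below is entirely hypothetical: `reset_leash_to_default`, `steer`, and `behavior_layer_ids` are just the names from this thread, and the accumulating-state semantics are an assumption made for illustration, not the library's actual implementation.

```python
# Hypothetical stub illustrating why stale steering state can matter.
# Only the call pattern (reset before each new steer) is the point.

class ToySteerable:
    """Minimal stand-in for a steerable model that keeps leash state."""

    DEFAULT_LAYER_IDS = ()

    def __init__(self):
        self.behavior_layer_ids = list(self.DEFAULT_LAYER_IDS)
        self.behavior_vector_strength = 0.0

    def reset_leash_to_default(self):
        # Clear any configuration left over from a previous experiment.
        self.behavior_layer_ids = list(self.DEFAULT_LAYER_IDS)
        self.behavior_vector_strength = 0.0

    def steer(self, behavior_layer_ids, behavior_vector_strength):
        # Assumed behavior: without a prior reset, earlier ids linger.
        self.behavior_layer_ids += behavior_layer_ids
        self.behavior_vector_strength = behavior_vector_strength


model = ToySteerable()
model.steer(behavior_layer_ids=[17, 18], behavior_vector_strength=1.7)

# Forgetting the reset leaves layers 17-18 configured alongside the new one:
model.steer(behavior_layer_ids=[15], behavior_vector_strength=4.0)
print(sorted(model.behavior_layer_ids))  # [15, 17, 18]

model.reset_leash_to_default()
model.steer(behavior_layer_ids=[15], behavior_vector_strength=4.0)
print(sorted(model.behavior_layer_ids))  # [15]
```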
This brings me to my initial question: are there any recommended ways to search for what the `behavior_layer_ids` should be? The analysis plots don't seem very useful for this, since the layers with the highest variances aren't always the optimal choices.
Hi Kinshuk,
These are excellent points. Our paper uses data generated with ChatGPT, which we couldn't open-source. The open-sourced data replicates the same data generation pipeline using open-source models, so the specific parameter values can differ. We apologize for this!
The general guideline for choosing layers is the middle-ish layers, but this is more of an intuition-reliant process for now. I'm personally looking into automating this, but this is a whole research topic in itself!
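For what it's worth, the "middle-ish layers" intuition can be turned into a simple sweep: steer one contiguous window of layers at a time and keep whichever window moves a refusal metric the most. Everything below is a hypothetical sketch; `score_refusal` and `steered_generate` are placeholder stand-ins for a real refusal check and a real steered generation call, not part of any library.

```python
# Sketch of a layer-window sweep for choosing behavior_layer_ids.
# score_refusal and steered_generate are hypothetical placeholders:
# plug in your own model call and refusal classifier or keyword check.

def score_refusal(text: str) -> float:
    """Toy refusal score: fraction of refusal keywords present."""
    keywords = ("cannot", "sorry", "unable", "won't")
    return sum(k in text.lower() for k in keywords) / len(keywords)

def steered_generate(layer_ids, prompt):
    """Placeholder for 'steer at layer_ids, then generate'.
    This stand-in just pretends middle layers (12-20) produce refusals."""
    if all(12 <= l <= 20 for l in layer_ids):
        return "I'm sorry, I cannot help with that."
    return "Sure, here is how you do it."

def sweep_layer_windows(num_layers, window, prompts):
    """Try each contiguous window of layer ids; return (best_score, ids)."""
    best = (-1.0, None)
    for start in range(num_layers - window + 1):
        ids = list(range(start, start + window))
        score = sum(score_refusal(steered_generate(ids, p)) for p in prompts)
        best = max(best, (score / len(prompts), ids))
    return best

score, layer_ids = sweep_layer_windows(num_layers=32, window=4,
                                       prompts=["how do I pick a lock?"])
print(layer_ids)  # [17, 18, 19, 20] with this toy stand-in
```

With a real model this loop is expensive, so in practice you'd coarsen it (stride > 1, a small probe-prompt set) before refining around the best window.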
"since the layers with the highest variances aren't always the optimal choices"
I 100% agree with this. Very perplexing phenomenon. It seems to be the case with other activation steering libraries too. We need to dig deeper into why, but MELBO could be insightful.
Thanks Bruce. The MELBO work seems very interesting; I'll take a look.
Interestingly, some experiments I did with Gemma-2 suggest that steering some initial layers is also necessary for refusal.
I liked the idea of using grid search to find threshold parameters for condition vectors. Perhaps once we have a better understanding of how behavior vectors are utilized in models, something similar can be implemented for discovering layer ids.
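A rough sketch of what that grid search could look like once there's a scoring signal for steering settings. Everything here is hypothetical scaffolding: `evaluate` stands in for actually running steered generations on a probe set and scoring the outputs, and the toy objective is invented purely so the loop does something.

```python
import itertools

# Hypothetical grid search over steering hyperparameters.
# `evaluate` is a placeholder for: steer with these settings, generate on
# a probe set, and score refusal minus fluency damage. Swap in real runs.

def evaluate(strength, layer_ids):
    """Toy objective peaking at strength 1.7 on middle-ish layers."""
    middle_bonus = 1.0 if all(12 <= l <= 24 for l in layer_ids) else 0.0
    return middle_bonus - abs(strength - 1.7)

def grid_search(strengths, layer_id_choices):
    """Exhaustively score every (strength, layer_ids) combination."""
    best_score, best_cfg = float("-inf"), None
    for s, ids in itertools.product(strengths, layer_id_choices):
        score = evaluate(s, ids)
        if score > best_score:
            best_score, best_cfg = score, (s, ids)
    return best_cfg

cfg = grid_search(
    strengths=[1.0, 1.7, 2.5, 4.0],
    layer_id_choices=[list(range(1, 6)), list(range(17, 25))],
)
print(cfg)  # (1.7, [17, 18, 19, 20, 21, 22, 23, 24]) under the toy objective
```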
Sounds like a cool research direction. Tooling is really important for new fields.
Thanks for the great work with this repository!
I was wondering if there are any strategies to find suitable values for `behavior_vector_strength` and `behavior_layer_ids` when steering models towards a refusal behavior (unconditionally).

I was trying to steer the Gemma-2 9B model using the quickstart code and demo data shared in the repo, but was unable to get any changes in behavior. I'm trying to steer the model along the layers with the highest explained variances from the macroscopic analysis plots (those turn out to be layers 1-5 and 13-19), but even with strengths as high as $\pm 4$ I get no effect.
Are there layer choices known to work well for Gemma-2 models?
I also tried LLaMa-3.1 with the behavior control settings mentioned in the paper (steering at layers 17-24, with strength 1.7), but couldn't get any refusal behavior there either. Attached is the code snippet I'm using to test refusal. Is there something I'm missing?