andyzoujm / representation-engineering

Representation Engineering: A Top-Down Approach to AI Transparency
https://www.ai-transparency.org/
MIT License
716 stars 86 forks

Enhancing RepControl by introducing the pca_model's `explained_variance_ratio_` #22

Open semicircle opened 11 months ago

semicircle commented 11 months ago

Currently, after training the rep_reader, the coeff variable used in the control pipeline has to be tuned by hand through experiment, and the value varies a lot from model to model. Taking primary_emotions as an example, here are the values I found:

# LLaMA-2-Chat-13B coeff=3.0-3.5
# mistralai/Mistral-7B-Instruct-v0.1 coeff=0.5
# HuggingFaceH4/zephyr-7b-beta coeff=0.3
# openchat/openchat_3.5 coeff=0.2

This makes it challenging for RepControl to adapt to new models.

My finding is that incorporating the pca_model's explained_variance_ratio_ into the control process makes the manipulation "gentler" and more "accurate".

Here are the key modifications, in rep_readers.py:

def get_rep_directions(self, model, tokenizer, hidden_states, hidden_layers, **kwargs):
        """Get PCA components for each layer"""
        directions = {}

        # like directions, save the variance ratio for each layer
        variance_ratio = {}

        for layer in hidden_layers:

            ...

            self.n_components = pca_model.n_components_
            variance_ratio[layer] = pca_model.explained_variance_ratio_

        self.variance_ratio = variance_ratio
        return directions

Each layer's variance_ratio indicates how much of the variance in the hidden states that layer's PCA direction actually explains, which can be interpreted as a 'confidence' score for that direction in the control step.
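A toy illustration of this interpretation (not repo code; the fake data and shapes are assumptions): sklearn's PCA exposes `explained_variance_ratio_`, the fraction of total variance captured by each component. When the hidden-state differences for a layer vary mostly along one direction, the first component's ratio is close to 1, i.e. the reader's direction for that layer is "confident":

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Fake "hidden state differences" for one layer: strong variation
# along a single direction plus small isotropic noise.
base = rng.normal(size=(256, 1)) @ rng.normal(size=(1, 64))
noise = 0.1 * rng.normal(size=(256, 64))
hidden_diffs = base + noise

pca_model = PCA(n_components=1).fit(hidden_diffs)
ratio = pca_model.explained_variance_ratio_[0]
print(round(ratio, 3))  # close to 1.0 -> a high-confidence direction
```

A layer whose hidden states are spread more evenly across directions would instead get a low ratio, and the scheme above would correspondingly damp its activation.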

So, when manipulating the output, the activation variable is calculated as:

coeff=0.2
coeff_with_variance = 2.0

activations = {}
activations_with_variance = {}

for layer in layer_id:
    activations[layer] = torch.tensor(coeff * rep_reader.directions[layer] * rep_reader.direction_signs[layer]).to(model.device).half()

    variance_ratio = rep_reader.variance_ratio[layer][0]
    # print(variance_ratio)
    activations_with_variance[layer] = torch.tensor(coeff_with_variance * rep_reader.directions[layer] * rep_reader.direction_signs[layer] * variance_ratio).to(model.device).half()

Applying this method seems to allow all the 7B models I've tested to adopt a common coeff value, approximately 2.0.

As for the reasoning: I came up with this idea when I saw that WrappedBlock applies the controller (activations) to the tensor with a simple linear operation, so I folded the variance_ratio in in the simplest possible way. Extracting the PCA model's underlying singular vectors might give even better control.
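One hypothetical way to act on that last suggestion (an assumption, not repo code): instead of the variance *ratio*, scale each layer's unit direction by its normalized first singular value, which reflects the absolute spread of the data along that direction rather than its relative share of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Stand-in for one layer's stacked hidden-state differences.
hidden_diffs = rng.normal(size=(128, 32))

pca_model = PCA(n_components=1).fit(hidden_diffs)
direction = pca_model.components_[0]   # unit-norm first PCA direction
sigma = pca_model.singular_values_[0]  # spread of the data along it

coeff = 2.0
# Divide sigma by sqrt(n_samples) so the scale behaves like a
# per-sample standard deviation along the direction.
activation = coeff * direction * (sigma / np.sqrt(len(hidden_diffs)))
print(activation.shape)  # (32,)
```

Whether this beats the variance-ratio scaling would need the same kind of per-model experiments as above.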

Thanks for sharing this great work!

andyzoujm commented 11 months ago

Hi,

Thanks for putting this together. Seems practically very useful. Feel free to open a PR if you'd like to integrate this into the library.

Best, Andy

semicircle commented 10 months ago

Hi,

Some updates on this:

Adding this ratio to the activation doesn't guarantee totally stable control. The coeff still has to be revised to suit the prompt. Take 'anger' emotion control as an example: a happy scenario may need a larger coeff than a neutral one to make the response look angry. It seems the activation needs to be adjusted accordingly.

I have noticed the newly added piecewise_linear operator, and I am trying to add some code alongside it to implement this feature.
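For illustration only (this is not the repo's piecewise_linear operator; the knot values and the idea of using a per-prompt reading score are assumptions): one way to make coeff prompt-dependent is a piecewise-linear map from some baseline reading score to a control strength, e.g. a strongly "happy" baseline gets a larger anger coeff than a neutral one:

```python
import numpy as np

# Hand-picked knots (assumed values): score -1.0 = very happy baseline,
# 0.0 = neutral, 1.0 = already angry.
scores = np.array([-1.0, 0.0, 1.0])
coeffs = np.array([3.0, 2.0, 1.0])

def coeff_for(score: float) -> float:
    """Piecewise-linear interpolation of coeff over the reading score."""
    return float(np.interp(score, scores, coeffs))

print(coeff_for(-0.5))  # 2.5
```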

Thanks~