andyzoujm / representation-engineering

Representation Engineering: A Top-Down Approach to AI Transparency
https://www.ai-transparency.org/
MIT License

LOSS does not need the original; LOSS calibrates the difference between +/-? #49

Closed YerongLi closed 1 month ago

YerongLi commented 2 months ago

The loss function in the training flow should be minimizing the difference between the positive and negative hidden states. We don't need the original activations, right?

So there is no reason to keep the original hidden states; the target could just be `alpha * direction_hidden[i] for i in range(len(target_layers))`.

https://github.com/andyzoujm/representation-engineering/blame/c6394f8291a1e5914d77440a85f16823fc68f2dc/lorra_finetune/src/llama2_lorra.py#L96

https://github.com/andyzoujm/representation-engineering/blame/c6394f8291a1e5914d77440a85f16823fc68f2dc/lorra_finetune/src/llama2_lorra.py#L84
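
For context, here is a minimal sketch of the target construction being discussed (my reading of the linked lines, not the repo's exact code; variable names follow the question). The linked code keeps the original activations as the baseline and shifts them along the +/- direction, while the question proposes dropping the `orig_hidden` term:

```python
import torch
import torch.nn.functional as F

def lorra_loss(lora_hidden, orig_hidden, direction_hidden, target_layers, alpha):
    """Sketch of the loss shape under discussion; names follow the question, not the repo."""
    loss = torch.tensor(0.0)
    for i in range(len(target_layers)):
        # What the linked code appears to do: keep the original activations as the
        # baseline and shift them along the +/- direction by alpha.
        target = (orig_hidden[i] + alpha * direction_hidden[i]).detach()
        # What the question proposes instead (dropping orig_hidden):
        # target = (alpha * direction_hidden[i]).detach()
        loss = loss + F.mse_loss(lora_hidden[i], target)
    return loss
```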

andyzoujm commented 2 months ago

Note that the original hidden states are different from the current LoRA hidden states because of the adapters.
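
In other words (my paraphrase, assuming a Hugging Face model wrapped with PEFT LoRA): `orig_hidden` comes from a forward pass with the adapters disabled, while the hidden states the loss is applied to come from a pass with the adapters active. The two diverge as training proceeds, so `orig_hidden` is not redundant; it anchors the target that the direction is added to. A minimal sketch of that distinction:

```python
import torch

def get_hidden_states(model, batch, target_layers):
    """Sketch: collect hidden states at the target layers for one forward pass."""
    out = model(**batch, output_hidden_states=True)
    return [out.hidden_states[i] for i in target_layers]

def baseline_and_lora_hidden(model, batch, target_layers):
    # Base-model activations: adapters switched off, no gradients needed.
    # (Assumes a PEFT LoRA model, which exposes disable_adapter() as a context manager.)
    with torch.no_grad(), model.disable_adapter():
        orig_hidden = get_hidden_states(model, batch, target_layers)

    # Current activations: adapters active, gradients flow into the LoRA weights.
    lora_hidden = get_hidden_states(model, batch, target_layers)
    return orig_hidden, lora_hidden
```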