The loss function in the training flow should be minimizing the difference between the positive and negative hidden states. We don't need the original activations, right?
So there is no reason to keep the original hidden states:
`alpha * direction_hidden[i] for i in range(len(target_layers))`
https://github.com/andyzoujm/representation-engineering/blame/c6394f8291a1e5914d77440a85f16823fc68f2dc/lorra_finetune/src/llama2_lorra.py#L96
https://github.com/andyzoujm/representation-engineering/blame/c6394f8291a1e5914d77440a85f16823fc68f2dc/lorra_finetune/src/llama2_lorra.py#L84
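For context, here is a minimal sketch of how I read the loss at the linked lines. The function name `lorra_target_loss`, the argument names, and the plain-list representation are all my own (the repo uses torch tensors); the key assumption is that the training target is `orig_hidden + alpha * direction_hidden`, which is exactly where the original activations get used:

```python
def lorra_target_loss(lora_hidden, orig_hidden, direction_hidden, alpha):
    """Per-layer MSE between the tuned hidden states and the steered target.

    Each argument is a list (one entry per target layer) of flat lists of
    floats; real code would use torch tensors. All names here are my own
    reading of the linked llama2_lorra.py, not its exact identifiers.
    """
    # Target from the linked line: original activation plus a scaled
    # steering direction -- this is the only place orig_hidden appears.
    targets = [
        [o + alpha * d for o, d in zip(orig_layer, dir_layer)]
        for orig_layer, dir_layer in zip(orig_hidden, direction_hidden)
    ]
    # The question above amounts to: could the target just be
    # alpha * direction_hidden, dropping orig_hidden entirely?
    total, count = 0.0, 0
    for lora_layer, target_layer in zip(lora_hidden, targets):
        for lora_val, target_val in zip(lora_layer, target_layer):
            total += (lora_val - target_val) ** 2
            count += 1
    return total / count
```

If the target really is anchored at the original activations, then dropping `orig_hidden` would change what the tuned model converges to, so the answer to the question hinges on whether that anchoring is intentional.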