SafeAILab / RAIN

[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning
https://arxiv.org/abs/2309.07124
BSD 2-Clause "Simplified" License

robustness evaluation #6

Closed: yugnaw closed this issue 3 months ago

yugnaw commented 3 months ago

There seems to be no white-box attack in the code. The only related item is a dataset named 7B_behavior_llama2_c.json in the adv folder, which seems to be for a transfer attack evaluation, and this file has not been uploaded. @hongyanz @Liyuhui-12

yugnaw commented 3 months ago

Sorry, I didn't read the README carefully; it already explains how to evaluate under the white-box setting.