SchwinnL / circuit-breakers-eval

Independent robustness evaluation of Improving Alignment and Robustness with Short Circuiting

RR model release #1

Closed by SophieZheng998 3 weeks ago

SophieZheng998 commented 3 weeks ago

This is very interesting work and an important attack baseline for our future work on building robust defenses. Regarding the models at /ceph/ssd/staff/schwinn/models/Mistral-7B-Instruct-RR and the Llama 3 version of RR: will you release them, if possible?

SchwinnL commented 3 weeks ago

Thank you for your kind words! The models were not made by us but by the authors of the original paper, who deserve all the credit for the defence. You can find their models on HF: https://huggingface.co/collections/GraySwanAI/model-with-circuit-breakers-668ca12763d1bc005b8b2ac3 Here is a link to their GitHub: https://github.com/GraySwanAI/circuit-breakers

We only provide a stronger attack implementation, built on their repository, that can break their models.

SophieZheng998 commented 3 weeks ago

Thanks!