SophieZheng998 closed this 3 weeks ago
Thank you for your kind words! The models were not made by us but by the authors of the original paper, who deserve all the credit for the defence. You can find their models on HF: https://huggingface.co/collections/GraySwanAI/model-with-circuit-breakers-668ca12763d1bc005b8b2ac3
Here is a link to their GitHub: https://github.com/GraySwanAI/circuit-breakers
We only provide a stronger attack implementation, built on their repository, that can break their models.
Thanks!
This is very interesting work and an important attack baseline for our future work on building robust defenses. Regarding the models at /ceph/ssd/staff/schwinn/models/Mistral-7B-Instruct-RR, and the Llama 3 version of RR: will you release them, if possible?