Open domenicrosati opened 4 months ago
Rainbow Teaming -> https://arxiv.org/abs/2402.16822
Tree of Attacks with Pruning -> https://arxiv.org/abs/2312.02119 RL based attack -> https://arxiv.org/abs/2406.08705
Use RepNoise defended model: https://github.com/domenicrosati/representation-noising?tab=readme-ov-file#models- - Attack this model.
Issue
The task is to formulate non-SFT attacks in order to test the robustness of a defence solution. The attack will need to run with an arbitrary domain like medical advice.
Types of Attacks to Implement:
We can use BeaverTails with the currrently trained RepNoise model https://github.com/domenicrosati/representation-noising?tab=readme-ov-file#models- so its not blocked
ToDo:
Notes
In this setting we want to construct non-fine tuning attacks to understand the adversarial robustness of each defence:
Largely this will consist of adding inference-time methods like GCG but instead of adversarial queries of "harmful" material we are constructing adversarial attacks using queries from an unauthorized domain.
@jan -> We also need to add latent adversarial attacks like abliteration/refusal vector since these are becoming more popular.