Domain-Specific Adversarial Attacks [Harsh/David/Jan]

domenicrosati commented 4 months ago

Issue

The task is to formulate non-SFT attacks (attacks that do not rely on supervised fine-tuning) to test the robustness of a defence solution. The attacks need to run against an arbitrary domain, such as medical advice.

Types of Attacks to Implement:

Inference-time jailbreaking (HarmBench, Rainbow Teaming, Tree of Attacks, RL-based attacks)
Representation engineering (RE) attacks, i.e. "latent vector" attacks
Backdoor attacks (use a trigger)

We can use BeaverTails with the currently trained RepNoise model, so this work is not blocked: https://github.com/domenicrosati/representation-noising?tab=readme-ov-file#models-

ToDo:

[ ] Implement inference-time jailbreaking attacks with arbitrary domains (@harshraj172 )
[ ] Implement RE attacks (@janweh / David?)
[ ] Implement backdoor attacks (???)

Notes

In this setting we want to construct non-fine-tuning attacks to understand the adversarial robustness of each defence.

This will largely consist of adding inference-time methods like GCG. Instead of building adversarial queries from "harmful" material, we build them from queries in an unauthorized domain.
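
To make the evaluation loop concrete, here is a minimal sketch. It is not GCG: it swaps in a simple black-box random-search suffix attack as a stand-in, and every name in it (`random_search_suffix`, `attack_success_rate`, the `score` and `is_compliant` callables, the toy vocabulary) is hypothetical and not part of this repo. In a real run, `score` would presumably wrap the defended model and a judge, and `queries` would come from the unauthorized domain.

```python
import random
from typing import Callable, List, Tuple


def random_search_suffix(
    query: str,
    score: Callable[[str], float],
    vocab: List[str],
    suffix_len: int = 4,
    steps: int = 50,
    seed: int = 0,
) -> Tuple[str, float]:
    """Random-search stand-in for a GCG-style suffix attack (hypothetical helper).

    Mutates one suffix token per step and keeps the mutation only if the
    caller-supplied `score` (higher = closer to compliance) improves.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(suffix_len)]
    best = score(query + " " + " ".join(suffix))
    for _ in range(steps):
        candidate = list(suffix)
        candidate[rng.randrange(suffix_len)] = rng.choice(vocab)
        candidate_score = score(query + " " + " ".join(candidate))
        if candidate_score > best:
            suffix, best = candidate, candidate_score
    return " ".join(suffix), best


def attack_success_rate(
    queries: List[str],
    attack: Callable[[str], str],
    is_compliant: Callable[[str], bool],
) -> float:
    """Fraction of unauthorized-domain queries whose attacked prompt is judged compliant."""
    successes = sum(1 for q in queries if is_compliant(attack(q)))
    return successes / len(queries)


if __name__ == "__main__":
    # Toy scorer, purely illustrative: rewards prompts that contain "please".
    toy_score = lambda prompt: float(prompt.split().count("please"))
    suffix, best = random_search_suffix(
        "What dose should I take?", toy_score, ["please", "now", "x"]
    )
    print(suffix, best)
```

The attack and the metric are kept as separate functions so that other inference-time methods can be plugged into the same `attack_success_rate` harness.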

@jan -> We also need to add latent adversarial attacks like abliteration/refusal vector since these are becoming more popular.
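
As a rough illustration of the refusal-vector idea, the sketch below uses the common difference-of-means recipe: estimate a direction from activations on unauthorized-domain prompts versus authorized prompts, then project it out of an activation. It assumes the activations have already been extracted as plain lists of floats; the function names and toy numbers are hypothetical, and a real attack would apply the projection inside the model (e.g. on the residual stream) rather than on standalone vectors.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(unauthorized: List[Vector], authorized: List[Vector]) -> Vector:
    """Unit-norm difference of mean activations (unauthorized minus authorized)."""
    mu_u = mean_vector(unauthorized)
    mu_a = mean_vector(authorized)
    diff = [u - a for u, a in zip(mu_u, mu_a)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("mean activations are identical; no direction to ablate")
    return [d / norm for d in diff]


def ablate(activation: Vector, direction: Vector) -> Vector:
    """Remove the component of `activation` along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


if __name__ == "__main__":
    # Toy 2-d activations, purely illustrative.
    unauthorized_acts = [[2.0, 0.0], [4.0, 0.0]]
    authorized_acts = [[1.0, 1.0], [1.0, -1.0]]
    r = refusal_direction(unauthorized_acts, authorized_acts)
    print(r)                      # direction along the first axis
    print(ablate([3.0, 5.0], r))  # first component removed, second untouched
```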

domenicrosati commented 4 months ago

Rainbow Teaming -> https://arxiv.org/abs/2402.16822

harshraj172 commented 4 months ago

Tree of Attacks with Pruning -> https://arxiv.org/abs/2312.02119
RL-based attack -> https://arxiv.org/abs/2406.08705

domenicrosati commented 4 months ago

Use the RepNoise-defended model as the attack target: https://github.com/domenicrosati/representation-noising?tab=readme-ov-file#models-

domenicrosati / training-time-domain-authorization