domenicrosati / training-time-domain-authorization

0 stars 0 forks source link

Domain-Specific Adversarial Attacks [Harsh/David/Jan] #4

Open domenicrosati opened 2 weeks ago

domenicrosati commented 2 weeks ago

Issue

The task is to formulate non-SFT attacks in order to test the robustness of a defence solution. The attack will need to run with an arbitrary domain like medical advice.

Types of Attacks to Implement:

We can use BeaverTails with the currrently trained RepNoise model https://github.com/domenicrosati/representation-noising?tab=readme-ov-file#models- so its not blocked

ToDo:

Notes

In this setting we want to construct non-fine tuning attacks to understand the adversarial robustness of each defence:

Largely this will consist of adding inference-time methods like GCG but instead of adversarial queries of "harmful" material we are constructing adversarial attacks using queries from an unauthorized domain.

@jan -> We also need to add latent adversarial attacks like abliteration/refusal vector since these are becoming more popular.

domenicrosati commented 2 weeks ago

Rainbow Teaming -> https://arxiv.org/abs/2402.16822

harshraj172 commented 2 weeks ago

Tree of Attacks with Pruning -> https://arxiv.org/abs/2312.02119 RL based attack -> https://arxiv.org/abs/2406.08705

domenicrosati commented 2 weeks ago

Use RepNoise defended model: https://github.com/domenicrosati/representation-noising?tab=readme-ov-file#models- - Attack this model.