NivC closed this issue 10 months ago
Hi Niv,
I am going to close this for now. Feel free to re-open if you have further doubts!
Thanks! Niv.
I find the answer to the first point entirely reasonable: that is, using the same LLM both for answering user queries and for filtering.
However, leaving the choice of which LLM to use to an attacker is much less reasonable and unnecessarily restrictive: in real-world scenarios, a service provider is in complete control of which model their service uses and would be able to tailor defenses to that model. Letting attackers choose the model rules out any defense that optimizes a defense prompt for a specific model and requires defenses to transfer across GPT-3.5 and Llama-2. While transferable defenses are nice to have, I don't see how that would help identify the best (non-transferable) defense for each model.
Letting attackers choose the model rules out any defense that optimizes a defense prompt for a specific model and requires defenses to transfer across GPT-3.5 and Llama-2.
This is correct. Our current approach handles this by scoring additively over both models (rather than taking the worst case over models), but we agree this is not an ideal solution. We are working on figuring out what we can implement quickly (two defenses per team, or the defender choosing which model to use); expect an update on the mailing list today or tomorrow.
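To make the difference concrete, here is a minimal sketch of additive versus worst-case aggregation over the two models; this is not the official scoring code, and the per-model scores are made-up numbers for illustration.

```python
# Minimal sketch (not the official scoring code): compare additive and
# worst-case aggregation of per-model defense scores. Numbers are made up.

per_model_scores = {"gpt-3.5": 0.82, "llama-2": 0.47}

additive_score = sum(per_model_scores.values())    # current approach: 0.82 + 0.47 = 1.29
worst_case_score = min(per_model_scores.values())  # alternative: 0.47

print(f"additive: {additive_score:.2f}, worst case: {worst_case_score:.2f}")
```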
Since I have already created a defense that works with both models, I would like to voice my support for letting the attacker choose the model. I think it will be interesting to see whether there exist generalizable defense strategies that do not leverage model-specific peculiarities, e.g. glitch tokens like <|endoftext|>. I would also like to propose an easy-to-implement compromise: instead of summing the score over both models, allow the defender to choose a weight within a range and compute the score as a weighted sum. For example, a defender who is more confident in GPT-3.5 than in Llama could choose a 0.6-0.4 split.
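A minimal sketch of what this weighted-sum scoring could look like; the function, the [0.4, 0.6] weight range, and the example scores are assumptions for illustration, not part of the competition rules.

```python
# Hypothetical weighted-sum scoring with a defender-chosen weight split.
# The allowed [0.4, 0.6] range and the example scores are assumptions.

def weighted_score(score_gpt: float, score_llama: float,
                   w_gpt: float, lo: float = 0.4, hi: float = 0.6) -> float:
    """Combine per-model defense scores with a defender-chosen GPT weight."""
    w_gpt = min(max(w_gpt, lo), hi)  # clamp the split to the allowed range
    return w_gpt * score_gpt + (1.0 - w_gpt) * score_llama

# A defender more confident in GPT-3.5 than in Llama picks a 0.6-0.4 split.
print(weighted_score(0.82, 0.47, w_gpt=0.6))  # 0.6*0.82 + 0.4*0.47 = 0.68
```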
In any case, I would be happy to see a leaderboard for all options :-)
Thinking about it again, we noticed another issue: the attack/defense setting assumes black-box access to the model. However, one may use interaction with one of the LLMs (say, Llama) to gain information about the defense and use this information to attack the same defense when it is used with GPT or other models.
Say I have a defense that is very strong with the GPT model as long as it remains a black box. It may be compromised because the Llama model may leak information not only about the secret but also about the defense itself (or vice versa).
Dear Spylab,
Thanks for answering my questions so far!
I have two small clarification questions regarding Llama vs. GPT:
Thanks! Niv.