ethz-spylab / satml-llm-ctf

Code used to run the platform for the LLM CTF colocated with SaTML 2024
https://ctf.spylab.ai
MIT License

Clarification regarding llama vs. gpt #28

Closed · NivC closed this issue 10 months ago

NivC commented 11 months ago

Dear Spylab,

Thanks for answering my questions so far!

I have two small clarification questions regarding Llama vs. GPT:

  1. Will the same model be used both for user interaction and for the LLM filter?
  2. Should I design my defense such that the same defense works with both models?

Thanks! Niv.

dedeswim commented 11 months ago

Hi Niv,

  1. Yes
  2. Ideally yes, as half of your points are determined by how much your defense is broken with GPT-3.5 and half by how much it is broken with Llama.

I am going to close this for now. Feel free to re-open if you have further doubts!

NivC commented 11 months ago

Thanks! Niv.

s-zanella commented 11 months ago

I find the answer to the first point entirely reasonable. That is, using the same LLM for answering user queries and for filtering.

However, leaving the choice of which LLM to use to an attacker is much less reasonable and unnecessarily restrictive: in real-world scenarios, a service provider is in complete control of which model their service uses and would be able to tailor defenses to that model. Letting attackers choose the model rules out any defense that optimizes a defense prompt for a specific model and requires defenses to transfer across GPT-3.5 and Llama-2. While transferable defenses are nice to have, I don't see how that would help identify the best (non-transferable) defense for each model.

dpaleka commented 11 months ago

Letting attackers choose the model rules out any defense that optimizes a defense prompt for a specific model and requires defenses to transfer across GPT-3.5 and Llama-2.

This is correct. Our current approach handles this by scoring additively over both models (and not worst-case over models); but we agree this is not an ideal solution. We are working on figuring out what we can implement quickly (two defenses per team, or the defender choosing which model to use); expect an update on the mailing list today or tomorrow.
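
For concreteness, here is a minimal sketch of the difference between the two aggregation schemes mentioned above (additive vs. worst-case over models). This is not the platform's actual scoring code; the function and variable names are hypothetical, and scores are assumed to be on a common scale where higher means the defense is harder to break.

```python
# Hypothetical illustration of the two aggregation schemes discussed above.
# `gpt_score` and `llama_score` are a defense's robustness scores when
# attacked through GPT-3.5 and Llama-2, respectively.

def additive_score(gpt_score: float, llama_score: float) -> float:
    # Current approach: each model contributes half of the total.
    return 0.5 * gpt_score + 0.5 * llama_score

def worst_case_score(gpt_score: float, llama_score: float) -> float:
    # Alternative: a defense is only as strong as its weakest model.
    return min(gpt_score, llama_score)
```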

GQYZ commented 10 months ago

Since I have already created a defense that works with both models, I would like to voice my support for letting the attacker choose the model. I think it will be interesting to see whether there exist generalizable defense strategies that do not rely on model-specific peculiarities, e.g. glitch tokens like <|endoftext|>. I would also like to propose an easy-to-implement compromise: instead of summing the score over both models, allow the defender to choose a weight within a range and compute the score as a weighted sum. For example, if they are more confident in GPT-3.5 than in Llama, they can choose a 0.6-0.4 split (see the sketch below).
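
A minimal sketch of this weighted-sum proposal, assuming the defender picks a weight for GPT-3.5 within an allowed range (the names, signature, and the [0.4, 0.6] bounds are illustrative, not part of the platform):

```python
def weighted_score(gpt_score: float, llama_score: float,
                   w_gpt: float, lo: float = 0.4, hi: float = 0.6) -> float:
    """Defender-chosen weighted sum of per-model scores, with the weight clamped to [lo, hi]."""
    w = min(max(w_gpt, lo), hi)  # keep the split within the allowed range
    return w * gpt_score + (1.0 - w) * llama_score

# Example: a defender more confident in their GPT-3.5 defense picks a 0.6-0.4 split.
score = weighted_score(gpt_score=0.9, llama_score=0.5, w_gpt=0.6)
```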

NivC commented 10 months ago

In any case, I would be happy to see a leaderboard for all options :-)

NivC commented 10 months ago

Thinking about it again, we noticed another issue: the attack/defense setting assumes black-box access to the model. However, one may use interaction with one of the models (say Llama) to gain information about the defense, and then use this information to attack the same defense when it is used with GPT (or other models).

Say I have a defense that is very strong with the GPT model as long as it remains a black box. It may be compromised because the Llama model may leak information not only about the secret, but also about the defense itself (or vice versa).

dedeswim commented 10 months ago

Hi all,

Please see our latest Google Group post here. Feel free to let us know if this clarifies the doubts you may have about the topic.