Model identity between the reconnaissance and attack Phases

NivC commented 8 months ago

Dear Spylab,

Thanks again for maintaining this!

I have another clarification question about the rules: During the reconnaissance phase we freely interact with the models. Would we be able to uniquely identify the model identity in the attack phase, to use specific knowledge about this model gained in the reconnaissance phase?

Also, I wanted to suggest that you might want to scale the scoring for each model by the number of attacks that succeeded against it. In that way, breaking a strong defense would give one more points (and it would make using more tokens or more messages beneficial for against stronger defneses).

Happy new year, Niv.

dpaleka commented 8 months ago

Would we be able to uniquely identify the model identity in the attack phase, to use specific knowledge about this model gained in the reconnaissance phase?

(Assuming by "model" it is "defense") Yeah, definitely. Our current implementation assumes the defense IDs will be identical; in any case it will be indicated somehow.

I wanted to suggest that you might want to scale the scoring for each model by the number of attacks that succeeded against it.

This is our current draft (not final, especially wrt what exactly is counted in P_{D, i} and how many {i} we have): Screenshot 2024-01-05 at 03-59-35 LLMs CTF Rules and Instructions

Does this conform to what you suggested? We want to have "breaking a strong defense would give one more points", but also a leaderboard (so attackers know which defenses give more points), but this means we also need incentives for attackers to actually provide information for the leaderboard.

regoar commented 8 months ago

Hi. It appears to be different from the existing published draft, which includes a deduction for each message and tokens. Now the deduction seems to be based solely on the chat number?

Anyway, I like to propose a different idea for scoring. Each defense safeguards a secret worth 1 point. This point is distributed among all successful attackers, depending on how effectively they performed:

$Effort(K, I)$ is the number of chats required by team $K$ to 'breach' defense $I$, and $\infty$ if the team did not reveal the secret. $$DefenseScore\big(I) = \sum\limits_{\LARGE{J} \atop \text{all teams}}\frac{1}{Effort(J, I)}$$ $$AttckScore\big(K, I) = \frac{1}{Effort(K, I)} \cdot \frac{1}{DefenseScore\big(I \big)}$$

$Effort(K, I)$ can be easily modified to include factors such as input tokens/messages sent, etc., and to adopt a non-linear approach. The total score for a team attack is the aggregate of its scores across all defenses. Specifically, an impenetrable defense scores 0 points.

dpaleka commented 8 months ago

https://ctf.spylab.ai/static/rules.pdf contains our current thinking on the scoring of the Attack phase.

Hi. It appears to be different from the existing published draft, which includes a deduction for each message and tokens. Now the deduction seems to be based solely on the chat number?

The message and token deductions were intended as "tiebreakers" (to fuzzily rank the teams who break the same defenses) in the case there was "one opportunity for the attacker to chat and discover the secret". We thought for a bit and figured that:

even with reconnaissance, it's going to be tough for the attackers to succeed on the first try;
having one chance at breaking any given (defense, secret) in the Evaluation phase is very unforgiving in case the attacker just messes something up. This is opposite of standard security settings, where the attacker may have some time and opportunity constraints, but we still never deem something secure if the security testing involved the attacker having only one attempt.

Hence, we now have multiple chat opportunities for the attacker, and the number of chats itself presents itself as a nice tiebreaker/deduction... So we decided to remove the other ones to make the scoring simpler. In case there are many defenses, we might still simplify by reducing T to 1.

Anyway, I like to propose a different idea for scoring. Each defense safeguards a secret worth 1 point. This point is distributed among all successful attackers, depending on how effectively they performed:

We'll consider this, and we did discuss similar CTF-like scoring rules, but currently we prefer simply reducing the defense's value by ~1% each time any attacker breaks it, instead of tracking the attack and defense scores in such an interdependent way.

NivC commented 8 months ago

Thanks for the answer @dpaleka! And thanks for the suggestion @regoar. I like that this suggestion scales well without knowing in advance the number of participating teams, or how hard the defenses would turn out to be.

NivC commented 8 months ago

Thinking about it again, I would like to ask another question on this topic @dpaleka: If the attack scores is proportional to the number of sessions, would there not be any penalty for extremely long sessions? In such a case, an attacker could use a single session to attempt many different attack strategy; or am I missing something?

Thanks! Niv.

ethz-spylab / satml-llm-ctf

Model identity between the reconnaissance and attack Phases #23