About defense value calculation

persistz commented 10 months ago

I noticed that the current defense score calculation differs from the initial version. After reading it, I have several questions about the new method:

For the defense ranking, $v_D$ seems to be the only metric used. Does this mean that, for a defender, absolute defense is the only target? It seems that the number of conversations and the number of tokens used by the attacker are excluded when ranking defenses, right? If defense A is breached 10 times with an average of 1000 chats required, and defense B is breached 2 times with an average of 1 chat required, will B still be ranked ahead of A under the current rules?
When ranking defenses, the max of $v_D$ is taken over the different models. Does this mean that the defender only needs to be excellent in defense on a single model, without considering the average result? If a defender gets a perfect score on LLaMA, will they tie for or win first place regardless of their score on ChatGPT?
For the Base score part, what is the definition of the number of chats? Does it refer to the number of conversation rounds or the number of times a conversation is restarted? If I interact with the model 50 times in a dialog, then click the restart chat with the defense button and have another 50 rounds of conversation, will this count as 100 chats or 2 chats?

dpaleka commented 10 months ago

To clarify, the following decisions were tradeoffs made to simplify the Evaluation phase while keeping the core of the competition intact: make defenses that protect the secret while not decreasing utility too much.

The main concerns are:

There could be too many defenses for attackers to spend enough time on each. We think a plausible failure mode of this competition is "a defense wins because no attackers spent enough time on it" or "multiple defenses have 0 successful attacks".
The formal Attack phase has to end a month before the SaTML onsite event.

This forces us to incentivize attackers to break unbroken defenses. That makes scoring more complex, so we simplify elsewhere by removing incentives to spend time optimizing for "short" attacks. Note that the initial version also left the coefficients used on the tiebreaker metrics (such as number of messages) undetermined.

(...) Does this mean that, for a defender, absolute defense is the only target? (...) If defense A is breached 10 times with an average of 1000 chats required, and defense B is breached 2 times with an average of 1 chat required, will B still be ranked ahead of A under the current rules?

Yes.

Does this mean that the defender only needs to be excellent in defense on a single model, without considering the average result?

This is true. This rule change was announced some time ago due to community feedback. It is difficult to make transferable defenses.
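
To make the rule concrete, here is a minimal sketch of ranking defenses by their best $v_D$ across models. It assumes that a higher $v_D$ is better, as the question above implies. The function name, the score values, and the alphabetical tie-break are all hypothetical and are not taken from the competition code.

```python
from typing import Dict, List


def rank_defenses(v_d: Dict[str, Dict[str, float]]) -> List[str]:
    """Return defense names, best first, ranked by max v_D over models.

    Assumption: higher v_D is better. Ties are broken alphabetically
    only to make the output deterministic; the real rules may differ.
    """
    best = {name: max(per_model.values()) for name, per_model in v_d.items()}
    return sorted(best, key=lambda name: (-best[name], name))


# Hypothetical per-model scores, for illustration only.
scores = {
    "A": {"llama": 0.40, "chatgpt": 0.90},
    "B": {"llama": 1.00, "chatgpt": 0.10},
    "C": {"llama": 0.70, "chatgpt": 0.70},
}

ranking = rank_defenses(scores)
print(ranking)  # ['B', 'A', 'C']
```

Defense B ranks first on the strength of a single model, even though C has the highest average. How many chats or tokens the attackers used does not enter the ranking at all.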

For the Base score part, what is the definition of the number of chats? Does it refer to the number of conversation rounds or the number of times a conversation is restarted? If I interact with the model 50 times in a dialog, then click the restart chat with the defense button and have another 50 rounds of conversation, will this count as 100 chats or 2 chats?

2 chats.

The number of different CHATS as defined in the docs. The API is the ground truth; the dialog interface uses the underlying methods.
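
The difference between counting chats and counting messages can be sketched as follows. The log format and the chat identifiers here are hypothetical; the docs and the API define what actually counts as a chat.

```python
from typing import List, Tuple


def count_chats_and_messages(log: List[Tuple[str, str]]) -> Tuple[int, int]:
    """Given (chat_id, message) pairs, return (distinct chats, total messages)."""
    chat_ids = {chat_id for chat_id, _ in log}
    return len(chat_ids), len(log)


# Hypothetical log: 50 messages in one chat, a restart, then 50 more.
log = [("chat-1", f"msg {i}") for i in range(50)]
log += [("chat-2", f"msg {i}") for i in range(50)]

n_chats, n_messages = count_chats_and_messages(log)
print(n_chats, n_messages)  # 2 100
```

Under this reading, the scenario in the question counts as 2 chats, however many messages each chat contains.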

Just to note, we are currently leaning towards Reconnaissance being available through the dialog interface and through the API, but Evaluation only available through the API, to minimize the chance of technical issues with the interface messing up scoring.

The attackers will always just figure out a good attack via Reconnaissance first anyway, and once you know an attack that works, it should be easy to write the same attack in code, execute it in interactive Python/Jupyter, or even via the OpenAPI interface. We will release example code for Reconnaissance/Evaluation APIs that should be quite straightforward to use.

persistz commented 10 months ago

Thank you for your detailed reply, closed.

ethz-spylab / satml-llm-ctf

About defense value calculation #36