persistz closed 10 months ago
To clarify, the following decisions were tradeoffs made to simplify the Evaluation phase while keeping the core of the competition intact: make defenses that protect the secret while not decreasing utility too much.
The main concerns are:
This forces us to incentivize attackers to break unbroken defenses. This makes scoring more complex, so we simplify on the other side by removing incentives to spend time optimizing for "short" attacks. Note that the initial version also left the coefficients used on the tiebreaker metrics (such as number of messages) undetermined.
(...) which means that as a defender, the absolute defense is the only target? (...) If defense A is breached 10 times with an average of 1000 chats required, and defense B is breached 2 times with an average of 1 chat required, then according to the current rules, B will still be ranked before A?
Yes.
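To make the A versus B example concrete, here is a minimal sketch of the ordering implied by that answer. It is not the official scoring code: the `Defense` class, its field names, and `rank_defenses` are hypothetical, and the only rule it encodes is the one confirmed above, that a defense breached fewer times ranks first regardless of how many chats each breach took.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defense:
    # Hypothetical record, for illustration only.
    name: str
    times_broken: int          # how many times the defense was breached
    avg_chats_to_break: float  # average number of chats per breach


def rank_defenses(defenses):
    """Order defenses best-first by breach count only.

    avg_chats_to_break is deliberately ignored here: per the answer above,
    it does not let a more-breached defense overtake a less-breached one.
    """
    return sorted(defenses, key=lambda d: d.times_broken)


a = Defense("A", times_broken=10, avg_chats_to_break=1000.0)
b = Defense("B", times_broken=2, avg_chats_to_break=1.0)

ranking = [d.name for d in rank_defenses([a, b])]
print(ranking)  # ['B', 'A']
```

How ties between defenses with the same breach count are resolved is not covered by this sketch.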
Does this mean that the defender only needs to be excellent in defense on a single model, without considering the average result?
This is true. This rule change was announced some time ago due to community feedback. It is difficult to make transferable defenses.
For the Base score part, what is the definition of number of chats? Does it refer to the number of conversations, or to the number of times a conversation is restarted? If I interact with the model 50 times in a dialog, then click the "restart chat with the defense" button and have another 50 rounds of conversation, will this count as 100 chats or 2 chats?
2 chats. It is the number of different CHATS as defined in the docs. The API is the ground truth; the dialog interface uses the underlying methods.
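As a rough illustration of the distinction between messages and chats, the sketch below counts distinct chats in a log of messages. The `(chat_id, text)` record shape and the `count_chats` helper are assumptions made for this example, not part of the competition API.

```python
def count_chats(messages):
    """Return the number of distinct chats in a list of (chat_id, text) pairs."""
    return len({chat_id for chat_id, _ in messages})


# 50 messages in one chat, then a restart and 50 messages in a second chat.
messages = [("chat-1", f"message {i}") for i in range(50)]
messages += [("chat-2", f"message {i}") for i in range(50)]

print(len(messages))          # 100 messages in total
print(count_chats(messages))  # 2 chats
```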
Just to note, we are currently leaning towards Reconnaissance being available through the dialog interface and through the API, but Evaluation only available through the API, to minimize the chance of technical issues with the interface messing up scoring.
The attackers will always just figure out a good attack via Reconnaissance first anyway, and once you know an attack that works, it should be easy to write the same attack in code, execute it in interactive Python/Jupyter, or even via the OpenAPI interface. We will release example code for Reconnaissance/Evaluation APIs that should be quite straightforward to use.
Thank you for your detailed reply, closed.
I noticed that the current defense score calculation method is different from the initial version. After reading it, I have several questions about the new defense score calculation method: