BinX - xiaoweixu - How to eval the ASR of the jailbreak in 1A track？

AISG-Technology-Team / GCSS-Track-1A-Submission-Guide

Submission Guide + Discussion Board for AI Singapore Global Challenge for Safe and Secure LLMs (Track 1A).

16 stars 2 forks source link

BinX - xiaoweixu - How to eval the ASR of the jailbreak in 1A track？ #2

Closed Xuxiaowei-2002 closed 3 months ago

Xuxiaowei-2002 commented 3 months ago

Our method produces outputs that are not overtly toxic and do not have high toxicity scores on the classifiers such as detoxify and perspectiveAPI. Our manual evaluation, however, feels that these outputs are quite harmful. We used the open-source Llama2-7b as a victim model and used chatgpt to judge whether it was jailbroken or not. The result was a 60.7% ASR. so how the jailbreak success rate is assessed is key.

from team BinX—xiaoweixu

aisg-py commented 3 months ago

Hi @Xuxiaowei-2002 .

Thank you for reaching out.

For information on the evaluation metric, please refer to the 'Evaluation Metrics' in this link https://gcss.aisingapore.org/challenge-overview.

Xuxiaowei-2002 commented 3 months ago

Thank you.My question is how to judge the output as a successful jailbroken. Could there be an error here？ And the model is the same as the open-source model or have been fine-tuned and added other security mechanisms？

aisg-py commented 3 months ago

Hi @Xuxiaowei-2002 ,

The victim LLMs used for evaluation are identical to the models provided to all participants, with no modifications or added security mechanisms.

Please note that any further details on judging a successful jailbreak beyond what is available on the challenge portal will not be released to participants during Track 1 to maintain the integrity of the evaluation process.

mtmak-aisg commented 3 months ago

Please follow the prescribed format for naming topics.