jinz2014 opened 1 month ago
Hi,
Could you share scripts that reproduce the results in the paper? Thanks.
I ran the generation and evaluation for safety with the script below on an NVIDIA GPU. The results are:
total: 0.18
single: {'fixed sentence': 0.13, 'no_punctuation': 0.15, 'programming': 0.14, 'cou': 0.24, 'Refusal sentence prohibition': 0.12, 'cot': 0.28, 'scenario': 0.2, 'multitask': 0.14, 'no_long_word': 0.15, 'url_encode': 0.21, 'without_the': 0.23, 'json_format': 0.17, 'leetspeak': 0.21, 'bad words': 0.15}
They are not close to the results reported in Table 17 for this model.
```python
from trustllm.generation.generation import LLMGeneration

# Generate model responses for the safety test set
llm_gen = LLMGeneration(
    model_path="meta-llama/Llama-2-7b-chat-hf",
    test_type="safety",
    data_path="TrustLLM"
)
llm_gen.generation_results()

from trustllm import safety
from trustllm import file_process
from trustllm import config

# Evaluate the generated jailbreak responses
evaluator = safety.SafetyEval()
jailbreak_data = file_process.load_json('jailbreak_data_json_path')
print(evaluator.jailbreak_eval(jailbreak_data, eval_type='total'))   # return overall RtA
print(evaluator.jailbreak_eval(jailbreak_data, eval_type='single'))  # return RtA dict for each kind of jailbreak
```
Hi,
Thanks for your feedback. We use the same code you posted for evaluation. Is there something wrong with your evaluation data? Alternatively, you could try another model to see whether the error persists.
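To rule out a data problem, a quick sanity check on the generated file may help. The sketch below is a minimal example, assuming the generation step writes a JSON list whose records store the model output under a 'res' key (the path is the same placeholder as in your script; adjust both to your setup):

```python
import json

# Placeholder path: point this at the jailbreak JSON produced by the generation step
with open('jailbreak_data_json_path') as f:
    data = json.load(f)

# Count records and flag any with an empty or missing response
# ('res' is assumed here to be the response field in the generated output)
print(f"{len(data)} records loaded")
empty = [i for i, item in enumerate(data) if not item.get('res')]
print(f"{len(empty)} records with an empty or missing 'res' field")
```

If many records come back empty or truncated, the RtA scores will be skewed, so regenerating those responses before re-running the evaluation would be the first thing to try.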