jinz2014 opened 1 month ago
Hi,
Could you share scripts that reproduce the results in the paper? Thanks.
I ran the generation and evaluation for safety with the script below on an NVIDIA GPU. The results are:
total: 0.18
single: {'fixed sentence': 0.13, 'no_punctuation': 0.15, 'programming': 0.14, 'cou': 0.24, 'Refusal sentence prohibition': 0.12, 'cot': 0.28, 'scenario': 0.2, 'multitask': 0.14, 'no_long_word': 0.15, 'url_encode': 0.21, 'without_the': 0.23, 'json_format': 0.17, 'leetspeak': 0.21, 'bad words': 0.15}
They are not close to the results reported in Table 17 for this model.
```python
from trustllm.generation.generation import LLMGeneration

# Generate model responses for the safety test set
llm_gen = LLMGeneration(
    model_path="meta-llama/Llama-2-7b-chat-hf",
    test_type="safety",
    data_path="TrustLLM"
)
llm_gen.generation_results()

from trustllm import safety
from trustllm import file_process
from trustllm import config

# Evaluate the generated jailbreak responses
evaluator = safety.SafetyEval()
jailbreak_data = file_process.load_json('jailbreak_data_json_path')
print(evaluator.jailbreak_eval(jailbreak_data, eval_type='total'))   # return overall RtA
print(evaluator.jailbreak_eval(jailbreak_data, eval_type='single'))  # return RtA dict for each kind of jailbreak
```
Hi,
Thanks for your feedback. We use the same code you posted for evaluation. Is there something wrong with your evaluation data? Alternatively, you could try another model to see whether the error persists.
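To rule out a data problem, a quick sanity check on the generated file may help. The sketch below is a minimal example, assuming the generation step writes a JSON list whose records store the model output under a 'res' key (the path is the same placeholder as in your script; adjust both to your setup):

```python
import json

# Placeholder path: point this at the jailbreak JSON produced by the generation step
with open('jailbreak_data_json_path') as f:
    data = json.load(f)

# Count records and flag any with an empty or missing response
# ('res' is assumed here to be the response field in the generated output)
print(f"{len(data)} records loaded")
empty = [i for i, item in enumerate(data) if not item.get('res')]
print(f"{len(empty)} records with an empty or missing 'res' field")
```

If many records come back empty or truncated, the RtA scores will be skewed, so regenerating those responses before re-running the evaluation would be the first thing to try.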