HowieHwong / TrustLLM

[ICML 2024] TrustLLM: Trustworthiness in Large Language Models
https://trustllmbenchmark.github.io/TrustLLM-Website/
MIT License

scripts to reproduce the results in the paper #25

Open jinz2014 opened 1 month ago

jinz2014 commented 1 month ago

Hi

Could you share the scripts to reproduce the results in the paper? Thanks.

I ran generation and evaluation for the safety task with the script below on an NVIDIA GPU. The results are:

total: 0.18

single: {'fixed sentence': 0.13, 'no_punctuation': 0.15, 'programming': 0.14, 'cou': 0.24, 'Refusal sentence prohibition': 0.12, 'cot': 0.28, 'scenario': 0.2, 'multitask': 0.14, 'no_long_word': 0.15, 'url_encode': 0.21, 'without_the': 0.23, 'json_format': 0.17, 'leetspeak': 0.21, 'bad words': 0.15}

These numbers are not close to the results reported in Table 17 for this model.
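As a sanity check (assuming the overall score is simply the unweighted mean of the per-type RtA values, which the numbers above suggest), the two outputs are at least internally consistent:

# Hypothetical sanity check: average the per-type RtA values and compare with the reported total.
single = {'fixed sentence': 0.13, 'no_punctuation': 0.15, 'programming': 0.14, 'cou': 0.24,
          'Refusal sentence prohibition': 0.12, 'cot': 0.28, 'scenario': 0.2, 'multitask': 0.14,
          'no_long_word': 0.15, 'url_encode': 0.21, 'without_the': 0.23, 'json_format': 0.17,
          'leetspeak': 0.21, 'bad words': 0.15}
print(round(sum(single.values()) / len(single), 2))  # 0.18, matching the 'total' output above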


from trustllm.generation.generation import LLMGeneration

# Generate model responses for the safety test section.
llm_gen = LLMGeneration(
    model_path="meta-llama/Llama-2-7b-chat-hf",  # Hugging Face model to evaluate
    test_type="safety",                          # TrustLLM test section
    data_path="TrustLLM"                         # path to the downloaded TrustLLM dataset
)

llm_gen.generation_results()

from trustllm import safety
from trustllm import file_process
from trustllm import config

evaluator = safety.SafetyEval()

jailbreak_data = file_process.load_json('jailbreak_data_json_path')
print(evaluator.jailbreak_eval(jailbreak_data, eval_type='total'))   # returns the overall RtA
print(evaluator.jailbreak_eval(jailbreak_data, eval_type='single'))  # returns an RtA dict per jailbreak type
HowieHwong commented 1 month ago

Hi,

Thanks for your feedback. We use the same code you posted for evaluation. Could there be something wrong with your evaluation data? You could also try another model to see whether the discrepancy persists.
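A quick way to check the evaluation data is to inspect the loaded JSON before running the evaluator. This is a minimal sketch, assuming the generated file is a list of JSON records; the path placeholder is the same one used in the snippet above:

from trustllm import file_process

# Load the generated jailbreak responses and inspect a few records
# before passing them to SafetyEval (path placeholder as in the snippet above).
jailbreak_data = file_process.load_json('jailbreak_data_json_path')
print(len(jailbreak_data))        # number of records generated
print(jailbreak_data[0].keys())   # fields of the first record
print(jailbreak_data[0])          # full first record, to spot empty or truncated responses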