S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models

🏆 Leaderboard

🔔 Updates

📢 If our work is useful for your research, please star ⭐ our project.

📣 [2024/10/25]: We release all 20,000 base risk prompts and 200,000 corresponding attack prompts (Version-0.1.2). We also update 🏆 LeaderBoard_v0.1.2 with new evaluation results including GPT-4 and other models. 🎉 S-Eval has achieved about 7,000 total views and about 2,000 total downloads across multiple platforms. 🎉

📣 [2024/06/17]: We further release 10,000 base risk prompts and 100,000 corresponding attack prompts (Version-0.1.1). If you require automatic safety evaluations, please feel free to submit a request via Issues or contact us by email.

📣 [2024/05/31]: We release 20,000 corresponding attack prompts.

📣 [2024/05/23]: We publish our paper on arXiv and release the first 2,000 base risk prompts. The evaluation results from our experiments are shown in the HuggingFace 🏆 Leaderboard_v0.1.1. You can also download the benchmark from the HuggingFace Dataset.

To keep this benchmark up to date with evolving needs and future evaluation challenges, we warmly welcome submissions of new risks and attacks through this project, so that we can continuously extend our risk taxonomy and adversarial scenarios!

💡 Overview

S-Eval is designed to be a new comprehensive, multi-dimensional and open-ended safety evaluation benchmark. So far, S-Eval has 220,000 evaluation prompts in total (and is still in active expansion), including 20,000 base risk prompts (10,000 in Chinese and 10,000 in English) and 200,000 corresponding attack prompts derived from 10 popular adversarial instruction attacks. These test prompts are generated based on a comprehensive and unified risk taxonomy, specifically designed to encompass all crucial dimensions of LLM safety evaluation and meant to accurately reflect the varied safety levels of LLMs across these risk dimensions. More details on the construction of the test suite including model-based test generation, selection and the expert critique LLM can be found in our paper.

The following table shows the risk dimensions and the number of test prompts in S-Eval for one language; the statistics are identical for Chinese and English.

Risk Dimension	Risk Category	# Base	# Attack
Crimes and Illegal Activities (CI)	Pornographic Contraband	533	5330
Crimes and Illegal Activities (CI)	Drug Crimes	432	4320
Crimes and Illegal Activities (CI)	Dangerous Weapons	487	4870
Crimes and Illegal Activities (CI)	Property Infringement	400	4000
Crimes and Illegal Activities (CI)	Economic Crimes	496	4960
Cybersecurity (CS)	Access Control	228	2280
Cybersecurity (CS)	Hacker Attack	209	2090
Cybersecurity (CS)	Malicious Code	313	3130
Cybersecurity (CS)	Physical Security	252	2520
Data Privacy (DP)	Personal Privacy	668	6680
Data Privacy (DP)	Commercial Secret	674	6740
Ethics and Morality (EM)	Social Ethics	493	4930
Ethics and Morality (EM)	Science Ethics	507	5070
Physical and Mental Health (PM)	Physical Harm	519	5190
Physical and Mental Health (PM)	Mental Health	483	4830
Hate Speech (HS)	Abusive Curses	296	2960
Hate Speech (HS)	Cyberbullying	303	3030
Hate Speech (HS)	Defamation	292	2920
Hate Speech (HS)	Threaten and Intimidate	302	3020
Extremism (EX)	Violent Terrorist Activities	207	2070
Extremism (EX)	Social Disruption	366	3660
Extremism (EX)	Extremist Ideological Trends	524	5240
Inappropriate Suggestions (IS)	Finance	341	3410
Inappropriate Suggestions (IS)	Medicine	338	3380
Inappropriate Suggestions (IS)	Law	337	3370
Total	-	10000	100000
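
As a quick sanity check on the table above, the following minimal Python sketch recomputes the per-dimension and overall totals. The counts are copied from the table; the names `BASE_COUNTS`, `ATTACKS_PER_BASE` and `dimension_totals` are hypothetical and not part of any S-Eval tooling. It assumes that every base prompt yields exactly one attack prompt per attack method (10 methods), which matches the ratio shown in the table.

```python
# Base prompt counts per risk category for one language, copied from the table above.
# Keys are the dimension abbreviations used in the table.
BASE_COUNTS = {
    "CI": {"Pornographic Contraband": 533, "Drug Crimes": 432, "Dangerous Weapons": 487,
           "Property Infringement": 400, "Economic Crimes": 496},
    "CS": {"Access Control": 228, "Hacker Attack": 209, "Malicious Code": 313,
           "Physical Security": 252},
    "DP": {"Personal Privacy": 668, "Commercial Secret": 674},
    "EM": {"Social Ethics": 493, "Science Ethics": 507},
    "PM": {"Physical Harm": 519, "Mental Health": 483},
    "HS": {"Abusive Curses": 296, "Cyberbullying": 303, "Defamation": 292,
           "Threaten and Intimidate": 302},
    "EX": {"Violent Terrorist Activities": 207, "Social Disruption": 366,
           "Extremist Ideological Trends": 524},
    "IS": {"Finance": 341, "Medicine": 338, "Law": 337},
}

# Assumption: one attack prompt per base prompt for each of the 10 attack methods.
ATTACKS_PER_BASE = 10


def dimension_totals(counts):
    """Sum the base prompt counts of all categories within each risk dimension."""
    return {dim: sum(categories.values()) for dim, categories in counts.items()}


totals = dimension_totals(BASE_COUNTS)
base_total = sum(totals.values())
attack_total = base_total * ATTACKS_PER_BASE
num_categories = sum(len(categories) for categories in BASE_COUNTS.values())

for dim, total in totals.items():
    print(f"{dim}: {total} base / {total * ATTACKS_PER_BASE} attack")
print(f"{num_categories} categories, {base_total} base, {attack_total} attack")
```

The recomputed totals agree with the last row of the table (10,000 base and 100,000 attack prompts per language) and with the 25 risk categories of the taxonomy.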

❗️ Note

For safety reasons, the released benchmark mixes only a small number of high-risk prompts with a selection of low-risk prompts.

📖 Risk Taxonomy

Our risk taxonomy has a structured hierarchy with four levels, comprising 8 risk dimensions, 25 risk categories, 56 risk subcategories, and 52 risk sub-subcategories. The first-level risk dimensions and second-level risk categories are shown in the following:

⚖️ Risk Evaluation Model

To validate the effectiveness of our risk evaluation model, we construct a test suite of 1,000 Chinese QA pairs and 1,000 English QA pairs collected from Qwen-7B-Chat and annotated manually. We compare our risk evaluation model with three baseline methods: Rule Matching, GPT-based and LLaMA-Guard-2.

For each method, we calculate balanced accuracy as well as precision and recall for every label (i.e. safe/unsafe). The bold value indicates the best.

Method	Chinese ACC	Chinese Precision (safe/unsafe)	Chinese Recall (safe/unsafe)	English ACC	English Precision (safe/unsafe)	English Recall (safe/unsafe)
Rule Matching	74.12	78.46/74.44	87.08/61.15	70.19	69.42/72.01	77.54/62.84
GPT-4-Turbo	78.00	79.19/94.07	97.74/58.27	72.36	66.84/93.83	97.12/47.60
LLaMA-Guard-2	76.23	77.68/95.37	98.38/57.07	69.32	64.30/93.81	97.50/41.13
Ours	92.23	93.36/92.37	95.48/88.98	88.23	86.36/90.97	92.32/84.13
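
To make the metrics in the table above concrete, here is a minimal Python sketch of how per-label precision and recall and balanced accuracy can be computed from gold and predicted labels. It assumes the standard definitions (balanced accuracy as the mean of the per-label recalls); the function names and the toy labels are hypothetical and this is not the evaluation code used in the paper.

```python
from collections import Counter


def per_label_metrics(y_true, y_pred, labels=("safe", "unsafe")):
    """Return precision and recall for each label from gold and predicted labels."""
    pairs = Counter(zip(y_true, y_pred))  # (gold, predicted) -> count
    metrics = {}
    for label in labels:
        true_positive = pairs[(label, label)]
        predicted = sum(n for (_, pred), n in pairs.items() if pred == label)
        actual = sum(n for (gold, _), n in pairs.items() if gold == label)
        metrics[label] = {
            "precision": true_positive / predicted if predicted else 0.0,
            "recall": true_positive / actual if actual else 0.0,
        }
    return metrics


def balanced_accuracy(metrics):
    """Balanced accuracy: the unweighted mean of the per-label recalls."""
    recalls = [m["recall"] for m in metrics.values()]
    return sum(recalls) / len(recalls)


# Toy example (made-up labels): 6 safe and 4 unsafe gold responses.
y_true = ["safe"] * 6 + ["unsafe"] * 4
y_pred = ["safe"] * 5 + ["unsafe"] + ["unsafe"] * 3 + ["safe"]

metrics = per_label_metrics(y_true, y_pred)
print(metrics)
print(f"balanced accuracy: {balanced_accuracy(metrics):.4f}")
```

Under this definition, the ACC column can be cross-checked against the recall column: for the Chinese "Ours" row, (95.48 + 88.98) / 2 = 92.23.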

🏆 Leaderboard

You can get more detailed results from the Leaderboard.

Base Risk Prompt Set

Chinese

English

Attack Prompt Set

Chinese

English

📄 Citation

If our work is useful for your own, please cite us with the following BibTeX entry:

@article{yuan2024seval,
  title={S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models},
  author={Xiaohan Yuan and Jinfeng Li and Dongxia Wang and Yuefeng Chen and Xiaofeng Mao and Longtao Huang and Hui Xue and Wenhai Wang and Kui Ren and Jingyi Wang},
  journal={arXiv preprint arXiv:2405.14191},
  year={2024}
}

⚠️ Disclaimer

S-Eval may contain offensive or upsetting content, is intended for legitimate academic research only, and is strictly prohibited for use in any commercial endeavor or for any other illegal purpose. The views expressed in the benchmark are not related to the organizations, authors and affiliated entities involved in this project. All consequences arising from the use of this benchmark are the sole responsibility of the user. This benchmark may not be modified, distributed or otherwise misused without express permission. If you have any questions, please contact xiaohanyuan@zju.edu.cn.

🪪 License

The S-Eval benchmark is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, the text of which can be found in the LICENSE file.
