This is the official repository for Alignment for Honesty.
As for the "HHH" alignment princeple, while there has been a significant focus on enhancing the helpfulness and harmlessness of LLMs, honesty has received relatively less attention in research. In this work, we define honesty of a LLM as its capability to proactively refuse to answer questions when they lack knowledge, while still not being overly conservative, as illustrated in the following figure. In this way, alignment for honesty can mitigate hallucinations and enhance the trustworthiness of LLMs without resorting to external resources.
We provide the training and evaluation data, as well as the processing code, in `data`. Please refer to the corresponding README for more information.
We provide the code for processing training data following our proposed honesty-oriented supervised fine-tuning methods in `train`. An overview of the alignment strategies is shown in the following figure.
Please note that our resources do not include code for full-parameter fine-tuning of LLMs. We use CoLLiE in the paper, but you are free to choose any other LLM training framework that suits your needs.
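To give a rough sense of what honesty-oriented data processing can look like, the sketch below assigns each question either a correct answer or an "I don't know"-style response depending on the model's expected accuracy over several sampled answers. This is a simplified illustration only: the function names, the refusal wording, the `idk_threshold` value, and the `is_correct` helper are placeholders, and the actual construction for each proposed method lives in `train`.

```python
import random

# Placeholder refusal text; the actual wording used for training may differ.
IDK_RESPONSE = "I apologize, but I'm not able to provide a definitive answer to this question."

def annotate_example(question, gold_answer, sampled_responses, is_correct, idk_threshold=0.1):
    """Illustrative only: choose the training target for one question based on
    the model's expected accuracy over several sampled responses."""
    # Expected accuracy: fraction of sampled responses judged correct.
    expected_accuracy = sum(
        is_correct(r, gold_answer) for r in sampled_responses
    ) / len(sampled_responses)

    if expected_accuracy <= idk_threshold:
        # The model is unlikely to know the answer: train it to decline.
        target = IDK_RESPONSE
    else:
        # The model likely knows the answer: train it on a correct sampled response.
        correct = [r for r in sampled_responses if is_correct(r, gold_answer)]
        target = random.choice(correct)

    return {"instruction": question, "output": target, "expected_accuracy": expected_accuracy}
```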
In the paper, we evaluate the aligned models on several datasets: the public datasets TriviaQA, Non-AmbigQA, and MMLU, as well as two datasets we construct ourselves, PUQA and PKQA. Detailed evaluation code can be found in `evaluation`.
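As a schematic illustration of the kind of evaluation involved (not necessarily the exact metric definitions from the paper; see `evaluation` for those), one can label each test question by whether the model knows the answer and whether the aligned model declines, then reward refusals on unknown questions and penalize refusals on known ones:

```python
from dataclasses import dataclass

@dataclass
class Judgement:
    model_knows: bool  # whether the model can answer this question correctly
    declined: bool     # whether the aligned model gave an "I don't know"-style reply

def honesty_metrics(judgements):
    """Illustrative honesty-style metrics: declining when knowledge is lacking
    is rewarded, while declining on answerable questions is penalized."""
    unknown = [j for j in judgements if not j.model_knows]
    known = [j for j in judgements if j.model_knows]

    # Prudence-style score: how often the model declines on questions it does not know.
    prudence = sum(j.declined for j in unknown) / max(len(unknown), 1)
    # Over-conservativeness-style score: how often it declines on questions it does know.
    over_conservativeness = sum(j.declined for j in known) / max(len(known), 1)

    return {"prudence": prudence, "over_conservativeness": over_conservativeness}
```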
To say "I know" when you know, and "I don't know" when you don't, that is wisdom. — The Analects of Confucius
Our two best honesty-aligned models are now available on the Hugging Face Hub:
| Model Name | HF Checkpoint | Size | License |
| --- | --- | --- | --- |
| confucius-confidence-verb | 🤗 [GAIR/confucius-confidence-verb](https://huggingface.co/GAIR/confucius-confidence-verb) | 13B | Llama2-Chat |
| confucius-multisample | 🤗 [GAIR/confucius-multisample](https://huggingface.co/GAIR/confucius-multisample) | 13B | Llama2-Chat |
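The checkpoints can be loaded with standard Hugging Face `transformers` tooling. The snippet below is a minimal sketch; the prompt format shown is an assumption, so please consult the model cards for the recommended template.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "GAIR/confucius-confidence-verb"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Prompt format is assumed here; check the model card for the exact template.
prompt = "Question: Who painted the ceiling of the Sistine Chapel?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```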
The following two examples illustrate the significance and potential of alignment for honesty.
We acknowledge that there is still significant room for improvement, particularly in areas such as calibration and generalization across different families of LLMs. We will focus on these refinements in future work.
If you find our work useful, please cite our paper:
@article{yang2023alignment,
  title={Alignment for Honesty},
  author={Yang, Yuqing and Chern, Ethan and Qiu, Xipeng and Neubig, Graham and Liu, Pengfei},
  journal={arXiv preprint arXiv:2312.07000},
  year={2023}
}