Difference to HELM benchmark

Thanks for the question. Overall, the DecodingTrust focuses on the comprehensive trustworthiness evaluations for LLMs, while HELM mainly focuses on comprehensive benign evaluation scenarios.

The detailed differences lie in three folds:

Considered perspectives (Comprehensive/holistic evaluation vs. trustworthiness evaluation) HELM and DecodingTrust have some overlapping perspectives, such as toxicity, bias, robustness, and fairness. However, since HELM lays emphasis on comprehensiveness, it includes perspectives like calibration and efficiency, which is not considered in DecodingTrust. And since DecodingTrust lays emphasis on trustworthiness, it includes as many trustworthiness perspectives as possible, and includes perspectives like ethics and privacy, which is not considered in HELM.
For each overlapping perspective, we also have a different focus from HELM. Taking “robustness” as an example, HELM studies robustness against (a) small semantics-preserving perturbations and (b) semantics-altering perturbations. In contrast, DecodingTrust considers three high-level robustness—(a) adversarial robustness, (b) OOD robustness, (c) robustness against adversarial demonstrations. And under each type of robustness, we further adopt various kinds of perturbations. (e.g., for (c), we study robustness against counterfactual demonstrations/spurious correlations in demonstrations/backdoored demonstrations)
Findings As a very concise summarization, DecodingTrust reveals (a) the performance of GPT models under different trustworthiness perspectives, and (b) the resilience of their performance in adversarial environments (e.g., adversarial system/user prompts, demonstrations). However, the second aspect hasn’t been well explored in HELM.

AI-secure / DecodingTrust

Difference to HELM benchmark #1