URL

https://arxiv.org/abs/2305.14540
Affiliations
- Philippe Laban, N/A
- Wojciech Kryściński, N/A
- Divyansh Agarwal, N/A
- Alexander R. Fabbri, N/A
- Caiming Xiong, N/A
- Shafiq Joty, N/A
- Chien-Sheng Wu, N/A
Abstract
- With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals that most LLMs fail on more complex formulations of the task and exposes issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs' ability to reason about facts and detect inconsistencies when they occur.
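Translation (by gpt-3.5-turbo)
With the recent appearance of LLMs in practical settings, methods that effectively detect factual inconsistencies have become important for reducing the spread of misinformation and improving trust in model outputs. Testing on existing factual consistency benchmarks shows that a few LLMs are competitive with traditional non-LLM methods on classification benchmarks for factual inconsistency detection. However, most LLMs fail on more complex formulations of the task, and the analysis exposes problems with existing evaluation benchmarks that affect evaluation precision. To address this, we propose a new protocol for creating inconsistency detection benchmarks and implement it in a 10-domain benchmark called SummEdits. The new benchmark costs 20 times less per sample than previous benchmarks and is highly reproducible, with inter-annotator agreement estimated at about 0.9. Most LLMs struggle on SummEdits, performing close to random chance. Even the best-performing model, GPT-4, is 8% below estimated human performance, showing that LLMs still have gaps in their ability to reason about facts and detect inconsistencies.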
Summary (by gpt-3.5-turbo)
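Detecting factual inconsistencies in LLM outputs is important, but most LLMs fail on more complex formulations of the task, and existing evaluation benchmarks have problems of their own. The authors propose a new protocol for creating inconsistency detection benchmarks and implement it as SummEdits. SummEdits is highly reproducible, and most LLMs struggle on it. Even the best model is 8% below human performance, showing that LLMs still have gaps in their ability to reason about facts and detect inconsistencies.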
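
The abstract reports inter-annotator agreement of about 0.9 but does not say which statistic was used. As an illustration only, the sketch below computes Cohen's kappa, one common chance-corrected agreement measure, for two annotators who label summaries as consistent or inconsistent. The choice of kappa, the function name `cohens_kappa`, and the toy labels are all assumptions made for this note, not details taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Hypothetical helper for illustration; the abstract does not state
    which agreement statistic the authors used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability that both pick the same label if each
    # annotator labelled at random according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_chance == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy data (made up): ten summaries, the annotators disagree on the last one.
annotator_1 = ["consistent", "inconsistent"] * 5
annotator_2 = annotator_1[:-1] + ["consistent"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

On this toy data the raw agreement is 9 out of 10, while kappa is lower because half of that agreement would be expected by chance. This is why a chance-corrected figure near 0.9 indicates a highly reproducible annotation task.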

AkihikoWatanabe / paper_notes

LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond, Philippe Laban+, N/A, arXiv'23 #763
