A crucial issue of current text generation models is that they often uncontrollably generate text that is factually inconsistent with inputs. Due to lack of annotated data, existing factual consistency metrics usually train evaluation models on synthetic texts or directly transfer from other related tasks, such as question answering (QA) and natural language inference (NLI). Bias in synthetic text or upstream tasks makes them perform poorly on text actually generated by language models, especially for general evaluation for various tasks. To alleviate this problem, we propose a weakly supervised framework named WeCheck that is directly trained on actual generated samples from language models with weakly annotated labels. WeCheck first utilizes a generative model to infer the factual labels of generated samples by aggregating weak labels from multiple resources. Next, we train a simple noise-aware classification model as the target metric using the inferred weakly supervised information. Comprehensive experiments on various tasks demonstrate the strong performance of WeCheck, achieving an average absolute improvement of 3.3% on the TRUE benchmark over 11B state-of-the-art methods using only 435M parameters. Furthermore, it is up to 30 times faster than previous evaluation methods, greatly improving the accuracy and efficiency of factual consistency evaluation.
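
The two stages described in the abstract (aggregating weak labels into one inferred label, then training against that noisy label) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the generative model, so a simple naive-Bayes label model with assumed, fixed source accuracies stands in for it, and a soft-label binary cross-entropy stands in for the noise-aware objective. The function names, the example votes, and the accuracy values are all hypothetical.

```python
import math


def aggregate_weak_labels(votes, accuracies, prior=0.5):
    """Return P(consistent | votes) under a naive-Bayes label model.

    ASSUMPTION: each weak source is independently correct with a fixed,
    known accuracy; the paper's generative model may differ.

    votes:      list of weak labels, 1 = consistent, 0 = inconsistent,
                None = the source abstains.
    accuracies: assumed accuracy of each source, strictly between 0 and 1.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for vote, acc in zip(votes, accuracies):
        if vote is None:
            continue  # an abstaining source contributes no evidence
        weight = math.log(acc / (1.0 - acc))
        log_odds += weight if vote == 1 else -weight
    return 1.0 / (1.0 + math.exp(-log_odds))


def soft_cross_entropy(pred_prob, soft_label, eps=1e-12):
    """Binary cross-entropy against a probabilistic (noisy) target.

    Training on the aggregated probability instead of a hard 0/1 label
    is one simple way to keep label uncertainty in the loss.
    """
    p = min(max(pred_prob, eps), 1.0 - eps)
    return -(soft_label * math.log(p) + (1.0 - soft_label) * math.log(1.0 - p))


# Hypothetical example: three weak sources (say, two NLI-style and one
# QA-style metric) vote on a single generated sample.
soft_label = aggregate_weak_labels([1, 1, 0], [0.8, 0.7, 0.6])
loss = soft_cross_entropy(pred_prob=0.9, soft_label=soft_label)
```

In this sketch the classifier would be trained by minimizing `soft_cross_entropy` over many generated samples, with `pred_prob` produced by the model being trained.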

Translation (by gpt-3.5-turbo)

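A key problem with current text generation models is that they cannot be kept from producing text that is factually inconsistent with the input. Because annotated data is scarce, existing factual consistency metrics usually train their evaluation models on synthetic text or transfer directly from related tasks such as question answering (QA) and natural language inference (NLI). Bias in the synthetic text or the upstream tasks degrades these metrics on text actually generated by language models, especially in general-purpose evaluation across various tasks. To alleviate this problem, we propose WeCheck, a weakly supervised framework that is trained directly on actual samples generated by language models, using weakly annotated labels. WeCheck first uses a generative model to infer the factual labels of generated samples by aggregating weak labels from multiple resources. Next, it trains a simple noise-aware classification model as the target metric using the inferred weakly supervised information. Comprehensive experiments on various tasks demonstrate WeCheck's strong performance: using only 435M parameters, it achieves an average absolute improvement of 3.3% on the TRUE benchmark over 11B state-of-the-art methods. It is also up to 30 times faster than previous evaluation methods, substantially improving both the accuracy and the efficiency of factual consistency evaluation.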
Summary (by gpt-3.5-turbo)
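Current text generation models cannot be kept from generating text that contradicts the input. To address this, we propose WeCheck, a weakly supervised framework that is trained directly on actual samples generated by language models, using weakly annotated labels. Experiments on various tasks show WeCheck's strong performance: it is faster than previous evaluation methods and improves both the accuracy and the efficiency of factual consistency evaluation.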

WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning, ACL'23 #866
