AkihikoWatanabe commented 1 year ago

URL

https://arxiv.org/abs/2305.11364
Affiliations
- Emily Reif, N/A
- Minsuk Kahng, N/A
- Savvas Petridis, N/A
Abstract
- Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactically and lexically. We present LinguisticLens, a novel interactive visualization tool for making sense of and analyzing syntactic diversity of LLM-generated datasets. LinguisticLens clusters text along syntactic, lexical, and semantic axes. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples. The live demo is available at shorturl.at/zHOUV.
Summary (by gpt-3.5-turbo)
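This paper presents LinguisticLens, a new visualization tool for understanding and analyzing the syntactic diversity of datasets generated with LLMs. The tool clusters text along syntactic, lexical, and semantic axes and supports hierarchical visualization. A live demo is available at shorturl.at/zHOUV.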

AkihikoWatanabe commented 1 year ago

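Datasets generated by LLMs via few-shot prompting are hard to understand and evaluate, and the failure modes of LLM-generated data are not well understood to begin with (although some, e.g. repetition, are known). To help with this, the work clusters the data along syntactic, lexical, and semantic axes and visualizes the result, so that the characteristics of an LLM-generated dataset become easier to grasp.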

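In particular, prior work has relied on methods that assume gold data exists (e.g. judging quality by downstream-task performance when the generated data is used, or comparing the generated distribution against the gold distribution). Such methods cannot handle synthetic-data-first situations where no gold data exists. To address this, the paper proposes a method for the gold-free case: cluster the data by syntax, lexicon, and semantics, visualize the result, and verify dataset quality in a human-in-the-loop framework.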
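
To make the gold-free idea concrete, here is a minimal sketch of clustering along the lexical axis only. It is not the LinguisticLens implementation. The token-set Jaccard similarity, the single-link merging, the `threshold` value, and all function names are assumptions chosen for illustration. A human would then skim the largest clusters to spot repetitive generations.

```python
from __future__ import annotations

from itertools import combinations


def tokens(text: str) -> frozenset[str]:
    """Lowercased whitespace tokens with surrounding punctuation stripped."""
    return frozenset(w.strip(".,!?;:\"'").lower() for w in text.split()) - {""}


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Token-set overlap in [0, 1]; 1.0 means identical vocabularies."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_lexical_overlap(texts: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Single-link clustering: link two texts when their Jaccard >= threshold.

    Returns clusters of indices into `texts`, largest cluster first.
    The threshold is a hypothetical default, not a value from the paper.
    """
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    bags = [tokens(t) for t in texts]
    for i, j in combinations(range(len(texts)), 2):
        if jaccard(bags[i], bags[j]) >= threshold:
            parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical generated examples; three share nearly the same wording.
example_texts = [
    "The food was great and the service was great.",
    "The food was great and the staff was great.",
    "I would never come back here again.",
    "The pizza was great and the service was great.",
]

for cluster in cluster_by_lexical_overlap(example_texts):
    print(len(cluster), [example_texts[i] for i in cluster])
```

No gold data is involved: the largest cluster simply surfaces near-duplicate wording for a person to inspect. The syntactic and semantic axes would need other features (for instance part-of-speech sequences or sentence embeddings), which this sketch leaves out.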

AkihikoWatanabe commented 1 year ago

Visualization example

AkihikoWatanabe commented 1 year ago

Implementation: https://github.com/PAIR-code/interpretability/tree/master/data-synth-syntax

Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models, Emily Reif+, N/A, arXiv'23 #702
