Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to two real-world scenarios, copyrighted book detection and contaminated downstream example detection, and find it a consistently effective solution.
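The hypothesis above suggests a simple scoring function: take the k% lowest-probability tokens of the candidate text under the model and average their log-probabilities. The sketch below illustrates that idea only; it is not the authors' implementation, and the `token_log_probs` input is assumed to have been obtained separately from the target LLM (e.g. from its per-token logits).

```python
import math

def min_k_percent_prob(token_log_probs, k=0.2):
    """Illustrative Min-K% Prob-style score (assumption: log-probs
    are precomputed by the target LLM for each token of the text).

    Averages the log-probabilities of the k% least likely tokens.
    A higher (less negative) score means the text has few low-probability
    outlier tokens, which the paper's hypothesis associates with text
    seen during pretraining; a lower score suggests unseen text."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n

# Synthetic illustration (not real model outputs): the "unseen" text
# contains one strong outlier token, dragging its score down.
seen_like   = [-0.3, -0.5, -0.4, -0.6, -0.2]
unseen_like = [-0.3, -0.5, -7.0, -0.6, -0.2]
print(min_k_percent_prob(seen_like, k=0.2))    # averages the 1 lowest log-prob
print(min_k_percent_prob(unseen_like, k=0.2))
```

In practice one would threshold this score (or rank texts by it) to decide membership; choosing k and the threshold is an empirical matter the paper evaluates on WIKIMIA.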