Although large language models (LLMs) are widely deployed, the data used to train them is rarely disclosed. Given the incredible scale of this data, up to trillions of tokens, it is all but certain that it includes potentially problematic text such as copyrighted materials, personally identifiable information, and test data for widely reported reference benchmarks. However, we currently have no way to know which data of these types is included or in what proportions. In this paper, we study the pretraining data detection problem: given a piece of text and black-box access to an LLM without knowing the pretraining data, can we determine if the model was trained on the provided text? To facilitate this study, we introduce a dynamic benchmark WIKIMIA that uses data created before and after model training to support gold truth detection. We also introduce a new detection method Min-K% Prob based on a simple hypothesis: an unseen example is likely to contain a few outlier words with low probabilities under the LLM, while a seen example is less likely to have words with such low probabilities. Min-K% Prob can be applied without any knowledge about the pretraining corpus or any additional training, departing from previous detection methods that require training a reference model on data that is similar to the pretraining data. Moreover, our experiments demonstrate that Min-K% Prob achieves a 7.4% improvement on WIKIMIA over these previous methods. We apply Min-K% Prob to two real-world scenarios, copyrighted book detection and contaminated downstream example detection, and find it a consistently effective solution.
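The hypothesis above suggests a simple scoring function: take the k% lowest-probability tokens of the candidate text under the model and average their log-probabilities. The sketch below illustrates that idea only; it is not the authors' implementation, and the `token_log_probs` input is assumed to have been obtained separately from the target LLM (e.g. from its per-token logits).

```python
import math

def min_k_percent_prob(token_log_probs, k=0.2):
    """Illustrative Min-K% Prob-style score (assumption: log-probs
    are precomputed by the target LLM for each token of the text).

    Averages the log-probabilities of the k% least likely tokens.
    A higher (less negative) score means the text has few low-probability
    outlier tokens, which the paper's hypothesis associates with text
    seen during pretraining; a lower score suggests unseen text."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n

# Synthetic illustration (not real model outputs): the "unseen" text
# contains one strong outlier token, dragging its score down.
seen_like   = [-0.3, -0.5, -0.4, -0.6, -0.2]
unseen_like = [-0.3, -0.5, -7.0, -0.6, -0.2]
print(min_k_percent_prob(seen_like, k=0.2))    # averages the 1 lowest log-prob
print(min_k_percent_prob(unseen_like, k=0.2))
```

In practice one would threshold this score (or rank texts by it) to decide membership; choosing k and the threshold is an empirical matter the paper evaluates on WIKIMIA.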