Open · lintool opened 8 months ago
Relevant paper: https://arxiv.org/abs/2311.01964
Training Data Extraction:
Early Literature Review:
This is the first passage in the MS MARCO passage corpus:
{"id": "0", "contents": "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated."}
Here's me playing around... https://chat.openai.com/share/8b19eeec-4be2-4a16-a1ad-12b68fad81f2
Investigating Data Contamination in Modern Benchmarks for Large Language Models https://arxiv.org/abs/2311.09783
some related papers:
Let's try to characterize data pollution - i.e., has this LLM been pretrained on this corpus?
Simple task: pick a random passage and chop it in half. Feed the first half into the LLM and ask it to complete the passage.
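A minimal sketch of that probe, under some assumptions that are not in the thread: the passage is split at the word midpoint, `complete` is a hypothetical callable wrapping whatever LLM API is being tested, and the score is a word-level `difflib` similarity between the model's continuation and the true second half. A score near 1.0 on many random passages would hint at memorization; it is not proof of contamination on its own.

```python
import json
from difflib import SequenceMatcher
from typing import Callable, Tuple


def split_passage(text: str) -> Tuple[str, str]:
    """Split a passage into two halves at the word midpoint (an assumed convention)."""
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def completion_overlap(completion: str, reference: str) -> float:
    """Word-level similarity in [0, 1] between a completion and the true second half."""
    ref_words = reference.split()
    # Only compare as many words as the reference has, so long outputs are not penalized.
    comp_words = completion.split()[: len(ref_words)]
    return SequenceMatcher(None, comp_words, ref_words, autojunk=False).ratio()


def probe(record_json: str, complete: Callable[[str], str]) -> float:
    """Run the half-passage probe on one JSON record with a "contents" field.

    `complete` is a hypothetical stand-in for an LLM call: it takes the first
    half of the passage and returns the model's continuation as a string.
    """
    record = json.loads(record_json)
    prefix, suffix = split_passage(record["contents"])
    return completion_overlap(complete(prefix), suffix)
```

To use it, pass one line of the corpus (in the `{"id": ..., "contents": ...}` format shown above) together with a function that sends the prefix to the model and returns its text output, then average the scores over a random sample of passages.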