castorini / ura-projects


Has this LLM been pretrained on this corpus? #23

Open lintool opened 8 months ago

lintool commented 8 months ago

Let's try to characterize data pollution - i.e., has this LLM been pretrained on this corpus?

Simple task: pick a random passage and chop it in half. Feed the first half into the LLM and ask it to complete the passage.
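A minimal sketch of the probe described above, under a few assumptions that are not part of this thread: `complete` is a hypothetical callable that wraps whatever LLM is being tested, and the score (longest word run shared verbatim with the true second half, as a fraction of that half) is just one simple choice of overlap measure. A high score is suggestive of memorization, not proof of it.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple


def split_passage(text: str) -> Tuple[str, str]:
    """Split a passage into two halves at the middle word boundary."""
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def longest_shared_run(reference: str, candidate: str) -> int:
    """Length, in words, of the longest contiguous word run found in both strings."""
    ref, cand = reference.split(), candidate.split()
    matcher = SequenceMatcher(None, ref, cand, autojunk=False)
    return matcher.find_longest_match(0, len(ref), 0, len(cand)).size


def completion_overlap(text: str, complete: Callable[[str], str]) -> float:
    """Feed the first half to `complete` and score its output against the real second half.

    `complete` is a hypothetical stand-in for an LLM call: it takes the prompt
    (the first half) and returns the generated continuation as a string.
    Returns a value in [0, 1]; 1.0 means the second half was reproduced verbatim.
    """
    prefix, expected = split_passage(text)
    generated = complete(prefix)
    expected_len = len(expected.split())
    if expected_len == 0:
        return 0.0
    return longest_shared_run(expected, generated) / expected_len
```

To use it, wrap the model under test in a function that takes a prompt string and returns the continuation, then average `completion_overlap` over a random sample of passages. Comparing that average against passages the model cannot have seen (for example, text written after its training cutoff) gives a baseline for how much overlap arises by chance.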

lintool commented 8 months ago

Relevant paper: https://arxiv.org/abs/2311.01964

AndreSlavescu commented 8 months ago

Training Data Extraction:

https://arxiv.org/abs/2311.17035

ASChampOmega commented 5 months ago

Early Literature Review:

Literature Review for Data Contamination.docx

lintool commented 5 months ago

This is the first passage in the MS MARCO passage corpus:

{"id": "0", "contents": "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated."}

Here's me playing around... https://chat.openai.com/share/8b19eeec-4be2-4a16-a1ad-12b68fad81f2

lintool commented 5 months ago

Investigating Data Contamination in Modern Benchmarks for Large Language Models https://arxiv.org/abs/2311.09783

crystina-z commented 5 months ago

some related papers: