EleutherAI / project-menu


[RFP] Does the order of training data influence memorization? #11

Closed · StellaAthena closed this 1 year ago

StellaAthena commented 3 years ago

Background

It’s pretty well known that neural networks, including transformers, can “memorize” data (Intro blog post, Paper 1, Paper 2). This can lead to transformers regurgitating exact copies of texts in the training data, a phenomenon that can cause legal and ethical problems, as well as compromising data privacy.

Core Idea

It seems intuitive that there should be a correlation between how long ago a model saw a piece of data and how well that data is memorized.

What to plot?

Take GPT-Neo 1.3B, GPT-J, or some other autoregressive transformer with known training data and prompt it with the first 20 tokens of each document in its train set. Count the number of subsequent tokens correctly reproduced and plot # reproduced vs the position of the document in the training data. There are a bunch of other metrics we can look at too, but this seems like a good one to start with.
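Below is a minimal sketch of this metric (an illustrative assumption, not an agreed-upon implementation): it loads a Hugging Face checkpoint, greedily decodes 20 tokens from a 20-token prompt, and counts verbatim matches against the document's true continuation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")
model.eval()

def count_reproduced_tokens(text, prompt_len=20, continuation_len=20):
    """Prompt with the first `prompt_len` tokens of a document and count how many
    of the next `continuation_len` tokens are reproduced verbatim (greedy decoding)."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.shape[0] < prompt_len + continuation_len:
        return None  # document too short to score
    prompt = ids[:prompt_len].unsqueeze(0)
    target = ids[prompt_len:prompt_len + continuation_len]
    with torch.no_grad():
        out = model.generate(prompt, max_new_tokens=continuation_len, do_sample=False)
    generated = out[0, prompt_len:]                      # tokens after the prompt
    n = min(generated.shape[0], target.shape[0])
    return int((generated[:n] == target[:n]).sum())

# Plot count_reproduced_tokens(doc) against the document's index in training order.
```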

Related Papers/Frameworks

GitHub Repo

https://github.com/StellaAthena/transformer-memorization

bhadreshpsavani commented 3 years ago

Hi, this seems like an interesting task to me; I would like to work on it!

StellaAthena commented 3 years ago

Hi, this seems like an interesting task to me; I would like to work on it!

Awesome! You should be able to get started on Google Colab, though we'll need to use a computing cluster for larger models if we don't want it to take forever. Have you used any of our code before? The first step will probably simply be getting used to running the code base.

bhadreshpsavani commented 3 years ago

I once fine-tuned a smaller GPT-Neo model for a text generation task. I will start by creating a Colab notebook.

CurtisASmith commented 3 years ago

I've been keeping this idea in the back of my head for a few weeks, and I have some thoughts.

"Take GPT-Neo 1.3B, GPT-J, or some other autoregressive transformer with known training data and prompt it with the first 20 tokens of each document in its train set". While we do know the training data, we actually know which order each particular model read the data, "the position of the document in the training data"? From what I understand, the dataset was shuffled before training those models, correct me if I'm wrong.

The graph I have attached shows two GPT-J models fine-tuned with the same data and hyperparameters; the only difference is that the data was shuffled for the orange model. It demonstrates that knowing the order of the documents during training is likely to be vitally important for interpreting the results. Assuming we cannot figure out the actual document training order, might it be appropriate to use variant fine-tuned models for this experiment?

[Figure: fine-tuning loss curves for the two runs; orange = shuffled data, green = unshuffled data]

StellaAthena commented 3 years ago

I've been keeping this idea in the back of my head for a few weeks, and I have some thoughts.

"Take GPT-Neo 1.3B, GPT-J, or some other autoregressive transformer with known training data and prompt it with the first 20 tokens of each document in its train set". While we do know the training data, we actually know which order each particular model read the data, "the position of the document in the training data"? From what I understand, the dataset was shuffled before training those models, correct me if I'm wrong.

It was shuffled, but we saved the shuffled data post-shuffle. We know exactly what order tokens were fed into both GPT-Neo and GPT-J.

The graph I have attached shows two GPT-J models fine-tuned with the same data and hyperparameters; the only difference is that the data was shuffled for the orange model. It demonstrates that knowing the order of the documents during training is likely to be vitally important for interpreting the results. Assuming we cannot figure out the actual document training order, might it be appropriate to use variant fine-tuned models for this experiment?

I'm not sure I understand this plot. Are you shuffling at the token or document level? What was the data you finetuned on? If you shuffled at the document level (shuffling at a finer level is wrong because it destroys intratext structure) it's hard for me to believe that shuffling caused such a dramatic effect on prediction. If you shuffle the data again, does it cling closely to the orange curve or does it again show great variance?

StellaAthena commented 3 years ago

I once fine-tuned a smaller GPT-Neo model for a text generation task. I will start by creating a Colab notebook.

Hey @bhadreshpsavani have you been able to make any progress on this?

CurtisASmith commented 3 years ago

It was shuffled, but we saved the shuffled data post-shuffle. We know exactly what order tokens were fed into both GPT-Neo and GPT-J.

My concern is unfounded then, glad to hear it.

I'm not sure I understand this plot. Are you shuffling at the token or document level? What was the data you finetuned on? If you shuffled at the document level (shuffling at a finer level is wrong because it destroys intratext structure) it's hard for me to believe that shuffling caused such a dramatic effect on prediction. If you shuffle the data again, does it cling closely to the orange curve or does it again show great variance?

The data is shuffled at a document level with this script. The dataset in question has three broad categories of documents: fiction, metafiction (an imprecise term, mostly metadata-tagged stories, writing prompts with responses, and analyses of fiction), and non-fiction. Without shuffling, the model saw documents from those categories in that order; with shuffling, the distribution should be randomized. My assumption was that the downward loss spikes on the green graph roughly correlate with a change in the data domain. Intuitively it makes sense to me that the more uniformly distributed data had a smoother graph, but you also have considerably more experience than I... so, if that doesn't explain it and you're still curious, I can try shuffling and training again.
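For concreteness, here is a minimal illustration of what document-level shuffling means in this context. This is not the referenced script; the .jsonl layout and function name are assumptions.

```python
import json
import random

def shuffle_documents(in_path, out_path, seed=0):
    """Shuffle a .jsonl file at the document level: whole documents are permuted,
    but the text inside each document is left untouched."""
    with open(in_path) as f:
        docs = [json.loads(line) for line in f]   # one JSON document per line
    random.Random(seed).shuffle(docs)
    with open(out_path, "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")
```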

bhadreshpsavani commented 3 years ago

Hi @StellaAthena, I got busy with some personal things; I will make progress this week.

leogao2 commented 3 years ago

@CurtisASmith Is this plot train or val? If the former, can you post the val plot? I think it would be important to see that, since obviously the training data distribution varies throughout the plot and makes a comparison of the models' quality difficult. (Even better would be to run it on eval harness and see downstream task performance)

StellaAthena commented 3 years ago

Hi @StellaAthena, I got busy with some personal things; I will make progress this week.

No worries! We are all working in our free time, and I understand that life takes precedence.

Any updates?

dschonholtz commented 3 years ago

I'd be interested in giving this a shot. Do we have the order of the input training data for the large pretrained models, or are we going to have to train everything from scratch and record the order of the input training data as we go? Or do we have the seed that was used for the random shuffle?

bhadreshpsavani commented 3 years ago

Hi @dschonholtz, please go ahead with your analysis; it's the kind of task where multiple people can share their findings. I am stuck with a few personal things.

I have a query as well!

If we use the Pile data, how can we sample the data for our analysis? Since it's a huge dataset, we can't really download it on a local machine or on Colab.

StellaAthena commented 3 years ago

@dschonholtz

Do we have the order of the input training data for the large pretrained models, or are we going to have to train everything from scratch and record the order of the input training data as we go? Or do we have the seed that was used for the random shuffle?

The Pile comes pre-shuffled, and was not shuffled additionally before training.

If we use the Pile data, how can we sample the data for our analysis? Since it's a huge dataset, we can't really download it on a local machine or on Colab.

You can find the data broken into 16 GB chunks here. The files are numbered sequentially in training order, so first we trained on 0.jsonl.zst, then 1.jsonl.zst, etc.

bhadreshpsavani commented 3 years ago

Hi @StellaAthena, even one chunk of data gives an OOM error on Colab. Is there any other way to get a smaller sample than this 14 GB of data?

StellaAthena commented 3 years ago

Hi @StellaAthena, even one chunk of data gives an OOM error on Colab. Is there any other way to get a smaller sample than this 14 GB of data?

@bhadreshpsavani what is a good size chunk to use? 1 GB?

bhadreshpsavani commented 3 years ago

Hi @StellaAthena, 1 GB will work, I guess! For analysis, 500 MB will also be fine.
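One possible way to carve out a sample of that size without loading a whole shard into memory is to stream-decompress it and stop after a byte budget. This is only a sketch; the helper name and the zstandard-based approach are assumptions, not what was actually used.

```python
import io
import zstandard as zstd

def sample_shard(shard_path, out_path, budget_bytes=500_000_000):
    """Copy documents from a .jsonl.zst shard into a plain .jsonl file,
    stopping once roughly `budget_bytes` of text has been written."""
    written = 0
    with open(shard_path, "rb") as fh, open(out_path, "w", encoding="utf-8") as out:
        reader = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
        for line in reader:               # one JSON document per line, in training order
            out.write(line)
            written += len(line)
            if written >= budget_bytes:
                break

# e.g. sample_shard("0.jsonl.zst", "pile_sample_500mb.jsonl")
```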

dschonholtz commented 3 years ago

Colab was giving me issues similar to the ones described above, so I'm running things locally. I'm making a few assumptions; since I'm a novice, I'd appreciate it if you could tell me whether I'm going off wildly in the wrong direction.

Assumptions:

If it is known what param configurations I should start with on any of the above, let me know. In the meantime I am currently configuring my data and writing some code to automate this process.

The code is pretty simple, since I'm just running the prebuilt Hugging Face transformer.

uSaiPrashanth commented 3 years ago

Hello, after working on this for a while, I came up with the following notebooks. Logits without shuffling:

Logits after shuffling:

and finally, the EDA

From this, I interpret that the order of training does not influence memorization, and that the variation in the mean and variance of the logits is indeed within the margin of error.

uSaiPrashanth commented 3 years ago

Here are the datasets I created in the process: the Pile and, finally, the logits.

Do note that these are the logits of the gpt-neo-1.3B model.

uSaiPrashanth commented 3 years ago

Another point worth noting is that documents with fewer than 40 tokens were skipped.

StellaAthena commented 3 years ago

Hello, after working on this for a while, I came up with the following notebooks. Logits without shuffling:

Logits after shuffling:

and finally, the EDA

From this, I interpret that the order of training does not influence memorization, and that the variation in the mean and variance of the logits is indeed within the margin of error.

This is probably a stupid question, but I’ve had a migraine for 48 hours, so bear with me… Can you walk me through what the code you just shared does? My original ask was for the number of correctly reproduced continuation tokens as a function of the index position within the training data; it looks like this might be the logits for the next 20 tokens instead? But what role does shuffling the data play?

uSaiPrashanth commented 3 years ago

Your interpretation of the notebooks is right. I used the first 20 tokens as the prompt and counted how many of the next 20 tokens were correctly predicted. The reasoning behind that was your statement: "Count the number of subsequent tokens correctly reproduced and plot # reproduced vs the position of the document in the training data."

uSaiPrashanth commented 3 years ago

I initially assumed, from the statement "order of the training data", that you wanted to know whether the order in which each document was trained on had an influence on its predictions. Thus, I evaluated it twice, with and without shuffling, and found that the ordering of documents indeed doesn't matter.

StellaAthena commented 3 years ago

You said

I initially assumed, from the statement "order of the training data", that you wanted to know whether the order in which each document was trained on had an influence on its predictions. Thus, I trained it twice, with and without shuffling, and found that the ordering of documents indeed doesn't matter.

Are you saying that you trained a 1.3B model, from scratch, on the Pile twice? And that on average the same number of tokens are memorized regardless of training order?

uSaiPrashanth commented 3 years ago

No, I just evaluated them, and on average the same number of tokens is memorized regardless of evaluation order.

dschonholtz commented 3 years ago

@uSaiPrashanth correct me if I'm wrong, but this is what I understand the current game plan to be. You are currently running your Kaggle notebook again, this time referencing the 29.jsonl.zst file. That way you will have a measure of how many tokens are repeated after the first 20 tokens are given as input, all else being equal, for 00.jsonl.zst and 29.jsonl.zst. This is all with the 1.3B model.

Meanwhile, I am doing a very similar experiment with the 2.7B model locally while varying temperature. The hope is to have data for both the 1.3B and 2.7B models, and to see whether temperature has a drastic impact on making a model regurgitate data, in case we need to adjust the temperature to get a statistically large enough sample of data points being generated verbatim.

Let me know if that makes sense or if I am missing anything.
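For reference, here is a hedged sketch of such a temperature sweep (not the actual notebook code; the function name and sampling settings are assumptions): greedy decoding is replaced by sampling with do_sample=True at each temperature.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
model.eval()

def temperature_sweep(text, temperatures=(1.0, 0.7, 0.4), prompt_len=20, continuation_len=20):
    """For each temperature, sample a continuation of the first `prompt_len` tokens
    and count how many of the true next `continuation_len` tokens it reproduces."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    if ids.shape[0] < prompt_len + continuation_len:
        return None  # document too short to score
    prompt = ids[:prompt_len].unsqueeze(0)
    target = ids[prompt_len:prompt_len + continuation_len]
    counts = {}
    for temp in temperatures:
        with torch.no_grad():
            out = model.generate(prompt, max_new_tokens=continuation_len,
                                 do_sample=True, temperature=temp)
        generated = out[0, prompt_len:]
        n = min(generated.shape[0], target.shape[0])
        counts[temp] = int((generated[:n] == target[:n]).sum())
    return counts
```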

uSaiPrashanth commented 3 years ago

After evaluating gpt-neo-1.3B on the 0th shard, I then evaluated it on the 29th shard.

The following EDA was obtained after evaluating it on the 29th shard. From what we can see, I think we can conclude that, for the gpt-neo-1.3B model, the order of training does not influence memorization. However, there does seem to be an increase in the effective score of the model (2.827240506329114 to 2.868052738336714) and a decrease in variance (from 21.35283770165037 to 21.30135259762434).

This can reasonably be assumed to be randomness associated with the model, but it could also mean that there will be a significant difference in scores when this experiment is repeated on larger language models.

dschonholtz commented 3 years ago

I have results for the 2.7B model. They aren't much different. A very hacky notebook is here: https://github.com/dschonholtz/EleutherProjects/blob/main/HuggingFaceSimple.ipynb

Pretty similar numbers on my end.

Temp    Mean (shard 0)    Mean (shard 29)
1.0     2.2071            2.3036
0.7     2.5679            2.7714
0.4     2.7679            2.9071

The std dev ranges from 4.2 to 4.61. The fact that the means do creep up a little bit at every temperature, in my experiment and in @uSaiPrashanth's, does make me wonder if there is a very small causal effect here that isn't entirely randomness. But I can't think of a distribution or p-value that would make this even remotely statistically significant.

One idea is that each individual prompt has its own distribution of how likely it is to be regurgitated. The vast majority of prompts are extremely unlikely to be regurgitated, but a small subset of them could be massively influenced by how recently they were introduced in the training data. I think this would be fairly hard to test, though.
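If anyone wants to put a p-value on that gap, here is a minimal sketch under the assumption that per-document reproduced-token counts from the two shard evaluations are available; the function and argument names are illustrative.

```python
from scipy import stats

def compare_shards(scores_shard0, scores_shard29):
    """Two-sample tests on per-document reproduced-token counts.
    Counts are heavily skewed toward zero, so a rank-based test is
    reported alongside Welch's t-test."""
    welch = stats.ttest_ind(scores_shard0, scores_shard29, equal_var=False)
    mw = stats.mannwhitneyu(scores_shard0, scores_shard29, alternative="two-sided")
    return {"welch_p": welch.pvalue, "mannwhitney_p": mw.pvalue}
```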

StellaAthena commented 3 years ago

Now that we have some working demos, I made a GitHub repo to organize code: https://github.com/StellaAthena/transformer-memorization

StellaAthena commented 2 years ago

@uSaiPrashanth has gotten a preliminary answer of "no" for GPT-J. Our next step is to translate the analysis to working on the GPT-NeoX codebase (possibly using the LM Eval Harness) so we can explore more model types.