facebookresearch / metaseq

Repo for external large-scale work
MIT License

Inconsistency between valid ppl from TensorBoard and eval_lm.py #17

Open danielsimig opened 2 years ago

danielsimig commented 2 years ago

During the training of a 125M model I observe a relatively smooth valid ppl curve, with some minor jumps. For example, between steps 100K and 156K of the training, the valid/redditflattened ppl shown on TensorBoard goes from ~40 to ~39.

If I run eval_lm.py on those two snapshots on the very same validation sets (local changes: P490864569, command and output: P490866721), I get very different numbers: 45 and 630 for the consolidated checkpoints from steps 100K and 156K, respectively.

When I run gpt3_eval, average perplexities on the correct prompt ("ppl_answer_correct_gold" field from the results json) follow a pattern similar to eval_lm: they go from ~200 to ~1600 between these two checkpoints.

We reproduced the same results on AWS and Azure independently with @punitkoura and at this point we're unsure what's going on.

We would like to either: 1) have someone familiar with the code confirm that this does not impact model evaluation with gpt3_eval, in which case this is a low-priority issue, or 2) have someone help debug why this is happening and potentially fix any inconsistency.
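As a rough illustration of how both numbers relate to the underlying mean loss - a minimal sketch, assuming (as in fairseq-style trainers) that the logged training loss is base-2 cross-entropy per token; the exact normalization in each code path is what's in question here:

```python
import math

# Both TensorBoard and eval_lm derive ppl from a mean per-token loss, so the
# two numbers are only comparable if the log base and the token count in the
# denominator (padding, eos, document boundaries) match exactly.
def ppl_from_loss(loss_per_token: float, base: float = 2.0) -> float:
    return base ** loss_per_token

print(ppl_from_loss(5.3))           # ~39  -- matches the TensorBoard curve
print(ppl_from_loss(5.3, math.e))   # ~200 -- the same loss read as nats
```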

tbmihailov commented 2 years ago

The gptz models use different eos and other special tokens, and the streaming LM task has some of them hardcoded, I believe. For evaluation we had a similar problem, and I implemented this hacky task, language_modeling_inference_for_models_trained_with_streaming. You can change the streaming task to this one or make eval_lm work with the streaming LM task.
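A toy sketch of the failure mode this would cause - token names and ids below are made up for illustration, not metaseq's actual defaults:

```python
# If eval_lm builds its dictionary with different special tokens (or a
# different order) than the streaming training task, every regular token id
# shifts and the loss is computed against the wrong rows of the output
# projection, which inflates perplexity dramatically.
train_specials = ["<s>", "<pad>", "</s>", "<unk>"]  # ids 0..3 at training time
eval_specials = ["<pad>", "</s>", "<unk>"]          # assumed different set at eval time

train_vocab = {tok: i for i, tok in enumerate(train_specials + ["the", "cat"])}
eval_vocab = {tok: i for i, tok in enumerate(eval_specials + ["the", "cat"])}

print(train_vocab["the"], eval_vocab["the"])  # 4 vs 3 -- every id is off by one
```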

punitkoura commented 2 years ago

The gptz models use different eos and other special tokens, and the streaming LM task has some of them hardcoded, I believe. For evaluation we had a similar problem, and I implemented this hacky task, language_modeling_inference_for_models_trained_with_streaming. You can change the streaming task to this one or make eval_lm work with the streaming LM task.

Ohh good point! I knew we use this task in evals, but was not sure about the reason. Thanks for the context @tbmihailov!

danielsimig commented 2 years ago

language_modeling_inference_for_models_trained_with_streaming works with the old indexed dataset format rather than the JSONL-based one that streaming language modeling uses, so testing eval_lm with that task is non-trivial. At the same time, this could indeed explain the difference we're seeing here.
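For concreteness, a minimal sketch of what the streaming-side data looks like - the shard path and the "text" field name are assumptions for illustration:

```python
import json

# The streaming task consumes newline-delimited JSON (one document per line,
# tokenized on the fly), whereas the legacy language_modeling-style tasks
# expect a pre-binarized indexed dataset (.bin/.idx files plus a dictionary).
with open("valid/redditflattened/00.jsonl") as f:
    docs = [json.loads(line)["text"] for line in f]

print(f"{len(docs)} raw-text documents, no .bin/.idx files involved")
```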

So given my limited bandwidth, I'm leaning towards keeping this as low-pri - unless someone has an easy fix for using this task in eval_lm with a JSONL dataset?

stephenroller commented 2 years ago

Context from 1:1 chat

[image attachment]

danielsimig commented 2 years ago

Here is one unsuccessful attempt at understanding this issue:

I could not find any obvious difference apart from the fact that the texts were shuffled differently - which definitely doesn't explain the huge differences I mentioned earlier.
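One way to make that comparison concrete - a rough debugging sketch, with hypothetical function names and file paths:

```python
import torch

# Dump the first validation batch produced by each code path (train.py vs
# eval_lm) and diff the token ids, to rule out the data pipeline rather than
# the model weights.
def dump_batch(tokens: torch.Tensor, path: str) -> None:
    torch.save(tokens.cpu(), path)

def compare_batches(path_a: str, path_b: str) -> None:
    a, b = torch.load(path_a), torch.load(path_b)
    if a.shape != b.shape:
        print(f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}")
    else:
        print(f"{(a != b).float().mean().item():.2%} of token ids differ")
```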

danielsimig commented 2 years ago

The gptz models use different eos and other special tokens, and the streaming LM task has some of them hardcoded, I believe. For evaluation we had a similar problem, and I implemented this hacky task, language_modeling_inference_for_models_trained_with_streaming. You can change the streaming task to this one or make eval_lm work with the streaming LM task.

For the record, this was discussed offline and we concluded this is not the issue; using the same task for eval_lm as the one used at training time (streaming_language_modeling) should be the right approach.

punitkoura commented 2 years ago

I spent two days getting eval_lm to run the same code path as train.py, but I am still getting different results compared to the training logs.

Script - P496620919

Command

[punitkoura@ip-0A1E0404 metaseq](main)$ srun python metaseq_cli/eval_lm.py /data/xlmg/gptz/corpus_dedup_10_10_1_0.05_exp29/ --path $model_path --batch-size 4 --tokens-per-sample 2048 --valid-subset valid/redditflattened --task streaming_language_modeling --vocab-filename /data/xlmg/gptz/tokenizers/gpt2-vocab.json --merges-filename /data/xlmg/gptz/tokenizers/gpt2-merges.txt --criterion vocab_parallel_cross_entropy