bigscience-workshop / t-zero

Reproduce results and replicate training of T0 (Multitask Prompted Training Enables Zero-Shot Task Generalization)
Apache License 2.0

run_eval changes the results depending on the batch size #29

Closed kkawamu1 closed 2 years ago

kkawamu1 commented 2 years ago

System info:

I used Google Colab free to test the evaluation code.

Reproduction:

```
!git clone https://github.com/bigscience-workshop/t-zero.git
cd ./t-zero/
pip install -e .
!python ./evaluation/run_eval.py --dataset_name super_glue --dataset_config_name cb --template_name "GPT-3 style" --model_name_or_path gpt2 --output_dir ./debug --per_device_eval_batch_size 1
!python ./evaluation/run_eval.py --dataset_name super_glue --dataset_config_name cb --template_name "GPT-3 style" --model_name_or_path gpt2 --output_dir ./debug --per_device_eval_batch_size 2
```

Expected behavior:

The accuracy scores for the two runs of the evaluation script should be identical, i.e. the script should report the same accuracy regardless of the batch size.

However, I get `Result: {'accuracy': 0.39285714285714285}` for batch size 1, but `Result: {'accuracy': 0.4107142857142857}` for batch size 2.

I suspect this has to do with how the padding is handled in DecoderModel. https://github.com/bigscience-workshop/t-zero/blob/master/t0/model.py#L93

When the batch size is greater than 1, shorter texts are padded to the length of the longest text in the batch, i.e. some elements of `batch["input_ids"]` contain pad tokens. Therefore, when `input_ids` and `labels` are concatenated with `"input_ids": torch.cat([batch["input_ids"], batch["labels"]], dim=-1)`, some of the final inputs to the DecoderModel look like `T-zero is awesome. Is this true or false? <pad><pad><pad>True`. I see that this is supposed to be handled by setting the position ids appropriately: `position_ids = torch.cumsum(model_inputs["attention_mask"].to(torch.long), dim=-1) - 1`.
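
For reference, here is a small, self-contained sketch of the pattern I am describing (the toy batch, the pad id 0, and the way the label attention mask is built are my own assumptions for illustration, not a verbatim copy of `t0/model.py`):

```python
import torch

# Toy batch standing in for what the eval collator produces: two examples,
# the first one shorter and right-padded with pad id 0.
batch = {
    "input_ids": torch.tensor([[11, 12, 0, 0], [21, 22, 23, 24]]),
    "attention_mask": torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]]),
    "labels": torch.tensor([[31], [41]]),
}

model_inputs = {
    # Prompt and answer choice are concatenated into one sequence, so the pad
    # tokens of the shorter example end up *between* the prompt and the label.
    "input_ids": torch.cat([batch["input_ids"], batch["labels"]], dim=-1),
    "attention_mask": torch.cat(
        [batch["attention_mask"], torch.ones_like(batch["labels"])], dim=-1
    ),
}
# Pads have attention_mask == 0, so cumsum - 1 yields position ids that do not
# advance over them: the label token reuses the position right after the last
# real prompt token.
position_ids = torch.cumsum(model_inputs["attention_mask"].to(torch.long), dim=-1) - 1
print(model_inputs["input_ids"])  # [[11, 12,  0,  0, 31], [21, 22, 23, 24, 41]]
print(position_ids)               # [[ 0,  1,  1,  1,  2], [ 0,  1,  2,  3,  4]]
```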

What I found is that this "pad in the middle and adjust `position_ids`" strategy does NOT give the same result as having no padding in the middle. Please see: https://colab.research.google.com/drive/1-Bw3-ODDLrEvP75xIzC8wlJmvB7mQqTg?usp=sharing

In short, the logits for the first token of the label differ when there are pad tokens in the middle.
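
Here is a minimal standalone illustration of that effect with GPT-2 (my own sketch, not the linked notebook; the prompt/label strings, the number of pad tokens, and the tolerance are arbitrary choices):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = tokenizer("T-zero is awesome. Is this true or false?", return_tensors="pt").input_ids
label = tokenizer(" True", return_tensors="pt").input_ids
pads = torch.full((1, 3), tokenizer.pad_token_id, dtype=torch.long)

# Case A: prompt followed directly by the label.
ids_a = torch.cat([prompt, label], dim=-1)
mask_a = torch.ones_like(ids_a)

# Case B: pad tokens inserted between prompt and label, with position ids
# derived from the attention mask as in the evaluation code.
ids_b = torch.cat([prompt, pads, label], dim=-1)
mask_b = torch.cat(
    [torch.ones_like(prompt), torch.zeros_like(pads), torch.ones_like(label)], dim=-1
)
pos_b = torch.cumsum(mask_b, dim=-1) - 1

with torch.no_grad():
    logits_a = model(input_ids=ids_a, attention_mask=mask_a).logits
    logits_b = model(input_ids=ids_b, attention_mask=mask_b, position_ids=pos_b).logits

# The logits used to score the first label token come from the position just
# before it: the last prompt token in case A, but the last *pad* token in case B.
score_a = logits_a[0, prompt.shape[1] - 1]
score_b = logits_b[0, prompt.shape[1] + pads.shape[1] - 1]
print(torch.allclose(score_a, score_b, atol=1e-4))  # typically False, matching the observation above
```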

Note: when `batch_size == 1`, there will be no padding tokens in `batch["input_ids"]`, since all the sentences in `batch["input_ids"]` are the same, so no special handling happens there. I therefore assume the `batch_size=1` run gives the correct number.

VictorSanh commented 2 years ago

Hi @kkawamu1, yes, you are correct. @richardbaihe noticed the same thing: https://github.com/bigscience-workshop/t-zero/issues/27. Would you like to open a PR to fix that? I won't have the bandwidth to take a look before next week.

VictorSanh commented 2 years ago

👀 @thomasw21