bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Prefix LM Eval #313

Open Muennighoff opened 2 years ago

Muennighoff commented 2 years ago

This PR adapts evaluation to work with Prefix LMs, such as those used for the T0 finetuning experiments.
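For context, here is a minimal sketch (not the code in this PR) of the attention pattern a prefix LM uses: tokens inside the prefix attend to each other bidirectionally, while everything after the prefix remains causal. The function name and shapes are illustrative only.

```python
import torch

def prefix_lm_attention_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Illustrative only: entry (i, j) is True if position i may attend to position j."""
    # Start from the standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Let the prefix (prompt) region attend bidirectionally.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: 6 tokens, the first 3 form the prefix.
print(prefix_lm_attention_mask(6, 3).int())
```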

Using the normal eval harness I get the following results:

- CHECKPOINT_PATH=$six_ALL_CCFRSCRATCH/checkpoints/tr11f-6B3-ml/checkpoints/main/global_step163750 (checkpoint prior to MTF): copa "acc": 0.58
- CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step2000: copa "acc": 0.7
- CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100: copa "acc": 0.67
- CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100, without --prefix: copa "acc": 0.73
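To make the comparison above concrete, here is a rough sketch of how the two settings differ when scoring a single answer candidate: with --prefix, the prompt tokens are masked bidirectionally; without it, the whole sequence stays causal. `model_forward`, `tok`, and the explicit 2D mask interface are assumptions for illustration, not the eval harness's actual API.

```python
import torch
import torch.nn.functional as F

def continuation_logprob(model_forward, tok, context, continuation, use_prefix):
    """Hypothetical scoring routine: log p(continuation | context) under either mask."""
    ctx_ids = tok(context)        # assumed: returns a list of token ids
    cont_ids = tok(continuation)
    input_ids = torch.tensor([ctx_ids + cont_ids])
    seq_len, prefix_len = input_ids.size(1), len(ctx_ids)

    # Causal mask everywhere; with --prefix the context region becomes bidirectional.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if use_prefix:
        mask[:prefix_len, :prefix_len] = True

    logits = model_forward(input_ids, mask)  # assumed signature for illustration
    # The continuation tokens are scored causally in both settings.
    logprobs = F.log_softmax(logits[0, prefix_len - 1 : seq_len - 1], dim=-1)
    targets = torch.tensor(cont_ids)
    return logprobs.gather(-1, targets.unsqueeze(-1)).sum().item()
```

COPA accuracy then comes from picking the candidate with the higher score; the runs above only differ in the checkpoint and in whether the prefix mask is applied.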

Muennighoff commented 2 years ago

cc @lintangsutawika @haileyschoelkopf - I can't add you as reviewers somehow, but it would be great if you could take a look. I'm not 100% sure about the results I got 🧐

lintangsutawika commented 2 years ago

Will take a closer look.

lintangsutawika commented 2 years ago

@Muennighoff so the intended result is that performance with the Prefix-LM should be higher, right? However, based on the scores you shared, this does not seem to be the case.

Muennighoff commented 2 years ago

Yeah, so according to the current results, evaluating the model as a causal LM is better than evaluating it as a prefix LM, even after it was fine-tuned as a prefix LM. Also note: