bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Prefix LM Eval #313

Open Muennighoff opened 2 years ago

Muennighoff commented 2 years ago

This PR adapts evaluation to work with Prefix LMs, such as those used for the T0 finetuning experiments.
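For context, here is a minimal sketch (not the code in this PR) of the attention pattern a prefix LM uses: tokens inside the prefix attend to each other bidirectionally, while everything after the prefix remains causal. The function name and shapes are illustrative only.

```python
import torch

def prefix_lm_attention_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Illustrative only: entry (i, j) is True if position i may attend to position j."""
    # Start from the standard causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Let the prefix (prompt) region attend bidirectionally.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: 6 tokens, the first 3 form the prefix.
print(prefix_lm_attention_mask(6, 3).int())
```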

Using the normal eval harness I get the following results:

- CHECKPOINT_PATH=$six_ALL_CCFRSCRATCH/checkpoints/tr11f-6B3-ml/checkpoints/main/global_step163750 (checkpoint prior to MTF): copa "acc": 0.58
- CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step2000: copa "acc": 0.7
- CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100: copa "acc": 0.67
- CHECKPOINT_PATH=/gpfsscratch/rech/six/commun/checkpoints/tr13f-6B3-ml-t0/checkpoints/prefix/global_step3100, without --prefix: copa "acc": 0.73
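To make the comparison above concrete, here is a rough sketch of how the two settings differ when scoring a single answer candidate: with --prefix, the prompt tokens are masked bidirectionally; without it, the whole sequence stays causal. `model_forward`, `tok`, and the explicit 2D mask interface are assumptions for illustration, not the eval harness's actual API.

```python
import torch
import torch.nn.functional as F

def continuation_logprob(model_forward, tok, context, continuation, use_prefix):
    """Hypothetical scoring routine: log p(continuation | context) under either mask."""
    ctx_ids = tok(context)        # assumed: returns a list of token ids
    cont_ids = tok(continuation)
    input_ids = torch.tensor([ctx_ids + cont_ids])
    seq_len, prefix_len = input_ids.size(1), len(ctx_ids)

    # Causal mask everywhere; with --prefix the context region becomes bidirectional.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    if use_prefix:
        mask[:prefix_len, :prefix_len] = True

    logits = model_forward(input_ids, mask)  # assumed signature for illustration
    # The continuation tokens are scored causally in both settings.
    logprobs = F.log_softmax(logits[0, prefix_len - 1 : seq_len - 1], dim=-1)
    targets = torch.tensor(cont_ids)
    return logprobs.gather(-1, targets.unsqueeze(-1)).sum().item()
```

COPA accuracy then comes from picking the candidate with the higher score; the runs above only differ in the checkpoint and in whether the prefix mask is applied.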

Muennighoff commented 2 years ago

cc @lintangsutawika @haileyschoelkopf - I can't add you as reviewers somehow, but it would be great if you could take a look. I'm not 100% sure about the results I got 🧐

lintangsutawika commented 2 years ago

Will take a closer look.

lintangsutawika commented 2 years ago

@Muennighoff so the intended result is that performance with the Prefix-LM should be higher, right? However, based on the scores you shared, this does not seem to be the case.

Muennighoff commented 2 years ago

Yeah, so according to the current results, evaluating the model as a causal LM is better than evaluating it as a prefix LM, even after it was fine-tuned as a prefix LM. Also note: