bigscience-workshop / xmtf

Crosslingual Generalization through Multitask Finetuning
https://arxiv.org/abs/2211.01786
Apache License 2.0

Were the checkpoints selected based on the held-out performance or seen task performance? #10

Closed MattYoon closed 1 year ago

MattYoon commented 1 year ago

Hi, thank you again for your awesome work.

Your paper states that "We select the final checkpoint based on validation performance." Does "validation performance" here mean held-out task performance, or seen-task performance measured on the available validation subsets of those tasks?

It seems like there are mixed approaches in the literature: T0 checkpoints were picked solely based on seen-task performance, while Flan-T5 checkpoints were picked based on held-out performance.

When I first read your paper, I assumed the checkpoints were picked based on held-out performance, but I recently noticed that prepare_xp3_train.py saves the validation sets of seen tasks separately when they are available.

It would help us a lot if you could please provide additional information on this. Thank you.

Muennighoff commented 1 year ago

They were selected based on an average across seen tasks (but unseen datasets). You can find the exact list of datasets here:

- accuracy: https://github.com/bigscience-workshop/bigscience/blob/e848657707a549dda35c8b3cc63a96d2064b2983/evaluation/results/tr13/tzeroeval/convert_validation_7b1.slurm#L85
- BLEU: https://github.com/bigscience-workshop/bigscience/blob/e848657707a549dda35c8b3cc63a96d2064b2983/evaluation/results/tr13/tzeroeval/convert_validation_7b1.slurm#L302

Those are each visualized in Figure 7.
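For anyone replicating this, here is a minimal sketch of the selection rule as described above: pick the checkpoint with the best mean validation score across the listed datasets. This is not the authors' actual evaluation script; the `results` structure, dataset names, and numbers are hypothetical, and it assumes accuracy and BLEU scores are simply pooled into one average.

```python
# Illustrative sketch, not the authors' script.
# `results` maps checkpoint step -> {dataset_name: validation metric},
# where the metric is accuracy for classification-style datasets and
# BLEU for generation-style ones (hypothetical names and values below).

from statistics import mean


def select_checkpoint(results: dict[int, dict[str, float]]) -> int:
    """Return the checkpoint step with the highest average validation score."""
    return max(results, key=lambda step: mean(results[step].values()))


results = {
    1000: {"anli_r1": 33.4, "xnli_en": 45.1, "wmt14_fr_en": 18.2},
    2000: {"anli_r1": 35.0, "xnli_en": 47.3, "wmt14_fr_en": 19.0},
}
print(select_checkpoint(results))  # -> 2000
```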

MattYoon commented 1 year ago

Thank you for your quick response!