bigscience-workshop / xmtf

Crosslingual Generalization through Multitask Finetuning
https://arxiv.org/abs/2211.01786
Apache License 2.0

Were the checkpoints selected based on the held-out performance or seen task performance? #10

Closed MattYoon closed 1 year ago

MattYoon commented 1 year ago

Hi, thank you again for your awesome work.

Your paper states that "We select the final checkpoint based on validation performance." Does "validation performance" here mean held-out task performance, or seen-task performance measured on the available validation subsets of those tasks?

It seems like there are mixed approaches in the literature: T0 checkpoints were picked solely based on seen-task performance, while Flan-T5 checkpoints were picked based on held-out performance.

When I first read your paper, I assumed the checkpoints were picked based on held-out performance, but I recently noticed that prepare_xp3_train.py saves the validation sets of seen tasks separately when they are available.

It would help us a lot if you could please provide additional information on this. Thank you.

Muennighoff commented 1 year ago

They were selected based on an average across seen tasks (but unseen datasets). You can find the exact list of datasets here:

- accuracy: https://github.com/bigscience-workshop/bigscience/blob/e848657707a549dda35c8b3cc63a96d2064b2983/evaluation/results/tr13/tzeroeval/convert_validation_7b1.slurm#L85
- BLEU: https://github.com/bigscience-workshop/bigscience/blob/e848657707a549dda35c8b3cc63a96d2064b2983/evaluation/results/tr13/tzeroeval/convert_validation_7b1.slurm#L302

Those are each visualized in Figure 7.
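For anyone replicating this, here is a minimal sketch of the selection rule as described above: pick the checkpoint with the best mean validation score across the listed datasets. This is not the authors' actual evaluation script; the `results` structure, dataset names, and numbers are hypothetical, and it assumes accuracy and BLEU scores are simply pooled into one average.

```python
# Illustrative sketch, not the authors' script.
# `results` maps checkpoint step -> {dataset_name: validation metric},
# where the metric is accuracy for classification-style datasets and
# BLEU for generation-style ones (hypothetical names and values below).

from statistics import mean


def select_checkpoint(results: dict[int, dict[str, float]]) -> int:
    """Return the checkpoint step with the highest average validation score."""
    return max(results, key=lambda step: mean(results[step].values()))


results = {
    1000: {"anli_r1": 33.4, "xnli_en": 45.1, "wmt14_fr_en": 18.2},
    2000: {"anli_r1": 35.0, "xnli_en": 47.3, "wmt14_fr_en": 19.0},
}
print(select_checkpoint(results))  # -> 2000
```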

MattYoon commented 1 year ago

Thank you for your quick response!