They were selected based on an average across seen tasks (but unseen datasets). You can find the exact list of datasets here (accuracy): https://github.com/bigscience-workshop/bigscience/blob/e848657707a549dda35c8b3cc63a96d2064b2983/evaluation/results/tr13/tzeroeval/convert_validation_7b1.slurm#L85 and here (BLEU): https://github.com/bigscience-workshop/bigscience/blob/e848657707a549dda35c8b3cc63a96d2064b2983/evaluation/results/tr13/tzeroeval/convert_validation_7b1.slurm#L302
Those are each visualized in Figure 7.
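For illustration, a minimal sketch of what selection by averaging could look like, assuming per-checkpoint validation scores have already been collected. All names and numbers below are made up for the example and are not the actual evaluation pipeline:

```python
# Toy sketch: pick the checkpoint with the best average validation score
# across seen tasks (evaluated on their held-out validation datasets).
# The data layout and values are illustrative only.
from statistics import mean

# results[checkpoint_step][dataset_name] = validation score (e.g. accuracy or BLEU)
results = {
    1000: {"dataset_a": 0.61, "dataset_b": 0.55, "dataset_c": 0.70},
    2000: {"dataset_a": 0.64, "dataset_b": 0.58, "dataset_c": 0.69},
    3000: {"dataset_a": 0.63, "dataset_b": 0.60, "dataset_c": 0.71},
}

def select_checkpoint(results):
    """Return the checkpoint step with the highest mean validation score."""
    return max(results, key=lambda step: mean(results[step].values()))

print(select_checkpoint(results))  # -> 3000 in this toy example
```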
Thank you for your quick response!
Hi, thank you again for your awesome work.
Your paper states that "We select the final checkpoint based on validation performance." Does the "validation performance" mean held-out performance, or seen task performance measured on their available eval subsets?
It seems like there are mixed approaches in the literature: T0 checkpoints were picked solely based on seen-task performance, while Flan-T5 checkpoints were picked based on held-out performance.
When I first read your paper, I assumed they were picked based on held-out performance, but I recently found that `prepare_xp3_train.py` saves seen-task validation sets separately when available. It would help us a lot if you could provide additional information on this. Thank you.