alexa / massive

Tools and Modeling Code for the MASSIVE dataset

Will the rule "dev splits of the MASSIVE dataset may not be used for model training" make the competition unfair for individual contestants? #32

Open yichaopku opened 2 years ago

yichaopku commented 2 years ago

Though dev splits of the MASSIVE dataset may not be used for model training, they can still be used for hyperparameter tuning. Tuning hyperparameters with an effective hyperparameter search algorithm is, to some extent, training on the dev split, especially for contestants with many GPUs. So for the full-dataset competition, contestants with more GPUs can effectively use more training data (the dev split), and for the zero-shot competition, contestants with more GPUs can indirectly use non-English labelled data (the dev split). This may be unfair to individual contestants (those without enough GPU resources) compared with contestants who represent a lab or company (which usually have more GPU resources).

It might be better, and fairer for everyone, to merge the train and dev splits into a single train split and let contestants split it themselves into their own train and dev sets for hyperparameter tuning.
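
For concreteness, here is a minimal sketch of the proposed re-split, assuming each split can be read from a JSONL file into a list of example dicts; the file names and the `load_split` helper below are illustrative placeholders, not part of the MASSIVE tooling.

```python
# Hypothetical sketch: merge the official train and dev examples, then carve out
# a fresh dev set for hyperparameter tuning, so the official dev split never has
# to be used for model selection directly.

import json
import random


def load_split(path):
    """Read one JSONL file into a list of example dicts (illustrative helper)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def merge_and_resplit(train_path, dev_path, dev_fraction=0.1, seed=42):
    """Merge official train+dev, then re-split into custom train/dev sets."""
    examples = load_split(train_path) + load_split(dev_path)
    random.Random(seed).shuffle(examples)
    n_dev = int(len(examples) * dev_fraction)
    return examples[n_dev:], examples[:n_dev]  # (custom_train, custom_dev)


# Illustrative file names only; point these at the actual dataset files.
custom_train, custom_dev = merge_and_resplit("en-US.train.jsonl", "en-US.dev.jsonl")
```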

jgmf-amazon commented 2 years ago

Hi @yichaopku, greetings. Yes, these are all excellent points, and we debated this topic extensively prior to launch. Ultimately we decided to maintain backward compatibility with SLURP, particularly so researchers could use its English audio data in a standardized way; SLURP has an explicit dev split. Additionally, as a counterpoint, I'd say that those with more compute would still have an advantage even if the train and dev splits were merged. Moreover, there isn't really a way around the issue you mentioned of learning from the non-English dev set via hyperparameter tuning for zero-shot, other than totally hiding all non-English data behind an evaluation interface like eval.ai. Our ultimate goal was to provide folks with a comprehensive dataset for their varied research goals, so we wanted to hide or exclude the least amount of data possible.

All that said, we are certainly open to future competitions that could somehow level the playing field. We have the model parameter limit for MMNLU-22, which we could lower even further for a future edge device, low-latency, and low-memory application-centric competition. If you have other ideas of how to reduce the disadvantage for individual researchers, we’d love to hear them.

A quick final point: The organizers’ choice award will be based primarily on our assessment of the promise of a given approach, not purely on the score. Thus, if you have an idea for a new modeling technique, even if it doesn’t have the highest score today, we’d love to hear about it and we’d love to see your submission.

Thanks for the feedback and engagement!