JonasGeiping / cramming

Cramming the training of a (BERT-type) language model into limited compute.

Question about sparse token prediction #33

Closed leo-du closed 1 year ago

leo-du commented 1 year ago

Hi Jonas,

Thanks for sharing the great work! I have a small question about the paper.

Both your paper and Izsak et al. refer to RoBERTa for something called "sparse token prediction", which I couldn't find in the RoBERTa paper. From your code, it appears that "sparse token prediction" just means that you only calculate the loss from the positions that are masked. It seems that this should be the default setting for training an MLM (and appears to be the case in BERT's code). The situation where you turn off this sparse prediction doesn't quite make sense to me -- why would one want to predict the unmasked tokens? Am I missing something obvious here?
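
For concreteness, my reading of the default setup is something like the sketch below (my own illustration with made-up shapes, not your actual code): logits are computed for every position, but unmasked positions are labeled -100 and so never contribute to the loss anyway.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only, not the repo code: logits exist for every position,
# but unmasked positions get label -100, so cross-entropy already ignores them.
batch, seq_len, vocab = 2, 8, 32
logits = torch.randn(batch, seq_len, vocab)        # logits for all positions
labels = torch.randint(0, vocab, (batch, seq_len))
mask = torch.rand(batch, seq_len) < 0.15           # ~15% of tokens are masked
labels[~mask] = -100                               # unmasked tokens carry no label

loss = F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100)
```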

Thanks for any help!

JonasGeiping commented 1 year ago

Hi, I totally missed this.

Yes, this is strictly a performance improvement (one that was maybe not discussed enough in the original BERT paper, which is where the confusion comes from), and there is no downside to applying it (aside from dynamic tensor shapes in some implementations).

Regarding your question: even if sparse prediction is turned off, no loss is computed on the unmasked tokens either. There is just more wasted work, because logits are computed for every position and then masked out again during the cross-entropy computation.
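
To make that concrete, here is a rough sketch of the sparse variant (illustrative shapes and names only, not the actual cramming implementation): the masked positions are selected before the LM head, so logits and the loss are only computed for the tokens that actually carry a label.

```python
import torch
import torch.nn as nn

# Rough sketch with made-up shapes/names (not the actual cramming code):
# select the masked positions *before* the LM head, so logits are only
# computed for the ~15% of tokens that have a label.
hidden_dim, vocab = 64, 32
lm_head = nn.Linear(hidden_dim, vocab)

hidden_states = torch.randn(2, 8, hidden_dim)      # [batch, seq_len, hidden]
labels = torch.randint(0, vocab, (2, 8))
labels[torch.rand(2, 8) >= 0.15] = -100            # unmasked positions carry no label

flat_hidden = hidden_states.view(-1, hidden_dim)
flat_labels = labels.view(-1)
masked_idx = (flat_labels != -100).nonzero(as_tuple=True)[0]

# LM head and loss run only on the masked subset; the result matches the
# dense variant, which computes logits everywhere and masks inside the loss.
sparse_loss = nn.functional.cross_entropy(lm_head(flat_hidden[masked_idx]),
                                          flat_labels[masked_idx])
```

With ~15% masking, this skips roughly 85% of the LM-head/logit computation, which is the entire point of the trick.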