Hey,
Yeah, you should modify DataCollatorForT5MLM. To implement whole word masking you need access to the text before tokenisation and perform pre-tokenisation in the form of splitting on whitespace. Then you forward that pre-processed sequence to the tokeniser and, from the mapping between words and sub-word tokens, construct a mask that tells you where the word boundaries are. Based on this mask and the token sequence you can implement whole word masking inside DataCollatorForT5MLM.
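A minimal sketch of that boundary mask, assuming a HuggingFace *fast* tokenizer; the checkpoint name and the `whole_word_mask` helper are purely illustrative and not part of nanoT5:

```python
# Illustrative sketch only (not nanoT5 code): build a word-boundary-aware mask
# with a HuggingFace fast tokenizer, then mask whole words at once.
import numpy as np
from transformers import AutoTokenizer

# Hypothetical checkpoint; swap in the tokenizer you actually pretrain with.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")

def whole_word_mask(text, mask_prob=0.15, rng=None):
    rng = rng or np.random.default_rng()
    # Pre-tokenise by splitting on whitespace, so word boundaries are known
    # before sub-word tokenisation.
    words = text.split()
    enc = tokenizer(words, is_split_into_words=True, add_special_tokens=False)
    # word_ids() maps every sub-word token back to the index of its source word.
    word_ids = enc.word_ids()

    # Pick whole words to mask, then expand that choice to all of their sub-word tokens.
    masked_words = set(np.flatnonzero(rng.random(len(words)) < mask_prob).tolist())
    token_mask = np.array([wid in masked_words for wid in word_ids], dtype=bool)
    return enc["input_ids"], token_mask
```

The boolean `token_mask` could then stand in for the per-token noise mask that the collator currently derives from random span sampling, so that masked spans never cut through the middle of a word.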
Good luck, let me know if you have any other questions.
First off, I want to thank you for the amazing work on creating nanoT5! This repo has helped me continue pretraining codeT5 on my own data corpus. Thanks a lot! I also have some questions about the current masking implementation in nanoT5:
It seems that nanoT5 currently uses random span masking. If I want to implement the whole word masking (WWM) trick, how could I implement it?
I think I should modify the DataCollatorForT5MLM class implementation, but I am not sure whether I am on the right track. Could you give me some hints on where to start? Any insights or pointers you can provide would be much appreciated!