PiotrNawrot / nanoT5

Fast & Simple repository for pre-training and fine-tuning T5-style models
Apache License 2.0

Question about implementing whole word masking in nanoT5 #32

Closed · brick-pid closed this 7 months ago

brick-pid commented 7 months ago

First off, I want to thank you for the amazing work on creating nanoT5! This repo has helped me continue pretraining codeT5 on my own data corpus. Thanks a lot! I also have a couple of questions about the current masking implementation in nanoT5:

  1. It seems that nanoT5 currently uses random span masking. If I want to apply the whole word masking (WWM) trick, how should I implement it?

  2. I think I should modify the DataCollatorForT5MLM class implementation, but I am not sure if I am on the right track. Could you give me some hints on where to start?

Any insights or pointers you can provide would be much appreciated!

PiotrNawrot commented 7 months ago

Hey,

Yeah, you should modify DataCollatorForT5MLM. To implement whole word masking you need access to the text document before tokenisation, and you need to pre-tokenise it by splitting on whitespace. Then you can feed the pre-tokenised sequence to the tokeniser and, from the input and output of that step, construct a mask that tells you where the word boundaries are. Based on this mask and the sequence of tokens you can implement whole word masking in DataCollatorForT5MLM.
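For illustration, here is a minimal sketch of that idea, not the actual nanoT5 collator. It assumes a fast HuggingFace tokenizer (so `word_ids()` is available) and uses T5 sentinel tokens; the helper names `word_boundary_mask` and `wwm_mask_tokens` are made up for this example, and in practice you would fold this logic into `DataCollatorForT5MLM`:

```python
# Hypothetical whole-word-masking sketch, not the nanoT5 implementation.
# Assumes a fast tokenizer so `word_ids()` maps tokens back to words.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

def word_boundary_mask(text):
    """Pre-tokenise on whitespace, then tokenise, and return
    (input_ids, word_ids) where word_ids[i] is the index of the
    whitespace-separated word that token i belongs to."""
    words = text.split()  # pre-tokenisation: split the raw document on whitespace
    enc = tokenizer(words, is_split_into_words=True, add_special_tokens=False)
    return enc["input_ids"], enc.word_ids()

def wwm_mask_tokens(input_ids, word_ids, mask_prob=0.15, rng=None):
    """Pick ~mask_prob of the *words* and replace every token of each
    picked word with a T5 sentinel (one sentinel per masked word)."""
    rng = rng or np.random.default_rng()
    num_words = max(w for w in word_ids if w is not None) + 1
    num_to_mask = max(1, int(round(num_words * mask_prob)))
    masked_words = set(rng.choice(num_words, size=num_to_mask, replace=False))

    inputs, targets, sentinel = [], [], 0
    prev_word = None
    for tok, w in zip(input_ids, word_ids):
        if w in masked_words:
            if w != prev_word:
                # First token of a masked word: emit a sentinel in both sequences.
                # (A faithful T5 objective would merge adjacent masked words into
                # one sentinel span; omitted here to keep the sketch short.)
                sid = tokenizer.convert_tokens_to_ids(f"<extra_id_{sentinel}>")
                inputs.append(sid)
                targets.append(sid)
                sentinel += 1
            targets.append(tok)  # masked tokens only appear in the target
        else:
            inputs.append(tok)
        prev_word = w
    targets.append(tokenizer.eos_token_id)
    return inputs, targets

ids, wids = word_boundary_mask("whole word masking keeps every subword of a word together")
src, tgt = wwm_mask_tokens(ids, wids)
```

The key design point is that the masking decision is made per whitespace word (via `word_ids`), so all subword pieces of a chosen word are corrupted together, rather than sampling spans over subword tokens as the current random span masking does.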

Good luck, let me know if you have any other questions.