Hey,
Yeah, you should modify DataCollatorForT5MLM. To implement whole word masking you need access to the text before tokenisation and perform pre-tokenisation in the form of splitting on whitespace. Then you forward that pre-processed sequence to the tokeniser and, from the mapping between words and sub-word tokens, construct a mask that tells you where the word boundaries are. Based on this mask and the token sequence you can implement whole word masking inside DataCollatorForT5MLM.
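A minimal sketch of that boundary mask, assuming a HuggingFace *fast* tokenizer; the checkpoint name and the `whole_word_mask` helper are purely illustrative and not part of nanoT5:

```python
# Illustrative sketch only (not nanoT5 code): build a word-boundary-aware mask
# with a HuggingFace fast tokenizer, then mask whole words at once.
import numpy as np
from transformers import AutoTokenizer

# Hypothetical checkpoint; swap in the tokenizer you actually pretrain with.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")

def whole_word_mask(text, mask_prob=0.15, rng=None):
    rng = rng or np.random.default_rng()
    # Pre-tokenise by splitting on whitespace, so word boundaries are known
    # before sub-word tokenisation.
    words = text.split()
    enc = tokenizer(words, is_split_into_words=True, add_special_tokens=False)
    # word_ids() maps every sub-word token back to the index of its source word.
    word_ids = enc.word_ids()

    # Pick whole words to mask, then expand that choice to all of their sub-word tokens.
    masked_words = set(np.flatnonzero(rng.random(len(words)) < mask_prob).tolist())
    token_mask = np.array([wid in masked_words for wid in word_ids], dtype=bool)
    return enc["input_ids"], token_mask
```

The boolean `token_mask` could then stand in for the per-token noise mask that the collator currently derives from random span sampling, so that masked spans never cut through the middle of a word.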
Good luck, let me know if you have any other questions.
First off, I want to thank you for the amazing work on creating nanoT5! This repo has helped me continue pretraining codeT5 on my own data corpus. Thanks a lot! I also have some questions about the current masking implementation in nanoT5:
It seems that nanoT5 currently uses random span masking. If I want to implement the whole word masking (WWM) trick, how could I implement it?
I think I should modify the DataCollatorForT5MLM class implementation, but I am not sure whether I am on the right track. Could you give me some hints on where to start? Any insights or pointers you can provide would be much appreciated!