PiotrNawrot / nanoT5

Fast & Simple repository for pre-training and fine-tuning T5-style models
Apache License 2.0

About Pre-training objectives #38

Closed · Respaired closed this 5 months ago

Respaired commented 5 months ago

Hi, thanks for providing this implementation. I really appreciate it. I'm a bit new to training encoder-decoder models, so I was wondering if you could answer one question.

If my understanding is correct, the regular T5 pre-training objective is very similar to MLM: you mask some tokens and have the model learn to predict them. So I want to know whether, instead of masking tokens, I could corrupt my whole dataset (20% of the tokens in each row) by replacing them with other tokens (no fancy generator-discriminator setup, just corrupting the data during the pre-processing step) and treat it as a grammar/typo correction task, where the labels are the original, clean text. Could this be a viable objective?

input:"the katt jamped over the fense" label: "the cat jumped over the fence"

May I ask what you think of this approach?

PiotrNawrot commented 5 months ago

Hey,

It all sounds good to me. I can't tell whether this would work better or worse than the regular MLM-style objective used in T5, but you could try it!

Good luck!