Really like this work!

I noticed that during training, 50% of the tokens are randomly reassigned to another code before the masking process is applied here. Is this intended? And I wonder what the intuition behind it is.

Thank you for your interest.

During inference, the model predicts the masked tokens iteratively. If it makes a wrong prediction in an early iteration, that error can negatively affect predictions in later iterations. To mitigate this accumulated error, we added noise during training by randomly replacing some input tokens with random ones. This helps the model stay robust to the errors that can occur during inference.
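For readers curious about the mechanics, here is a minimal sketch of this kind of noise-then-mask step. All names below are hypothetical, not the repo's actual code, and it assumes a PyTorch codebase with a MaskGIT-style random masking ratio:

```python
import torch

def noise_then_mask(tokens, codebook_size, mask_token_id, noise_prob=0.5):
    """Corrupt a (batch, seq_len) LongTensor of code indices, then mask it.

    Step 1 mimics inference-time mistakes: each position is independently
    replaced with a uniformly sampled code with probability `noise_prob`.
    Step 2 is the usual masking; the loss is computed on the masked
    positions, so the model predicts them from a possibly-corrupted context.
    """
    # Step 1: noise injection before masking.
    random_codes = torch.randint_like(tokens, low=0, high=codebook_size)
    replace = torch.rand(tokens.shape, device=tokens.device) < noise_prob
    noisy_tokens = torch.where(replace, random_codes, tokens)

    # Step 2: mask a randomly sampled fraction of positions
    # (a single uniform ratio stands in for the real masking schedule).
    mask_ratio = torch.rand((), device=tokens.device)
    masked = torch.rand(tokens.shape, device=tokens.device) < mask_ratio
    inputs = torch.where(masked, torch.full_like(tokens, mask_token_id), noisy_tokens)
    return inputs, masked
```

The key point is that the unmasked context the model conditions on can now contain wrong codes, mirroring what it sees at inference time after an early misprediction.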
ty! that makes a lot of sense

Closed: WilsonWangTHU closed this 4 months ago.