Closed jlxy closed 3 years ago
Hello. I noticed that you didn't apply scheduled sampling when training the xtransformer model. Could you tell me the reason? Thanks.

The transformer comes from Google's paper "Attention Is All You Need"; no scheduled sampling is used there either.

Unlike RNN-based models, the transformer is trained in parallel over all target positions, so scheduled sampling can't be applied to it directly. For more details, you can refer to the paper "Scheduled Sampling for Transformers".
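To illustrate why scheduled sampling assumes step-by-step decoding, here is a minimal sketch of the technique for an RNN-style decoder. `model_step` is a hypothetical stand-in for one decoder step; the point is that each step's input depends on a coin flip over the *previous* step's output, a dependency a transformer's single parallel forward pass over all positions does not have.

```python
import random

def scheduled_sampling_decode(ground_truth, model_step, teacher_forcing_ratio,
                              rng=None):
    """Sequentially decode, choosing at each step between the gold token
    (teacher forcing) and the model's own previous prediction.

    model_step: hypothetical function mapping the previous token to the
    next predicted token (stands in for one RNN decoder step).
    """
    rng = rng or random.Random(0)
    prev = "<bos>"
    outputs = []
    for gold in ground_truth:
        pred = model_step(prev)
        outputs.append(pred)
        # The per-step sampling decision: with probability
        # teacher_forcing_ratio feed the gold token, otherwise feed the
        # model's own prediction back in.
        prev = gold if rng.random() < teacher_forcing_ratio else pred
    return outputs
```

With `teacher_forcing_ratio=1.0` this reduces to ordinary teacher forcing; with `0.0` the decoder always consumes its own predictions. The "Scheduled Sampling for Transformers" paper works around the parallelism issue with a two-pass scheme: a first pass produces model predictions for all positions at once, and a second pass mixes them with the gold tokens.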