Closed: deepSTEM closed this issue 3 years ago.
The performance improvement from `LabelSmoothing` for xlan is not significant. One possible reason is scheduled sampling, which already mitigates both overfitting and the discrepancy between training and inference. Thus, plain `CrossEntropy` is used for xlan for simplicity.
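To make the difference between the two criteria concrete, here is a minimal NumPy sketch of label-smoothed cross-entropy. This is an illustration of the general technique, not the repo's actual `LabelSmoothing` implementation; the function name and the uniform-smoothing scheme are assumptions.

```python
import numpy as np

def label_smoothing_ce(logits, target, smoothing=0.1):
    """Cross-entropy with label smoothing (illustrative sketch).

    The one-hot target is softened: the true class gets probability
    1 - smoothing, and the remaining mass is spread uniformly over
    the other classes. With smoothing = 0 this reduces to ordinary
    cross-entropy.
    """
    n_classes = logits.shape[-1]
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Smoothed target distribution.
    soft = np.full(n_classes, smoothing / (n_classes - 1))
    soft[target] = 1.0 - smoothing
    return -(soft * log_probs).sum()
```

With `smoothing=0` the function matches plain cross-entropy, which is the xlan setting; a nonzero value penalizes overconfident predictions, which is the regularizing effect `LabelSmoothing` provides for xtransformer.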
While comparing the performance of different models, I noticed that you use `CrossEntropy` when the model is xlan but `LabelSmoothing` when the model is xtransformer. Why do you treat them differently? Is `LabelSmoothing` not suitable for xlan?