Great job. R-Drop forces the output distributions of the different sub-models generated by dropout to be consistent with each other. Can an MSE loss replace the KL divergence? Looking forward to your reply.
Yes. As we presented in Appendix A.4, the STS-B task in GLUE is a regression task, so MSE regularization is required there in place of the KL divergence. You can check Appendix A.4 for this simple extension.
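
For illustration, here is a minimal sketch of how an MSE consistency term might look for a regression task. It is not the code from the paper: the function names `mse` and `rdrop_regression_loss`, the weight `alpha`, and the way the terms are summed are all assumptions made for this example. `pred1` and `pred2` stand for the outputs of two forward passes over the same batch with different dropout masks.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rdrop_regression_loss(pred1, pred2, target, alpha=1.0):
    """Hypothetical R-Drop-style loss for regression.

    pred1, pred2: predictions from two dropout passes over the same inputs.
    target: ground-truth regression values.
    alpha: assumed weight on the consistency term (illustrative only).
    """
    # Task loss: each pass is fit to the targets.
    task = mse(pred1, target) + mse(pred2, target)
    # Consistency term: MSE between the two passes, used here where the
    # classification setting would use a KL divergence between distributions.
    consistency = mse(pred1, pred2)
    return task + alpha * consistency


if __name__ == "__main__":
    loss = rdrop_regression_loss([1.0, 2.0], [1.0, 4.0], [1.0, 3.0], alpha=1.0)
    print(loss)
```

When the two passes agree exactly, the consistency term is zero and only the task loss remains, which is the behavior the regularizer is meant to encourage.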