Closed snsun closed 2 years ago
Hi, I've also run into this problem. Have you solved it?
No. In my task the KL loss drops very fast, so it does not seem to be helpful.
In my task the KL loss also drops fast, so I set alpha to 4; the model is still training.
So will R-Drop be helpful in your task?
Hi, as I mentioned in the title, did you find that the KLD part converges very fast, so that the KLD loss is very small after only a few steps?
As for your question, different tasks may lead to different conclusions. For example, in the NMT task the KL loss dropped from 0.7 to 0.15. In my opinion, a possible criterion for judging whether R-Drop will be effective is whether the sub-models sampled by dropout have a significant impact on training. For example, in the baseline without R-Drop, the KL loss will gradually increase.
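For reference, the R-Drop objective being discussed (cross-entropy on two dropout forward passes plus a symmetric KL term between them) can be sketched as follows. This is a minimal PyTorch sketch with a toy model; all shapes and names are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy classifier with dropout; sizes are arbitrary.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.3), nn.Linear(16, 4))
x = torch.randn(5, 8)
labels = torch.randint(0, 4, (5,))

# Two forward passes sample two different dropout sub-models.
logits1, logits2 = model(x), model(x)

# Cross-entropy on both passes plus a symmetric KL term between them.
ce = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))
kl = 0.5 * (
    F.kl_div(F.log_softmax(logits1, dim=-1), F.softmax(logits2, dim=-1), reduction="batchmean")
    + F.kl_div(F.log_softmax(logits2, dim=-1), F.softmax(logits1, dim=-1), reduction="batchmean")
)
alpha = 4.0  # the weight mentioned in this thread
loss = ce + alpha * kl
loss.backward()
```

The KL term is the part of the loss this thread observes shrinking quickly; when the two sub-models already agree, it contributes little.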
Because we run inference twice, the first output gets modified by an in-place operation, and there is an error in loss.backward(). So I used .detach() on the first output. If I do this, the result is strange: the cross-entropy loss stays very large. My code is as follows:

features, conv_features = backbone(inputs)
outputs, original_logits = head(features, labels)
outputs = outputs.detach()
features2, conv_features2 = backbone(inputs)
outputs2, original_logits2 = head(features2, labels)
loss_ce = criterion_ce(outputs2, labels)
loss_kl = 4 * criterion_kl(outputs, outputs2)
loss = loss_ce + loss_kl
The problem you mention is strange, and I'm quite sure that forwarding twice works for any task. I don't think the issue "the first outputs are modified by an in-place operation" is caused by R-Drop. Also, I strongly discourage using .detach(), which may hurt performance.
For the first and second forward passes the dropped-out neurons are different, so which neurons will be updated in back-propagation?
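On the last question: dropout only zeroes activations for that particular forward pass; the two sub-models share one set of weights, and autograd accumulates gradients from both passes onto all of them. A small sketch with a toy model (names illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(6, 6), nn.Dropout(0.5), nn.Linear(6, 2))
x = torch.randn(4, 6)

out1 = model(x)  # one random dropout mask
out2 = model(x)  # a different random dropout mask
(out1.sum() + out2.sum()).backward()

# The two sub-models share the same weights; autograd sums the
# contributions from both passes, so every parameter gets a gradient.
all_updated = all(p.grad is not None for p in model.parameters())
```

So all weights are updated; each pass simply contributes gradients only through the neurons that were active in it.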
Closing the issue since there are no further questions.