dropreg / R-Drop


Will KLD loss decrease very fast? #5

Closed snsun closed 2 years ago

snsun commented 3 years ago

Hi, as I mentioned in the title, did you find that the KLD part converges very fast and that the value of the KLD loss becomes very small after only a few steps?

HillZhang1999 commented 3 years ago

Hi, I also met this problem. Have you solved it?

snsun commented 3 years ago

No. In my task, the KL loss drops very fast, so it does not seem to be helpful.

a304628356 commented 3 years ago

In my task, the KL loss also drops fast, so I set alpha to 4; the model is still training.

HillZhang1999 commented 3 years ago

In my task, the KL loss also drops fast, so I set alpha to 4; the model is still training.

So will R-Drop be helpful in your task?

dropreg commented 3 years ago

Hi, as I mentioned in the title, did you find that the KLD part converges very fast and that the value of the KLD loss becomes very small after only a few steps?

As for your question, different tasks may lead to different conclusions. For example, in the NMT task the KL loss decreased from 0.7 to 0.15. In my opinion, a possible criterion for judging the effectiveness of R-Drop is whether the sub-models sampled by dropout differ enough to have a significant impact on training: for example, in a baseline trained without R-Drop, the KL loss between the two forward passes will gradually increase.
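
A rough sketch of this diagnostic, not code from this repository: run the same batch through the model twice with dropout active and log the symmetric KL between the two output distributions. `model` and `inputs` below are placeholder names for a generic PyTorch classifier that returns logits.

    import torch.nn.functional as F

    def dropout_kl(model, inputs):
        # Symmetric KL between two stochastic forward passes of the same batch.
        # If this stays near zero, the dropout sub-models barely disagree and
        # R-Drop has little to regularize; if it grows in a baseline trained
        # without R-Drop, the consistency term is more likely to help.
        model.train()                # keep dropout active
        logits1 = model(inputs)      # first pass, one dropout mask
        logits2 = model(inputs)      # second pass, another dropout mask
        p = F.log_softmax(logits1, dim=-1)
        q = F.log_softmax(logits2, dim=-1)
        kl_pq = F.kl_div(p, q, log_target=True, reduction="batchmean")
        kl_qp = F.kl_div(q, p, log_target=True, reduction="batchmean")
        return 0.5 * (kl_pq + kl_qp)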

liuyiyiyiyi commented 3 years ago

Because of running inference twice, the first outputs are modified by an in-place operation, which causes an error in loss.backward(). So I used .detach() on the first outputs. But if I do this, the result is strange: the cross-entropy loss is always very large. My code is as follows:

    # first forward pass
    features, conv_features = backbone(inputs)
    outputs, original_logits = head(features, labels)
    outputs = outputs.detach()  # detach the first outputs to avoid the backward error
    # second forward pass (a different dropout mask is sampled)
    features2, conv_features2 = backbone(inputs)
    outputs2, original_logits2 = head(features2, labels)
    # cross-entropy on the second pass plus the KL term weighted by 4
    loss_ce = criterion_ce(outputs2, labels)
    loss_kl = 4 * criterion_kl(outputs, outputs2)
    loss = loss_ce + loss_kl
dropreg commented 3 years ago

Because of running inference twice, the first outputs are modified by an in-place operation, which causes an error in loss.backward(). So I used .detach() on the first outputs. But if I do this, the result is strange: the cross-entropy loss is always very large.

The problem you mention is strange, and I am quite sure that forwarding twice works for any task. I don't think the issue of "the first outputs are modified by an in-place operation" is caused by R-Drop. Also, I strongly discourage using .detach(), which may hurt performance.
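
For reference, a minimal sketch of the two-pass pattern without .detach(), reusing the `backbone` / `head` / `criterion_ce` names from the snippet above; `rdrop_step` and `symmetric_kl` are illustrative names and assumptions about that setup, not code from this repository.

    import torch.nn.functional as F

    def symmetric_kl(logits_p, logits_q):
        # Bidirectional KL between the two dropout sub-models; gradients flow
        # through both forward passes, so no .detach() is applied.
        p_log = F.log_softmax(logits_p, dim=-1)
        q_log = F.log_softmax(logits_q, dim=-1)
        kl_pq = F.kl_div(p_log, q_log, log_target=True, reduction="batchmean")
        kl_qp = F.kl_div(q_log, p_log, log_target=True, reduction="batchmean")
        return 0.5 * (kl_pq + kl_qp)

    def rdrop_step(backbone, head, criterion_ce, inputs, labels, alpha=4.0):
        # Two stochastic forward passes over the same batch (different dropout masks).
        features1, _ = backbone(inputs)
        outputs1, _ = head(features1, labels)
        features2, _ = backbone(inputs)
        outputs2, _ = head(features2, labels)
        # Cross-entropy on both passes plus the weighted consistency term.
        loss_ce = 0.5 * (criterion_ce(outputs1, labels) + criterion_ce(outputs2, labels))
        return loss_ce + alpha * symmetric_kl(outputs1, outputs2)

If the head really does modify its outputs in place, cloning the logits before computing the losses (e.g. outputs1.clone()) keeps gradients intact, unlike .detach().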

liuyiyiyiyi commented 3 years ago

The problem you mention is strange, and I am quite sure that forwarding twice works for any task. I don't think the issue of "the first outputs are modified by an in-place operation" is caused by R-Drop. Also, I strongly discourage using .detach(), which may hurt performance.

Since the dropped neurons are different in the first and second forward passes, which neurons will be updated during back-propagation?

apeterswu commented 2 years ago

Closing the issue since there are no further questions.