dropreg / R-Drop


Inconsistency in KL loss and CE loss hyper-parameters, and baseline results on GLUE #6

Closed zhangzhenyu13 closed 3 years ago

zhangzhenyu13 commented 3 years ago

There is an inconsistency between the loss code in the bert_modeling and roberta_modeling files. The BERT loss is computed as ce(logits1, labels) + ce(logits2, labels) + 0.5/2.0 * (kl(logits1, logits2) + kl(logits2, logits1)), i.e. the paper's alpha is 0.5 here. The RoBERTa loss is computed as 0.5 * (ce(logits1, labels) + ce(logits2, labels)) + 0.7/2.0 * (kl(logits1, logits2) + kl(logits2, logits1)), i.e. alpha is 0.7 and the CE terms are additionally weighted by 0.5. What are the tricks here?
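
For reference, a minimal sketch of the loss as the paper writes it, i.e. ce(logits1, labels) + ce(logits2, labels) + alpha/2 * (kl(logits1, logits2) + kl(logits2, logits1)). This is only an illustration (it uses a `batchmean` reduction for readability, whereas the repo sums over all elements), not the repo's implementation:

```python
import torch.nn.functional as F

def paper_rdrop_loss(logits1, logits2, labels, alpha):
    # Two unweighted cross-entropy terms, one per forward pass.
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    # Symmetric KL between the two predicted distributions, weighted by alpha / 2.
    kl = F.kl_div(F.log_softmax(logits1, dim=-1), F.softmax(logits2, dim=-1),
                  reduction="batchmean")
    reverse_kl = F.kl_div(F.log_softmax(logits2, dim=-1), F.softmax(logits1, dim=-1),
                          reduction="batchmean")
    return ce + alpha / 2 * (kl + reverse_kl)
```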

In BERT:

```python
alpha = 1.0
for logits in logits_list:
    if labels is not None:
        if self.num_labels == 1:
            # We are doing regression
            loss_fct = MSELoss()
            if loss:
                loss += alpha * loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss = alpha * loss_fct(logits.view(-1), labels.view(-1))
        else:
            loss_fct = CrossEntropyLoss()
            if loss:
                loss += alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            else:
                loss = alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

if loss is not None:
    if self.num_labels == 1:
        loss_fct = MSELoss()
        loss += 1.0 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
    else:
        p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
        p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
        q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
        q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

        kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none').sum()
        reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none').sum()
        loss += 0.5 * (kl_loss + reverse_kl_loss) / 2.0
```

In RoBERTa:

```python
loss = None
if labels is not None:
    if self.num_labels == 1:
        # We are doing regression
        loss_fct = MSELoss()
        if loss is None:
            loss = 0.5 * loss_fct(logits_list[0].view(-1), labels.view(-1))
        else:
            loss += 0.5 * loss_fct(logits_list[-1].view(-1), labels.view(-1))
    else:
        loss_fct = CrossEntropyLoss()
        if loss is None:
            loss = 0.5 * loss_fct(logits_list[0].view(-1, self.num_labels), labels.view(-1))
        else:
            loss += 0.5 * loss_fct(logits_list[-1].view(-1, self.num_labels), labels.view(-1))

if loss is not None:
    if self.num_labels == 1:
        loss_fct = MSELoss()
        loss += 0.8 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
    else:
        p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
        p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
        q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
        q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

        kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none')
        reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none')

        loss += 0.7 * (kl_loss.sum() + reverse_kl_loss.sum()) / 2
```
dropreg commented 3 years ago

Choosing the hyper-parameter reg_alpha is critical for each pre-trained model; the results are shown in the paper. We have no further empirical guidance here, and more experiments may be needed to find the best reg_alpha.
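
If you want to search for it, a simple sweep over a few candidate values is usually enough. A sketch is below; `train_and_eval` is a hypothetical helper that runs one fine-tuning job with the given weight and returns the dev-set metric, not a function in this repo:

```python
# Hypothetical sweep over reg_alpha; train_and_eval is a placeholder, not part of this repo.
candidates = [0.1, 0.3, 0.5, 0.7, 1.0]
best_alpha, best_score = None, float("-inf")
for reg_alpha in candidates:
    score = train_and_eval(reg_alpha=reg_alpha)  # one fine-tuning run, returns the dev metric
    if score > best_score:
        best_alpha, best_score = reg_alpha, score
print(f"best reg_alpha = {best_alpha}, dev score = {best_score:.4f}")
```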

zhangzhenyu13 commented 3 years ago

The original paper does not apply any weight to the NLL loss terms (both nll(p1, y) and nll(p2, y)). But in your code you do weight those terms, and the weights differ between encoders. Besides, following your code snippets, the GLUE scores are not as high as in the paper, both for the baselines and for R-Drop (though the improvement from R-Drop is still observed). The baselines reported by transformers (the library you used to implement the encoders) are also different; see https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification. The difference between the official huggingface args and yours is that they do not apply warm-up while your code does; however, the results after applying warm-up are slightly better than yours. Any suggestions on the proper way to run this? Thanks.

dropreg commented 3 years ago

> The original paper does not apply any weight to the NLL loss terms (both nll(p1, y) and nll(p2, y)). But in your code you do weight those terms, and the weights differ between encoders. Besides, following your code snippets, the GLUE scores are not as high as in the paper, both for the baselines and for R-Drop (though the improvement from R-Drop is still observed). The baselines reported by transformers (the library you used to implement the encoders) are also different; see https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification. The difference between the official huggingface args and yours is that they do not apply warm-up while your code does; however, the results after applying warm-up are slightly better than yours. Any suggestions on the proper way to run this? Thanks.

Hi zhangzhenyu13.

Regarding "following your code snippets, the GLUE scores are not as high as in the paper": I'm not sure your hyper-parameters are the ones described in the paper. If they are, then something else may be off, such as the random seed. Note that we also report the possible fluctuation across different random seeds in the original paper:

[screenshot: random-seed fluctuation results reported in the paper]

Regarding "the official huggingface args do not apply warm-up while your code does": keep in mind that the official huggingface script is just a demo. We chose our hyper-parameters following the paper "Better Fine-Tuning by Reducing Representational Collapse", although ours are not exactly the same:

[screenshot: hyper-parameter settings referenced above]

zhangzhenyu13 commented 3 years ago

Thanks. It is clear that the scores vary when the hyper-parameters change even slightly. However, setting the hyper-parameters to lr=2e-5, bs=32, no warm-up or with warm-up, and 3 epochs (6 for R-Drop) is enough for all tasks. The random seed is crucial to the results. I sampled 10 random seeds from (0, 100), namely 22, 83, 46, 14, 3, 28, 33, 69, 93, 40, and the best scores I get are even better than yours.

| Model | Score |
| --- | --- |
| BERT-base | 0.8624 |
| R-Drop (BERT-base) | 0.8747 |
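
For reproduction, a minimal sketch of that configuration with the Hugging Face Trainer API (model, dataset, and `Trainer` construction are omitted; the output path is illustrative):

```python
from transformers import TrainingArguments, set_seed

seeds = [22, 83, 46, 14, 3, 28, 33, 69, 93, 40]

for seed in seeds:
    set_seed(seed)
    args = TrainingArguments(
        output_dir=f"outputs/seed_{seed}",
        learning_rate=2e-5,
        per_device_train_batch_size=32,
        num_train_epochs=6,   # 3 for the baseline, 6 when training with R-Drop
        warmup_ratio=0.0,     # or a small warm-up, as discussed above
        seed=seed,
    )
    # Build the model and Trainer with `args` here, train, and record the
    # dev-set score for this seed; the numbers above are the best over the 10 seeds.
```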

beyondguo commented 2 years ago

> Thanks. It is clear that the scores vary when the hyper-parameters change even slightly. However, setting the hyper-parameters to lr=2e-5, bs=32, no warm-up or with warm-up, and 3 epochs (6 for R-Drop) is enough for all tasks. The random seed is crucial to the results. I sampled 10 random seeds from (0, 100), namely 22, 83, 46, 14, 3, 28, 33, 69, 93, 40, and the best scores I get are even better than yours.
>
> BERT-base: 0.8624, R-Drop (BERT-base): 0.8747

Hi @zhangzhenyu13 @dropreg, are these results evaluated on the dev set rather than the test set?

The paper says:

> We further evaluate our proposed approach on the language understanding tasks by fine-tuning the pre-trained models, which are the standard development sets of GLUE [63] benchmark.

The huggingface demo (https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification) also says its results are on the dev set.

So I'm a little confused: why not report results on the test set?