facebookresearch/bit

Code repo for the paper "BiT: Robustly Binarized Multi-distilled Transformer"

Problem in reproducing the multi-distillation approach #2

Open · kongds opened this issue 1 year ago

kongds commented 1 year ago

Hello, thank you for providing the code.

I can get the right results for W1A1 with `bash scripts/run_glue.sh MNLI` (around 77 accuracy on MNLI).

But when I reproduce W1A1 with the multi-distillation approach (W32A32 -> W1A2 -> W1A1), I cannot reproduce the paper's W1A2 results by simply changing `abits=1` to `abits=2` in `scripts/run_glue.sh` (the W1A2 result I get is 80.96/81.36).

Can you share the detailed settings of the multi-distillation approach?
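
For reference, the staged schedule I am assuming looks roughly like the sketch below. Only `abits` appears in the released `scripts/run_glue.sh`; the `TEACHER` and `STUDENT_INIT` variables are hypothetical placeholders for however the intermediate checkpoint is handed to the next stage.

```bash
# Hedged sketch of the assumed multi-distillation schedule (W32A32 -> W1A2 -> W1A1).
# Only `abits` is known from scripts/run_glue.sh; TEACHER and STUDENT_INIT are
# hypothetical names for the checkpoint hand-off between stages.
set -e
TASK=MNLI

# Stage 1: fine-tune the full-precision (W32A32) teacher on the task.
bash scripts/run_glue.sh "$TASK"

# Stage 2: distill the FP32 teacher into a W1A2 student (abits=2 in run_glue.sh).
TEACHER=output/${TASK}_fp32 abits=2 bash scripts/run_glue.sh "$TASK"

# Stage 3: distill the W1A2 model into the final W1A1 student (abits=1),
# initializing the student from the stage-2 checkpoint.
TEACHER=output/${TASK}_w1a2 STUDENT_INIT=output/${TASK}_w1a2 \
  abits=1 bash scripts/run_glue.sh "$TASK"
```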

TTTTTTris commented 1 year ago

Hello, I've met the same problem, but I could not get the right results for W1A1 (around 52 accuracy on RTE), and when I tried to train W1A2 the result was even worse (50%). May I ask if you have tried to reproduce RTE?

kongds commented 1 year ago

I didn't run RTE, but I have tried STS-B. The W1A1 result is around 67.0, compared to 71.1 in the paper.

TTTTTTris commented 1 year ago

The results of STS-B are 67.7 (W1A1 w/o multi-distill), 73.5 (W1A2), and 58.0 (W1A1 w/ multi-distill), still lower than in the paper. I didn't use data parallel.

kongds commented 1 year ago

It seems that we cannot reproduce the STS-B result. The STS-B settings are here: https://github.com/facebookresearch/bit/blob/071a9749e024e8e151c55adbeb6ef3aaf5b8a283/utils_glue.py#L689. According to the paper, the authors use a grid search over hyperparameters to get the STS-B result; a sketch follows below.

[screenshot of the paper's grid-search settings]
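
A minimal sketch of that kind of grid search, assuming `run_glue.sh` picks the learning rate and seed up from its environment; the `LR`/`SEED` names are my assumption, and the released script may hard-code them instead:

```bash
# Hedged grid-search sketch for STS-B. LR and SEED are assumed knobs;
# the released run_glue.sh may hard-code these values instead.
mkdir -p logs
for lr in 5e-5 1e-4 2e-4 5e-4; do
  for seed in 1 2 3; do
    LR="$lr" SEED="$seed" bash scripts/run_glue.sh STS-B \
      2>&1 | tee "logs/stsb_lr${lr}_seed${seed}.log"
  done
done
```
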
NicoNico6 commented 1 year ago

Hello, I've met the same problem: I also could not get the right results for W1A1 on STS-B (around 68, compared to 71 reported in the paper). May I ask whether you have figured out the reason? @kongds

kongds commented 1 year ago

Hi, I still can't get the correct result for W1A1 on STS-B, and I don't know why.

NicoNico6 commented 1 year ago

It is difficult for me as well. I have tried most of the W1A2 experiments (with a clear accuracy gap), and I want to cite and compare against BiT in my paper, but the accuracy gap really confuses me.

TTTTTTris commented 1 year ago

I cannot get the accuracy shown in the paper on most W1A2 or W1A4 tasks; the gap is about 10 points.

NicoNico6 commented 1 year ago

> I cannot get the accuracy shown in the paper on most W1A2 or W1A4 tasks; the gap is about 10 points.

Maybe the released version is not the optimal version.

Phuoc-Hoan-Le commented 1 year ago

I can reproduce the 1-1-1 BERT for all datasets without multi-distillation. But for 1-1-4 and 1-1-2 BERT, my results are way off. Is anyone @kongds @NicoNico6 @TTTTTTris @likethesky @Celebio getting the same thing?

NicoNico6 commented 1 year ago

> I can reproduce the 1-1-1 BERT for all datasets without multi-distillation. But for 1-1-4 and 1-1-2 BERT, my results are way off. Is anyone @kongds @NicoNico6 @TTTTTTris @likethesky @Celebio getting the same thing?

Hi, I also found this problem.

Besides, I tried to evaluate the released pre-trained models, but I cannot get the accuracy reported in the README table. For example, when data augmentation is used, the reported accuracy of the released pretrained models is RTE: 69.7, MRPC: 88, STS-B: 84.2.

However, when I ran the evaluation on the released checkpoints myself, the corresponding performance was RTE: 66 vs 69.7, MRPC: 85.5 vs 88, STS-B: 82.3 vs 84.2 (my evaluation loop is sketched below).

Did you find the same issue?
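
For reference, my re-evaluation loop was along these lines; the `MODEL_DIR` variable and the checkpoint directory layout are assumptions about how `run_glue.sh` locates a downloaded model, not documented options:

```bash
# Hedged sketch of re-evaluating the released checkpoints. MODEL_DIR is an
# assumed knob; the released run_glue.sh may take the checkpoint path elsewhere.
mkdir -p logs
for task in RTE MRPC STS-B; do
  ckpt="checkpoints/bit_$(echo "$task" | tr '[:upper:]' '[:lower:]')"  # hypothetical layout
  MODEL_DIR="$ckpt" bash scripts/run_glue.sh "$task" \
    2>&1 | tee "logs/eval_${task}.log"
done
```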

Phuoc-Hoan-Le commented 1 year ago

> Besides, I tried to evaluate the released pre-trained models, but I cannot get the accuracy reported in the README table. [...] Did you find the same issue?

Have you tried doing a grid search over the hyperparameters to see if it works?