GANPerf / LCR


The results in the paper could not be reproduced #6

Open eafn opened 1 year ago

eafn commented 1 year ago

Hi! I trained the model using your code, but I could not reproduce the results reported in the paper. Could you please release more training and test details?

GANPerf commented 1 year ago

Hi, I appreciate your interest. Could you kindly provide a detailed explanation regarding which specific results cannot be reproduced and on which dataset? This information will greatly assist in addressing the issue effectively. Thank you.

eafn commented 1 year ago

Specifically, when I tried to reproduce the baseline results (an ImageNet-1K-pretrained ResNet-50 without any SSL pretraining) using main_lincls.py, I found that they differed significantly from those reported in the paper. With the given parameter settings (lr=30, bs=256, wd=0, schedule=[60,80]), the linear evaluation result only reached 63.85, while the KNN result (topk=200, t=0.1) was 46.37. This is a substantial discrepancy from the results reported in the paper. Despite trying various learning rates and batch sizes (e.g., lr=0.1, bs=256), the highest linear evaluation result I have achieved so far is only 65.68.
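
For context, the topk=200, t=0.1 setting above corresponds to the commonly used weighted kNN evaluation protocol for self-supervised features. The sketch below shows that protocol under common assumptions (L2-normalized features, cosine similarity, temperature-weighted voting); the function and variable names are illustrative and are not taken from this repository.

```python
import torch
import torch.nn.functional as F

def knn_accuracy(train_feats, train_labels, test_feats, test_labels,
                 num_classes, topk=200, t=0.1):
    """Weighted kNN classification on L2-normalized features.

    Illustrative sketch of the common evaluation protocol; not this
    repository's own code.
    """
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)

    sims = test_feats @ train_feats.t()              # cosine similarities
    sim_topk, idx_topk = sims.topk(topk, dim=1)      # k nearest training samples
    weights = (sim_topk / t).exp()                   # temperature-scaled weights

    neighbor_labels = train_labels[idx_topk]         # (N_test, topk)
    one_hot = F.one_hot(neighbor_labels, num_classes).float()
    scores = (one_hot * weights.unsqueeze(-1)).sum(dim=1)

    pred = scores.argmax(dim=1)                      # weighted vote over neighbors
    return (pred == test_labels).float().mean().item()
```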

eafn commented 1 year ago

In addition, linear evaluation gives the same result on both 2 and 4 GPUs.

GANPerf commented 1 year ago

I appreciate your interest. Could you please let me know which checkpoint you have been using? How about the acc1 performance on StanfordCars & Aircraft? To attain the highest acc1 accuracy, it is advisable to consider selecting the checkpoint with the best retrieval performance, rather than relying solely on the last epoch during the pretraining process. For your reference, I have provided the checkpoint at the following link: Checkpoint Link

eafn commented 1 year ago

Thank you for your reply. I only tested on CUB, and only the baseline (IN1K-pretrained only), not your model. I don't quite understand why the retrieval rank-1 baseline results are so different: my result is 46, while the paper reports 10.65. Does 'Retrieval' refer to KNN classification (topk=200)?

eafn commented 1 year ago

Besides, the linear evaluation result I reported is the highest accuracy reached during linear-classifier training, and the KNN result is obtained without any training.

GANPerf commented 1 year ago

It appears that there is a distinction between our retrieval rank-1 metric and the KNN approach. In our rank-1 metric, the anchor is assigned the label of the feature with the smallest distance to it, so no majority voting is involved.
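
Based on the description above, the rank-1 metric assigns each anchor the label of its single nearest neighbor in feature space, with no top-k voting. A minimal sketch under that reading (function and variable names are illustrative, not the repository's code):

```python
import torch
import torch.nn.functional as F

def rank1_retrieval(feats, labels, normalize=True):
    """Rank-1 retrieval accuracy: predict each sample's label as the label
    of its single nearest neighbor. Illustrative sketch only.
    """
    if normalize:
        # L2 normalization makes the Euclidean ranking equivalent to cosine ranking
        feats = F.normalize(feats, dim=1)

    dists = torch.cdist(feats, feats)        # pairwise Euclidean distances
    dists.fill_diagonal_(float('inf'))       # exclude the anchor itself
    nn_idx = dists.argmin(dim=1)             # index of the single nearest neighbor

    return (labels[nn_idx] == labels).float().mean().item()
```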

eafn commented 1 year ago

Are you suggesting that the rank-1 metric is equivalent to KNN with topk=1? I tried that yesterday, but the result was 43.

eafn commented 1 year ago

Could you please provide the retrieval rank-1 metric code? Thanks.

GANPerf commented 1 year ago

Already provided.

eafn commented 1 year ago

Thanks, I will try.

eafn commented 1 year ago

Alright, I have figured out what's going on. There is a fatal flaw in your retrieval experiment process.

Specifically, when testing the baseline with the provided validation code, the rank 1 accuracy is 48. However, if the nn.functional.normalize function is removed from the model's inference function, the accuracy drops to 10.5, which is consistent with the results reported in your paper.

In summary, you did not normalize the features when validating the baseline, but you did normalize them when validating your method, which is the root cause of the problem. Retrieval/KNN experiments based on cosine similarity require feature normalization. You can verify what I've said; if it's correct, I hope you will update your experimental results on arXiv, as this will seriously affect future research efforts.
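
To illustrate the point about normalization: with unnormalized features, Euclidean distance is dominated by feature magnitude, so the nearest neighbor can differ from the one chosen by cosine similarity. A toy sketch with made-up values, purely for illustration:

```python
import torch
import torch.nn.functional as F

anchor = torch.tensor([1.0, 0.0])
a = torch.tensor([10.0, 1.0])   # nearly the same direction as the anchor, but a large norm
b = torch.tensor([0.5, 0.5])    # a different direction, but a small norm

# Without normalization, b appears closer to the anchor than a.
print(torch.dist(anchor, a), torch.dist(anchor, b))          # ~9.06 vs ~0.71

# After L2 normalization (i.e., ranking by cosine similarity), a is closer.
n_anchor, n_a, n_b = (F.normalize(x, dim=0) for x in (anchor, a, b))
print(torch.dist(n_anchor, n_a), torch.dist(n_anchor, n_b))  # ~0.10 vs ~0.77
```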

Thank you for taking the time to consider my request, and I look forward to hearing from you soon.

GANPerf commented 1 year ago

Thank you, Yifan. I will thoroughly investigate this matter. If that is the case, we will update our arXiv paper.