sungonce opened this issue 2 years ago
Thanks @sungonce for the experiments and for looping me in here!
Indeed, as expected, the numbers go down in the 1M evals. The numbers still look quite high to me, though; I wouldn't have expected such high performance in the 1M evals (but I could be wrong!). Anyway, it would be great if the authors @feymanpriv could re-run the experiments and confirm the results.
To explain my reasoning on why I wouldn't have expected such high DOLG performance in the 1M evals: if we look at Tab 1 in the DOLG paper, we see a huge difference between "R50-DELG(GLDv2-clean)" and "R50-DELG(GLDv2-clean)^r (reproduced)" in Section C of the table (8.7% in ROxfM, 8.09% in RParM, 8.8% in ROxfH, 15.9% in RParH). My understanding is that this difference is mainly due to query cropping (please correct me if I am wrong). So that's why I would have expected a larger drop when query cropping is done.
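For anyone following along, "query cropping" here means restricting each query image to its annotated bounding box before extracting features, as the Revisited Oxford/Paris protocol requires. A minimal sketch of that step, assuming the `gnd` pickle layout from the revisitop toolkit (each query entry carries a `bbx` box); the paths and the descriptor extractor are placeholders:

```python
import pickle
from PIL import Image

# Ground-truth file shipped with the ROxford5k toolkit (path/keys follow the
# revisitop convention; adjust for your local setup).
with open("roxford5k/gnd_roxford5k.pkl", "rb") as f:
    cfg = pickle.load(f)

def load_cropped_query(idx):
    """Load query `idx` cropped to its annotated bounding box."""
    name = cfg["qimlist"][idx]
    im = Image.open(f"roxford5k/jpg/{name}.jpg").convert("RGB")
    x1, y1, x2, y2 = map(int, cfg["gnd"][idx]["bbx"])  # query region of interest
    return im.crop((x1, y1, x2, y2))                   # the "query cropping" step

# query_descriptor = extract_global_descriptor(load_cropped_query(0))  # hypothetical extractor
```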
> Thanks @sungonce for the experiments and for looping me in here!
>
> Indeed, as expected, the numbers go down in the 1M evals. The numbers still look quite high to me, though; I wouldn't have expected such high performance in the 1M evals (but I could be wrong!). Anyway, it would be great if the authors @feymanpriv could re-run the experiments and confirm the results.
Dear @andrefaraujo, I updated the second row of the table, as below:
| Model | ROxf-M | +1M | RPar-M | +1M | ROxf-H | +1M | RPar-H | +1M |
|---|---|---|---|---|---|---|---|---|
| DOLG-paper (maybe w/o query cropping) | 81.5 | 77.4 | 91.0 | 83.3 | 61.1 | 54.8 | 80.3 | 66.7 |
| DOLG-paddle-weight (w/o query cropping) | 83.5 | 79.3 | 91.8 | 83.1 | 65.3 | 58.3 | 82.8 | 67.7 |
| DOLG-paddle-weight (with query cropping) | 82.2 | 74.0 | 91.1 | 80.7 | 64.4 | 51.4 | 81.9 | 63.2 |
As you can see from the first and second rows of the table I posted (the two results under (maybe) the same query-cropping condition), the authors probably improved the performance compared to what was reported at the time of the paper submission. Perhaps this is one reason why the drop in the 1M eval (comparing the first and third rows) seems small. Also, comparing the second and third rows, the performance difference due to query cropping is quite large, but not as large as the difference between the DELG entries in Section C of the DOLG paper. Presumably, the authors' reproduction method also contributed to the performance difference between the DELG entries to some extent.
What I did was a simple check, and I can't vouch for it since I'm not the author. I also hope the authors (@feymanpriv) will re-run the experiments with the correct evaluation protocol and revise the paper if necessary, so that other papers can refer to the correct performance. It should also be noted that this may affect the reviews/results of many (landmark) image retrieval papers submitted or to be submitted to conferences/journals, especially where the review process is currently underway (e.g., CVPR 2022).
Thanks again @sungonce !
> Presumably, the authors' reproduction method also contributed to the performance difference between the DELG entries to some extent.
I see, that would indeed make sense.
> I also hope the authors (@feymanpriv) will re-run the experiments with the correct evaluation protocol and revise the paper if necessary, so that other papers can refer to the correct performance.
+1, I hope they will respond to this thread and update their paper accordingly. Right now, we cannot know whether the numbers in their paper can be trusted.
@sungonce @andrefaraujo Sorry for the misreported performances, which were due to negligence by a trainee in the evaluation process. I have modified part of the final results in this repo, and the 1M results will be updated soon. In fact, almost all performances of DOLG based on ResNet50 and ResNet101 are evidently higher than those in the paper without 1M distractors (under the query cropping protocol). Model weights will be uploaded for verification.
Thanks @feymanpriv for the update. Looking forward to the 1M results soon!
Hi @feymanpriv, I conducted an experiment with the R101-DOLG PyTorch model you uploaded and found huge differences between the performance on each dataset and the numbers you reported, especially in the +R1M setting. The results are shown below (5-scale inference and query cropping are adopted). @sungonce @andrefaraujo

| Model | ROxf (M) | ROxf + R1M (M) | RPar (M) | RPar + R1M (M) | ROxf (H) | ROxf + R1M (H) | RPar (H) | RPar + R1M (H) |
|---|---|---|---|---|---|---|---|---|
| R101-DOLG (yours) | 82.37 | 73.63 | 90.97 | 80.44 | 64.93 | 51.57 | 81.71 | 62.95 |
| R101-DOLG (my re-test) | 80.12 | 66.64 | 89.19 | 73.00 | 60.10 | 39.94 | 77.34 | 52.44 |
The evaluation and feature extraction code provided by the authors of the Revisited Oxford/Paris paper is used for all experiments. A screenshot of our re-test is shown below.
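For completeness, the scoring itself is just a cosine-similarity ranking of database descriptors against each query descriptor, evaluated with the medium/hard mAP protocol. A rough sketch, assuming L2-normalized global descriptors stored column-wise; the `compute_map` call follows the revisitop Python helpers and should be treated as an assumption if your copy differs:

```python
import numpy as np

def rank_database(Q, X):
    """Q: (D, n_queries), X: (D, n_db); columns are L2-normalized descriptors.
    Returns ranks[i, j] = db index of the i-th most similar image for query j."""
    sim = X.T @ Q                    # cosine similarity for unit-norm columns
    return np.argsort(-sim, axis=0)

# ranks = rank_database(Q, X)
# Then, per the revisitop toolkit (hedged):
# from evaluate import compute_map
# mapM, _, _, _ = compute_map(ranks, gnd_medium)  # easy+hard positives, junk ignored
# mapH, _, _, _ = compute_map(ranks, gnd_hard)    # hard positives only
```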
Hi, thanks for posting good code, @feymanpriv.
I am aware of the issue above. I am writing a paper and would like to cite yours.
I would like a clear answer. For the results published at https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_DOLG_Single-Stage_Image_Retrieval_With_Deep_Orthogonal_Fusion_of_Local_ICCV_2021_paper.pdf: are the numbers in "Table 1" the result of non-cropped queries? (I also checked, and according to the code, the queries are not cropped.)
You seem to have mentioned this already, but a clearer answer is needed.
To be more specific, I mean the red box in the table below.
This should be made clear, and I think it's very important for future researchers. I think you should give a clear answer to this. cc. @andrefaraujo @sungonce @HomeworkSOTA
Thanks.
After carefully reading your open-source PyTorch code, I found some unfairness. In your code, an updated ResNet pre-trained model is used (the BGR pre-trained model provided by Facebook), which performs better on the ImageNet dataset, whereas existing methods such as GeM and SOLAR use the one provided by Filip Radenovic. I think this may be the reason why your model has such surprisingly good performance on R1M. Besides that, I think this is an unfair comparison, and you need to compare on a consistent baseline to truly show the effectiveness of your method. cc. @feymanpriv @andrefaraujo @sungonce @peternara
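To make the point concrete: the two pre-trained backbones are not drop-in replacements, since they also expect different input preprocessing. Radenovic-style torchvision weights take RGB inputs with ImageNet mean/std, while Caffe/pycls-style weights take BGR channel order. A minimal sketch of the two pipelines; the BGR constants below are illustrative and should be read from the repo's config:

```python
import torch
import torchvision.transforms as T

# Pipeline A: torchvision/Radenovic-style weights -- RGB input, ImageNet mean/std.
rgb_preprocess = T.Compose([
    T.ToTensor(),                                      # RGB, range [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Pipeline B: pycls/Caffe-style weights -- BGR channel order.
# (Constants are illustrative; the actual mean/std should come from the repo's config.)
def bgr_preprocess(pil_img):
    x = T.ToTensor()(pil_img)                          # RGB, range [0, 1]
    x = x[[2, 1, 0], :, :]                             # flip channels RGB -> BGR
    mean = torch.tensor([0.406, 0.456, 0.485]).view(3, 1, 1)
    std = torch.tensor([0.225, 0.224, 0.229]).view(3, 1, 1)
    return (x - mean) / std
```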
I found that a paper named "Deep Fusion of Multi-attentive Local and Global Features with Higher Efficiency for Image Retrieval" was recently submitted to ICLR 2022, and its first author, Baorong Shi, is the fourth author of the DOLG paper. I noticed that DELG was also re-implemented in that paper, but the results are dramatically different from those in the DOLG paper, especially under the R1M setting. Since the same author, Baorong Shi, appears in the author lists of both papers, I think there must be some problems. @peternara @feymanpriv @andrefaraujo @sungonce
> I would like a clear answer. For the results published at https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_DOLG_Single-Stage_Image_Retrieval_With_Deep_Orthogonal_Fusion_of_Local_ICCV_2021_paper.pdf: are the numbers in "Table 1" the result of non-cropped queries? (I also checked, and according to the code, the queries are not cropped.) [...]
In fact, the results in the paper were misreported. I guess the reason may be that the tested model was not the final and best one and that the queries were not cropped. I have checked and reproduced the final results in this repo [https://github.com/feymanpriv/DOLG]. The results are a little better than those in the paper.
> I found that a paper named "Deep Fusion of Multi-attentive Local and Global Features with Higher Efficiency for Image Retrieval" was recently submitted to ICLR 2022, and its first author, Baorong Shi, is the fourth author of the DOLG paper. [...]
Sorry, I have no idea about this paper ("Deep Fusion of Multi-attentive Local and Global Features with Higher Efficiency for Image Retrieval"). The author left the company long ago, and the DELG results there may have been re-implemented by her independently.
> Hi @feymanpriv, I conducted an experiment with the R101-DOLG PyTorch model you uploaded and found huge differences between the performance on each dataset and the numbers you reported, especially in the +R1M setting. [...]
Can you provide me with your evaluation code? I tested the model with both torch and paddle, and the results were similar to those sungonce obtained.
I use the evaluation code provided by filipradenovic. I just replaced the model in the evaluation code with the PyTorch model you provided and took the BGR input format into account. I don't think you should have initialized the ResNet101 model with a better pre-trained model, as that leads to an unfair comparison. I have tried to retrain DOLG with the pre-trained model provided by filipradenovic and the training code you provided, and the results show no significant improvement relative to DELG. Besides, I also found another problem. In the paper, you said "We randomly divide 80% of the dataset for training and the rest 20% for validation," but in the train_image_list you provided, all images from the GLDv2-clean dataset are present.
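On the 80/20 point: if the split described in the paper had been applied, the released train_image_list should contain roughly 80% of GLDv2-clean rather than all of it. A trivial sketch of such a split (the list file name is hypothetical):

```python
import random

# Read the full GLDv2-clean image list (one image id per line; file name is hypothetical).
with open("gldv2_clean_list.txt") as f:
    images = [line.strip() for line in f if line.strip()]

random.seed(0)
random.shuffle(images)

split = int(0.8 * len(images))
train_list, val_list = images[:split], images[split:]   # 80% train / 20% validation

print(len(train_list), len(val_list))
```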
> I use the evaluation code provided by filipradenovic. I just replaced the model in the evaluation code with the PyTorch model you provided and took the BGR input format into account. [...]
Hi @HomeworkSOTA, have you obtained any response from the author, perhaps privately? I'm also trying to reproduce the DOLG performance with their PyTorch code, but it is difficult even when using the pretrained model provided by Facebook, because the parameter settings in the project differ from those in the published paper. I don't know which setting I should use.
Hello @feymanpriv, first of all, thank you for sharing good code, including this repo :D
I saw the query image cropping issue in the ROxford / RParis evaluation raised by @andrefaraujo.
Using the code and the R101-DOLG model weights you uploaded, I ran the ROxford5k / RParis6k experiments with query image cropping. The results are shown in the table below (5-scale multi-scale inference is applied, following your paper):
The results of this experiment seem to indicate that the numbers in the paper may have been misreported.
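For reference, the multi-scale inference mentioned above usually means extracting a global descriptor at several image scales and averaging the L2-normalized results. A hedged sketch with `model` standing in for any global-descriptor network; the 5-scale set below is a commonly used choice, and the exact scales should be taken from the repo's config:

```python
import torch
import torch.nn.functional as F

# Scales are an assumption (a commonly used 5-scale set); take the real values from the config.
SCALES = [0.5, 2 ** -0.5, 1.0, 2 ** 0.5, 2.0]

@torch.no_grad()
def multiscale_descriptor(model, image):
    """image: (1, 3, H, W) preprocessed tensor; model returns a (1, D) global descriptor."""
    descs = []
    for s in SCALES:
        x = image if s == 1.0 else F.interpolate(
            image, scale_factor=s, mode="bilinear", align_corners=False)
        descs.append(F.normalize(model(x), dim=1))   # L2-normalize each scale's descriptor
    d = torch.stack(descs, dim=0).mean(dim=0)        # average over scales
    return F.normalize(d, dim=1)                     # renormalize the averaged descriptor
```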
P.S. DOLG incorporates the CurricularFace loss, not ArcFace. However, the paper only describes ArcFace. This seems to need correction.
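For anyone checking the loss discrepancy: ArcFace only adds an additive angular margin to the target-class logit, while CurricularFace additionally re-weights hard negative logits by an adaptively updated factor t. A simplified sketch of the logit modification, based on my reading of the two papers rather than on the DOLG code; the s and m defaults are illustrative:

```python
import torch

def arcface_logits(cos, labels, s=30.0, m=0.15):
    """cos: (N, C) cosines to class centers; additive angular margin on the target class only."""
    idx = torch.arange(cos.size(0))
    theta_y = torch.acos(cos[idx, labels].clamp(-1 + 1e-7, 1 - 1e-7))
    out = cos.clone()
    out[idx, labels] = torch.cos(theta_y + m)
    return s * out

def curricularface_logits(cos, labels, t, s=30.0, m=0.15):
    """Same margin on the target class, but hard negatives (cos_j > cos(theta_y + m))
    are re-weighted by (t + cos_j); t is a statistic updated during training."""
    idx = torch.arange(cos.size(0))
    theta_y = torch.acos(cos[idx, labels].clamp(-1 + 1e-7, 1 - 1e-7))
    target = torch.cos(theta_y + m)                  # margined target-class logit
    out = cos.clone()
    hard = cos > target.unsqueeze(1)                 # negatives harder than the margined target
    out[hard] = cos[hard] * (t + cos[hard])
    out[idx, labels] = target                        # restore the target column
    return s * out
```

In CurricularFace, t is typically initialized near 0 and updated during training as an exponential moving average of the mean target cosine, so the hard-negative re-weighting grows as training progresses.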