sungonce opened this issue 2 years ago
Thanks @sungonce for the experiments and for looping me in here!
Indeed, as expected, the numbers go down in the 1M evals. The numbers still look quite high to me, though; I wouldn't have expected such high performance in the 1M evals (but I could be wrong!). Anyway, it would be great if the authors @feymanpriv could re-run the experiments and confirm the results.
To explain my reasoning on why I wouldn't have expected such high DOLG performance in the 1M evals: if we look at Tab 1 in the DOLG paper, we see a huge difference between "R50-DELG(GLDv2-clean)" and "R50-DELG(GLDv2-clean)^r (reproduced)" in Section C of the table (8.7% in ROxfM, 8.09% in RParM, 8.8% in ROxfH, 15.9% in RParH). My understanding is that this difference is mainly due to query cropping (please correct me if I am wrong). So that's why I would have expected a larger drop when query cropping is done.
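For anyone following along, "query cropping" here means restricting each query image to its annotated bounding box before extracting features, as the Revisited Oxford/Paris protocol requires. A minimal sketch of that step, assuming the `gnd` pickle layout from the revisitop toolkit (each query entry carries a `bbx` box); the paths and the descriptor extractor are placeholders:

```python
import pickle
from PIL import Image

# Ground-truth file shipped with the ROxford5k toolkit (path/keys follow the
# revisitop convention; adjust for your local setup).
with open("roxford5k/gnd_roxford5k.pkl", "rb") as f:
    cfg = pickle.load(f)

def load_cropped_query(idx):
    """Load query `idx` cropped to its annotated bounding box."""
    name = cfg["qimlist"][idx]
    im = Image.open(f"roxford5k/jpg/{name}.jpg").convert("RGB")
    x1, y1, x2, y2 = map(int, cfg["gnd"][idx]["bbx"])  # query region of interest
    return im.crop((x1, y1, x2, y2))                   # the "query cropping" step

# query_descriptor = extract_global_descriptor(load_cropped_query(0))  # hypothetical extractor
```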
> Thanks @sungonce for the experiments and for looping me in here!
>
> Indeed, as expected, the numbers go down in the 1M evals. The numbers still look quite high to me, though; I wouldn't have expected such high performance in the 1M evals (but I could be wrong!). Anyway, it would be great if the authors @feymanpriv could re-run the experiments and confirm the results.
Dear @andrefaraujo, I updated the second row of the table, as below:
| Model | ROxf-M | +1M | RPar-M | +1M | ROxf-H | +1M | RPar-H | +1M |
|---|---|---|---|---|---|---|---|---|
| DOLG-paper (maybe w/o query cropping) | 81.5 | 77.4 | 91.0 | 83.3 | 61.1 | 54.8 | 80.3 | 66.7 |
| DOLG-paddle-weight (w/o query cropping) | 83.5 | 79.3 | 91.8 | 83.1 | 65.3 | 58.3 | 82.8 | 67.7 |
| DOLG-paddle-weight (with query cropping) | 82.2 | 74.0 | 91.1 | 80.7 | 64.4 | 51.4 | 81.9 | 63.2 |
As you can see from the first and second rows of the table I posted (the two results under (maybe) the same query-cropping condition), the authors probably improved the performance compared to what was reported at the time of the paper submission. Perhaps this is one reason why the drop in the 1M eval (comparing the first and third rows) seems small. Also, comparing the second and third rows, the performance difference due to query cropping is quite large, but not as large as the difference between the DELG entries in Section C of the DOLG paper. Presumably, the authors' reproduction method also contributed to the performance difference between the DELG entries to some extent.
What I did was a simple check, and I can't vouch for it since I'm not the author. I also hope the authors (@feymanpriv) will re-run the experiments with the correct evaluation protocol and revise the paper if necessary, so that other papers can refer to the correct performance. It should also be noted that this may affect the reviews/results of many (landmark) image retrieval papers submitted or to be submitted to conferences/journals, especially where the review process is currently underway (e.g., CVPR 2022).
Thanks again @sungonce !
> Presumably, the authors' reproduction method also contributed to the performance difference between the DELG entries to some extent.
I see, that would indeed make sense.
> I also hope the authors (@feymanpriv) will re-run the experiments with the correct evaluation protocol and revise the paper if necessary, so that other papers can refer to the correct performance.
+1, I hope they will respond to this thread and update their paper accordingly. Right now, we cannot know whether the numbers in their paper can be trusted.
@sungonce @andrefaraujo Sorry for the misreported performances, which were due to negligence by a trainee in the evaluation process. I have modified part of the final results in this repo, and the 1M results will be updated soon. In fact, almost all performances of DOLG based on ResNet50 and ResNet101 are evidently higher than those in the paper without 1M distractors (under the query cropping protocol). Model weights will be uploaded for verification.
Thanks @feymanpriv for the update. Looking forward to the 1M results soon!
Hi @feymanpriv, I conducted an experiment with the R101-DOLG PyTorch model you uploaded and found huge differences between the performance on each dataset and the numbers you reported, especially in the +R1M setting. The results are shown below (5-scale inference and query cropping are adopted). @sungonce @andrefaraujo

| Model | ROxf (M) | ROxf + R1M (M) | RPar (M) | RPar + R1M (M) | ROxf (H) | ROxf + R1M (H) | RPar (H) | RPar + R1M (H) |
|---|---|---|---|---|---|---|---|---|
| R101-DOLG (yours) | 82.37 | 73.63 | 90.97 | 80.44 | 64.93 | 51.57 | 81.71 | 62.95 |
| R101-DOLG (my re-test) | 80.12 | 66.64 | 89.19 | 73.00 | 60.10 | 39.94 | 77.34 | 52.44 |
The evaluation and feature extraction code provided by the authors of the Revisited Oxford/Paris paper is used for all experiments. A screenshot of our re-test is shown below.
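For completeness, the scoring itself is just a cosine-similarity ranking of database descriptors against each query descriptor, evaluated with the medium/hard mAP protocol. A rough sketch, assuming L2-normalized global descriptors stored column-wise; the `compute_map` call follows the revisitop Python helpers and should be treated as an assumption if your copy differs:

```python
import numpy as np

def rank_database(Q, X):
    """Q: (D, n_queries), X: (D, n_db); columns are L2-normalized descriptors.
    Returns ranks[i, j] = db index of the i-th most similar image for query j."""
    sim = X.T @ Q                    # cosine similarity for unit-norm columns
    return np.argsort(-sim, axis=0)

# ranks = rank_database(Q, X)
# Then, per the revisitop toolkit (hedged):
# from evaluate import compute_map
# mapM, _, _, _ = compute_map(ranks, gnd_medium)  # easy+hard positives, junk ignored
# mapH, _, _, _ = compute_map(ranks, gnd_hard)    # hard positives only
```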
Hi, thanks for posting good code, @feymanpriv.
I am aware of the issue above. I am writing a paper and would like to cite yours.
I would like a clear answer. For the results published at https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_DOLG_Single-Stage_Image_Retrieval_With_Deep_Orthogonal_Fusion_of_Local_ICCV_2021_paper.pdf: are the numbers in "Table 1" the result of non-cropped queries? (I also checked, and according to the code, the queries are not cropped.)
You seem to have mentioned this already, but a clearer answer is needed.
To be more specific, I mean the red box in the table below.
This should be made clear, and I think it's very important for future researchers. I think you should give a clear answer to this. cc. @andrefaraujo @sungonce @HomeworkSOTA
Thanks.
After carefully reading your open-source PyTorch code, I found some unfairness. In your code, an updated ResNet pre-trained model is used (the BGR pre-trained model provided by Facebook), which performs better on the ImageNet dataset, whereas existing methods such as GeM and SOLAR use the one provided by Filip Radenovic. I think this may be the reason why your model has such surprisingly good performance on R1M. Besides that, I think this is an unfair comparison, and you need to compare on a consistent baseline to truly show the effectiveness of your method. cc. @feymanpriv @andrefaraujo @sungonce @peternara
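To make the point concrete: the two pre-trained backbones are not drop-in replacements, since they also expect different input preprocessing. Radenovic-style torchvision weights take RGB inputs with ImageNet mean/std, while Caffe/pycls-style weights take BGR channel order. A minimal sketch of the two pipelines; the BGR constants below are illustrative and should be read from the repo's config:

```python
import torch
import torchvision.transforms as T

# Pipeline A: torchvision/Radenovic-style weights -- RGB input, ImageNet mean/std.
rgb_preprocess = T.Compose([
    T.ToTensor(),                                      # RGB, range [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

# Pipeline B: pycls/Caffe-style weights -- BGR channel order.
# (Constants are illustrative; the actual mean/std should come from the repo's config.)
def bgr_preprocess(pil_img):
    x = T.ToTensor()(pil_img)                          # RGB, range [0, 1]
    x = x[[2, 1, 0], :, :]                             # flip channels RGB -> BGR
    mean = torch.tensor([0.406, 0.456, 0.485]).view(3, 1, 1)
    std = torch.tensor([0.225, 0.224, 0.229]).view(3, 1, 1)
    return (x - mean) / std
```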
I found that a paper named "Deep Fusion of Multi-attentive Local and Global Features with Higher Efficiency for Image Retrieval" was recently submitted to ICLR 2022, and its first author, Baorong Shi, is the fourth author of the DOLG paper. I noticed that DELG was also re-implemented in that paper, but the results are dramatically different from those in the DOLG paper, especially under the R1M setting. Since the same author, Baorong Shi, appears in the author lists of both papers, I think there must be some problems. @peternara @feymanpriv @andrefaraujo @sungonce
> I would like a clear answer. For the results published at https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_DOLG_Single-Stage_Image_Retrieval_With_Deep_Orthogonal_Fusion_of_Local_ICCV_2021_paper.pdf: are the numbers in "Table 1" the result of non-cropped queries? (I also checked, and according to the code, the queries are not cropped.) [...]
In fact, the results in the paper were misreported. I guess the reason may be that the tested model was not the final and best one and that the queries were not cropped. I have checked and reproduced the final results in this repo [https://github.com/feymanpriv/DOLG]. The results are a little better than those in the paper.
> I found that a paper named "Deep Fusion of Multi-attentive Local and Global Features with Higher Efficiency for Image Retrieval" was recently submitted to ICLR 2022, and its first author, Baorong Shi, is the fourth author of the DOLG paper. [...]
Sorry, I have no idea about this paper ("Deep Fusion of Multi-attentive Local and Global Features with Higher Efficiency for Image Retrieval"). The author left the company long ago, and the DELG results there may have been re-implemented by her independently.
> Hi @feymanpriv, I conducted an experiment with the R101-DOLG PyTorch model you uploaded and found huge differences between the performance on each dataset and the numbers you reported, especially in the +R1M setting. [...]
Can you provide me with your evaluation code? I tested the model with both torch and paddle, and the results were similar to those sungonce obtained.
I use the evaluation code provided by filipradenovic. I just replaced the model in the evaluation code with the PyTorch model you provided and took the BGR input format into account. I don't think you should have initialized the ResNet101 model with a better pre-trained model, as that leads to an unfair comparison. I have tried to retrain DOLG with the pre-trained model provided by filipradenovic and the training code you provided, and the results show no significant improvement relative to DELG. Besides, I also found another problem. In the paper, you said "We randomly divide 80% of the dataset for training and the rest 20% for validation," but in the train_image_list you provided, all images from the GLDv2-clean dataset are present.
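On the 80/20 point: if the split described in the paper had been applied, the released train_image_list should contain roughly 80% of GLDv2-clean rather than all of it. A trivial sketch of such a split (the list file name is hypothetical):

```python
import random

# Read the full GLDv2-clean image list (one image id per line; file name is hypothetical).
with open("gldv2_clean_list.txt") as f:
    images = [line.strip() for line in f if line.strip()]

random.seed(0)
random.shuffle(images)

split = int(0.8 * len(images))
train_list, val_list = images[:split], images[split:]   # 80% train / 20% validation

print(len(train_list), len(val_list))
```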
> I use the evaluation code provided by filipradenovic. I just replaced the model in the evaluation code with the PyTorch model you provided and took the BGR input format into account. [...]
Hi @HomeworkSOTA, have you obtained any response from the author, perhaps privately? I'm also trying to reproduce the DOLG performance with their PyTorch code, but it is difficult even when using the pretrained model provided by Facebook, because the parameter settings in the project differ from those in the published paper. I don't know which setting I should use.
Hello @feymanpriv, first of all, thank you for sharing good code, including this repo :D
I saw the query image cropping issue in the ROxford / RParis evaluation raised by @andrefaraujo.
Using the code and the R101-DOLG model weights you uploaded, I ran the ROxford5k / RParis6k experiments with query image cropping. The results are shown in the table below (5-scale multi-scale inference is applied, following your paper):
The results of this experiment seem to indicate that the numbers in the paper may have been misreported.
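For reference, the multi-scale inference mentioned above usually means extracting a global descriptor at several image scales and averaging the L2-normalized results. A hedged sketch with `model` standing in for any global-descriptor network; the 5-scale set below is a commonly used choice, and the exact scales should be taken from the repo's config:

```python
import torch
import torch.nn.functional as F

# Scales are an assumption (a commonly used 5-scale set); take the real values from the config.
SCALES = [0.5, 2 ** -0.5, 1.0, 2 ** 0.5, 2.0]

@torch.no_grad()
def multiscale_descriptor(model, image):
    """image: (1, 3, H, W) preprocessed tensor; model returns a (1, D) global descriptor."""
    descs = []
    for s in SCALES:
        x = image if s == 1.0 else F.interpolate(
            image, scale_factor=s, mode="bilinear", align_corners=False)
        descs.append(F.normalize(model(x), dim=1))   # L2-normalize each scale's descriptor
    d = torch.stack(descs, dim=0).mean(dim=0)        # average over scales
    return F.normalize(d, dim=1)                     # renormalize the averaged descriptor
```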
P.S. DOLG incorporates the CurricularFace loss, not ArcFace. However, the paper only describes ArcFace. This seems to need correction.
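For anyone checking the loss discrepancy: ArcFace only adds an additive angular margin to the target-class logit, while CurricularFace additionally re-weights hard negative logits by an adaptively updated factor t. A simplified sketch of the logit modification, based on my reading of the two papers rather than on the DOLG code; the s and m defaults are illustrative:

```python
import torch

def arcface_logits(cos, labels, s=30.0, m=0.15):
    """cos: (N, C) cosines to class centers; additive angular margin on the target class only."""
    idx = torch.arange(cos.size(0))
    theta_y = torch.acos(cos[idx, labels].clamp(-1 + 1e-7, 1 - 1e-7))
    out = cos.clone()
    out[idx, labels] = torch.cos(theta_y + m)
    return s * out

def curricularface_logits(cos, labels, t, s=30.0, m=0.15):
    """Same margin on the target class, but hard negatives (cos_j > cos(theta_y + m))
    are re-weighted by (t + cos_j); t is a statistic updated during training."""
    idx = torch.arange(cos.size(0))
    theta_y = torch.acos(cos[idx, labels].clamp(-1 + 1e-7, 1 - 1e-7))
    target = torch.cos(theta_y + m)                  # margined target-class logit
    out = cos.clone()
    hard = cos > target.unsqueeze(1)                 # negatives harder than the margined target
    out[hard] = cos[hard] * (t + cos[hard])
    out[idx, labels] = target                        # restore the target column
    return s * out
```

In CurricularFace, t is typically initialized near 0 and updated during training as an exponential moving average of the mean target cosine, so the hard-negative re-weighting grows as training progresses.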