QinYang79 / DECL

Deep Evidential Learning with Noisy Correspondence for Cross-modal Retrieval (ACM Multimedia 2022, PyTorch Code)

Is it a fair comparison with NCR? #2

Closed ppj567 closed 1 year ago

ppj567 commented 1 year ago

Hi, thank you for your great work!

The previous method NCR employs SGR as the backbone in its paper and shows good performance. I am wondering why DECL performs even worse than NCR when employing the same SGR backbone. Besides, in the original DECL paper, is it fair to compare with NCR while adopting the stronger SGRAF backbone? Thanks a lot.


QinYang79 commented 1 year ago

The results of NCR in the paper are actually the ensemble performance of two models (co-learning), so the single-model results of our DECL-SGR are naturally worse. For a fair comparison, our paper reports the results of DECL-SGRAF.
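For reference, here is a minimal sketch of how such a two-model ensemble (e.g., NCR's two co-learned networks, or SGRAF's SAF and SGR branches) is usually evaluated: average the two image-text similarity matrices before ranking. The function name below is hypothetical, not the repo's actual API.

```python
import torch

def ensemble_similarity(sims_a: torch.Tensor, sims_b: torch.Tensor) -> torch.Tensor:
    """Average two (n_images, n_captions) similarity matrices from two models."""
    return (sims_a + sims_b) / 2.0

# Usage: rank captions for each image by the averaged scores.
# sims = ensemble_similarity(model_a_sims, model_b_sims)
# ranks = sims.argsort(dim=1, descending=True)
```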

ppj567 commented 1 year ago

Thank you for your reply!

Have you ever tried training DECL twice with the SGR backbone so as to ensemble two DECL models? Or extending NCR to the SGRAF backbone for comparison (that case may involve four models)?

QinYang79 commented 1 year ago

Sorry, I haven't tried it. But based on my experience and intuition, an ensemble of two SGRs might be better than SGRAF; you can test it yourself. I did try NCR-SGRAF (one branch is SAF and one is SGR), but if I remember correctly it was not as good as the standard NCR.

ppj567 commented 1 year ago

Thanks, my concerns are well-addressed!

By the way, I am still curious about why the noise_index greatly affects the overall performance. For instance, in the original DECL paper, it outperforms NCR by quite a margin on Flickr30K with 40% noisy pairs (Sum: 479.6 vs. 467.1). In contrast, when using the same noise_index on Flickr30K with 50% noisy pairs in this repo, DECL reports a Sum of 483.2 while that of NCR is 482.8, which are quite close.

QinYang79 commented 1 year ago

It is actually caused by two different ways of generating noisy correspondence: DECL shuffles the order of images, while NCR shuffles the order of captions. Note that each image has 5 captions in MS-COCO and Flickr30K, so our noisy correspondence generation is stricter than NCR's. Please keep this in mind when experimenting, thanks.
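A minimal sketch (my own reading of the description above, not the repo's actual code) contrasting the two noise-injection schemes. It assumes `images[i]` is paired with `captions[5*i : 5*i+5]` (5 captions per image in Flickr30K and MS-COCO).

```python
import numpy as np

def shuffle_images(n_images: int, noise_ratio: float, seed: int = 0) -> np.ndarray:
    """DECL-style noise: permute a fraction of the image indices.
    Every caption of a corrupted image becomes mismatched at once, so noise
    arrives in blocks of 5 captions (fixed points are possible in this sketch)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_images)
    noisy = rng.choice(n_images, int(noise_ratio * n_images), replace=False)
    idx[noisy] = rng.permutation(idx[noisy])
    return idx  # idx[i] is the image actually paired with image i's captions

def shuffle_captions(n_captions: int, noise_ratio: float, seed: int = 0) -> np.ndarray:
    """NCR-style noise: permute a fraction of the caption indices.
    Corrupted captions are spread across images, so an image can keep some
    matched captions while losing others."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n_captions)
    noisy = rng.choice(n_captions, int(noise_ratio * n_captions), replace=False)
    idx[noisy] = rng.permutation(idx[noisy])
    return idx
```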

ppj567 commented 1 year ago

Shuffling the order of images leads to the following situation: at each training step, an image either has five all-paired texts or five all-unpaired texts. This makes it easier for the model to learn the inductive bias of the distributions of these candidate samples. In other words, the model can better distinguish the matching-score distribution of a group of all-paired samples from that of a group of all-unpaired samples (somewhat like a classification task). This can also be seen as the foundation for extending deep evidential learning from the classification task to the retrieval task. IMHO, shuffling the order of captions is a stricter setting. Anyway, thanks for this nice work!
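A small self-contained simulation of this point (a hypothetical sketch, not from the repo): with image shuffling, the number of mismatched captions per image is either 0 or 5, while with caption shuffling it is mixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, caps_per_img, noise = 1000, 5, 0.4

# Image shuffling: whole images are corrupted, so per-image noisy-caption
# counts are either 0 or 5.
corrupted_imgs = rng.random(n_images) < noise
per_img_noisy_img_shuffle = corrupted_imgs.astype(int) * caps_per_img

# Caption shuffling: captions are corrupted independently, so counts are mixed.
corrupted_caps = rng.random(n_images * caps_per_img) < noise
per_img_noisy_cap_shuffle = corrupted_caps.reshape(n_images, caps_per_img).sum(axis=1)

print(np.unique(per_img_noisy_img_shuffle))  # [0 5]
print(np.unique(per_img_noisy_cap_shuffle))  # roughly [0 1 2 3 4 5]
```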

QinYang79 commented 1 year ago

Unfortunately, the actual situation is that NCR's noise indices are separated very well, while ours are difficult to distinguish, especially under high noise, which is consistent with the experimental results. In practical applications, it may not be easy to find multiple co-occurring pairs for the same image. You can compare the performance on CC152K, which has one-to-one noisy correspondence. Thank you for your questions and interest in our work!