PyTorch implementation of "Deep Evidential Learning with Noisy Correspondence for Cross-modal Retrieval" (ACM Multimedia 2022), a solution to the noisy correspondence problem in image-text matching.
2022-12-20. We provide the results using the same noise index as NCR, which might be helpful to your research.
Flickr30K 1K test

| Noise (%) | Methods\Metrics | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | Sum |
|---|---|---|---|---|---|---|---|---|
| 20 | NCR | 75.0 | 93.9 | 97.5 | 58.3 | 83.0 | 89.0 | 496.7 |
| 20 | DECL-SAF | 73.1 | 93.0 | 96.2 | 57.0 | 82.0 | 88.4 | 489.7 |
| 20 | DECL-SGR | 75.4 | 93.2 | 96.2 | 56.8 | 81.7 | 88.4 | 491.7 |
| 20 | DECL-SGRAF | 75.6 | 93.8 | 97.4 | 58.5 | 82.9 | 89.4 | 497.6 |
| 50 | NCR | 72.9 | 93.0 | 96.3 | 54.3 | 79.8 | 86.5 | 482.8 |
| 50 | DECL-SAF | 68.4 | 90.9 | 95.6 | 51.9 | 78.5 | 85.9 | 471.2 |
| 50 | DECL-SGR | 71.3 | 90.7 | 94.6 | 52.2 | 78.7 | 86.0 | 473.5 |
| 50 | DECL-SGRAF | 72.7 | 92.0 | 95.8 | 54.8 | 80.4 | 87.5 | 483.2 |

MS-COCO 1K 5-fold test

| Noise (%) | Methods\Metrics | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | Sum |
|---|---|---|---|---|---|---|---|---|
| 20 | NCR | 78.7 | 95.8 | 98.5 | 63.3 | 90.4 | 95.8 | 522.5 |
| 20 | DECL-SAF | 77.2 | 95.9 | 98.4 | 61.6 | 89.0 | 95.3 | 517.4 |
| 20 | DECL-SGR | 76.9 | 95.3 | 98.2 | 61.3 | 89.0 | 95.1 | 515.8 |
| 20 | DECL-SGRAF | 78.4 | 95.8 | 98.4 | 63.0 | 89.9 | 95.6 | 521.1 |
| 50 | NCR | 74.6 | 94.6 | 97.8 | 59.1 | 87.8 | 94.5 | 508.4 |
| 50 | DECL-SAF | 74.6 | 95.0 | 98.2 | 59.3 | 88.1 | 94.5 | 509.7 |
| 50 | DECL-SGR | 74.4 | 94.2 | 98.0 | 58.8 | 87.6 | 94.3 | 507.3 |
| 50 | DECL-SGRAF | 76.1 | 95.0 | 98.3 | 60.5 | 88.7 | 94.9 | 513.5 |

MS-COCO 5K test

| Noise (%) | Methods\Metrics | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | Sum |
|---|---|---|---|---|---|---|---|---|
| 20 | NCR | 56.9 | 83.6 | 91.0 | 40.6 | 69.8 | 80.1 | 422.0 |
| 20 | DECL-SAF | 54.9 | 82.5 | 90.3 | 40.1 | 68.9 | 79.6 | 416.3 |
| 20 | DECL-SGR | 55.7 | 82.2 | 90.1 | 39.8 | 68.8 | 79.4 | 416.0 |
| 20 | DECL-SGRAF | 57.2 | 83.9 | 90.9 | 41.5 | 69.9 | 80.5 | 423.9 |
| 50 | NCR | 53.1 | 80.7 | 88.5 | 37.9 | 66.6 | 77.8 | 404.6 |
| 50 | DECL-SAF | 52.6 | 80.7 | 88.6 | 37.8 | 66.6 | 77.8 | 404.1 |
| 50 | DECL-SGR | 53.1 | 80.3 | 88.5 | 37.3 | 66.4 | 77.7 | 403.3 |
| 50 | DECL-SGRAF | 54.8 | 82.0 | 89.5 | 38.8 | 67.8 | 78.9 | 411.8 |
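For readers unfamiliar with the metrics: R@K is the percentage of queries whose matching item is ranked within the top K results, and Sum adds the six recall values reported for a dataset. The following is a minimal sketch for illustration only and is not taken from this repository's eval.py. It assumes a square similarity matrix in which candidate i is the single correct match for query i, whereas the real benchmarks pair each image with several captions. The names `recall_at_k` and `recall_sum` are hypothetical.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose ground-truth candidate is ranked in the top k.

    sim[i][j] is the similarity between query i and candidate j;
    candidate i is assumed to be the only correct match for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of candidates scored strictly higher than the true match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def recall_sum(recalls):
    """Sum of recall values, rounded to one decimal like the reported results."""
    return round(sum(recalls), 1)


# Toy 3x3 similarity matrix (made-up values, not model output).
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, 1)
r3 = recall_at_k(sim, 3)

# Sum for the NCR row at 20% noise on Flickr30K, using the six values reported above.
ncr_f30k_sum = recall_sum([75.0, 93.9, 97.5, 58.3, 83.0, 89.0])
print(r1, r3, ncr_f30k_sum)
```

In the toy matrix, queries 0 and 2 rank their true match first while query 1 ranks it third, so two of three queries count as hits at K=1 and all three at K=3.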
import nltk
nltk.download('punkt')  # non-interactive equivalent of entering "d punkt" in the NLTK downloader
Our data directory structure:
.
data
├── f30k_precomp # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   ├── train_ids.txt
│   ├── train_caps.txt
│   ├── ......
│
├── coco_precomp # pre-computed BUTD region features for COCO, provided by SCAN
│   ├── train_ids.txt
│   ├── train_caps.txt
│   ├── ......
│
├── cc152k_precomp # pre-computed BUTD region features for cc152k, provided by NCR
│   ├── train_ids.txt
│   ├── train_caps.tsv
│   ├── ......
│
├── noise_file # Randomly shuffle the index of the image proportionally.
│   ├── f30k
│   │   ├── noise_inx_0.2.npy
│   │   ├── ......
│   │
│   └── coco
│       ├── noise_inx_0.2.npy
│       ├── ......
│
└── vocab # vocab files provided by SCAN and NCR
    ├── f30k_precomp_vocab.json
    ├── coco_precomp_vocab.json
    └── cc152k_precomp_vocab.json
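Before training, it can help to confirm that the data folder matches the layout listed above. The sketch below is a hypothetical helper, not part of this repository: it checks only the entries named explicitly in the listing (the elided "......" entries are skipped), and the function name `missing_paths` and the list `EXPECTED` are assumptions made for illustration.

```python
import os
import tempfile

# Entries named explicitly in the directory listing; elided entries are not checked.
EXPECTED = [
    "f30k_precomp/train_ids.txt",
    "f30k_precomp/train_caps.txt",
    "coco_precomp/train_ids.txt",
    "coco_precomp/train_caps.txt",
    "cc152k_precomp/train_ids.txt",
    "cc152k_precomp/train_caps.tsv",
    "noise_file/f30k",
    "noise_file/coco",
    "vocab/f30k_precomp_vocab.json",
    "vocab/coco_precomp_vocab.json",
    "vocab/cc152k_precomp_vocab.json",
]


def missing_paths(data_root, expected=EXPECTED):
    """Return the expected entries that do not exist under data_root."""
    missing = []
    for rel in expected:
        full = os.path.join(data_root, *rel.split("/"))
        if not os.path.exists(full):
            missing.append(rel)
    return missing


# Demo on a throwaway directory that contains a single vocab file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "vocab"))
    open(os.path.join(root, "vocab", "f30k_precomp_vocab.json"), "w").close()
    demo_missing = missing_paths(root)

print(len(demo_missing), "entries missing in the demo directory")
```

In practice you would call `missing_paths` with the same folder you pass as data_path and fix anything it reports. If you only train on one dataset, pass a shorter `expected` list.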
We follow SCAN to obtain image features and vocabularies.
Following NCR, we use a subset of Conceptual Captions (CC), named CC152K. CC152K contains 150,000 training samples from the CC training split, plus 1,000 validation samples and 1,000 testing samples from the CC validation split.
If you want to experiment with the same noise index as in the paper, the noise index files can be downloaded from here.
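To make the idea of a noise index concrete, here is a minimal sketch of how a proportion of image indices could be shuffled to simulate mismatched pairs. It is an assumption about the general procedure, not the code that produced the released files, so use the downloadable files to reproduce the paper's numbers. The function name `make_noisy_index` and its parameters are hypothetical, and the sketch uses plain Python lists rather than the .npy format of the released files.

```python
import random


def make_noisy_index(num_pairs, noise_ratio, seed=0):
    """Return a list mapping each caption position to an image index.

    A fraction `noise_ratio` of positions is selected at random and their
    image indices are shuffled among themselves; all other positions keep
    their original (correct) image index.
    """
    rng = random.Random(seed)
    index = list(range(num_pairs))          # clean correspondence: i -> i
    num_noisy = int(noise_ratio * num_pairs)
    chosen = rng.sample(range(num_pairs), num_noisy)
    shuffled = chosen[:]                    # copy, then permute the copy
    rng.shuffle(shuffled)
    for position, source in zip(chosen, shuffled):
        index[position] = source            # reassign within the chosen subset
    return index


noisy = make_noisy_index(num_pairs=10, noise_ratio=0.2, seed=0)
changed = sum(1 for i, j in enumerate(noisy) if i != j)
print(noisy, "changed positions:", changed)
```

Because the selected positions are permuted among themselves, some may land back on their original index, so the effective noise rate can be slightly below the nominal ratio.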
Modify the necessary parameters (i.e., data_path, vocab_path, noise_ratio, warmup_epoch, module_name, and folder_name) in the train_xxx.sh file and run it.
For Flickr30K:
sh train_f30k.sh
For MSCOCO:
sh train_coco.sh
For CC152K:
sh train_cc152k.sh
Modify data_path, vocab_path, and checkpoint_paths in the eval.py file and run it.
python eval.py
Our reproduced results are in evaluation_log (better than those reported in the original paper).
If DECL is useful for your research, please cite the following paper:
@inproceedings{Qin2022DECL,
author = {Qin, Yang and Peng, Dezhong and Peng, Xi and Wang, Xu and Hu, Peng},
title = {Deep Evidential Learning with Noisy Correspondence for Cross-Modal Retrieval},
year = {2022},
doi = {10.1145/3503161.3547922},
booktitle = {Proceedings of the 30th ACM International Conference on Multimedia},
pages = {4948--4956},
numpages = {9},
location = {Lisboa, Portugal},
series = {MM '22}
}
The code is based on NCR, SGRAF, and SCAN licensed under Apache 2.0.