cwj1412 / MSCOCO-Flikcr30K_FG

Benchmark data for "Rethinking Benchmarks for Cross-modal Image-text Retrieval" (SIGIR 2023)

Memory issue for new image pool #2

Open jaeseokbyun opened 1 year ago

jaeseokbyun commented 1 year ago

First of all, thanks for sharing this great work!

We tried to use your benchmark dataset, so we downloaded the JSON and image files. However, because the image pool is much larger (31244) than in the original dataset (5000), we cannot reproduce re-ranking-based methods such as X-VLM, which require saving the text and image feature sequences. (We used 4 A100 GPUs for evaluation.)

In the X-VLM code (https://github.com/zengyan-97/X-VLM/blob/f69044712ed840be013ec55c864f1bc3ada0b34c/Retrieval.py), as can be seen in lines 98-108, the whole image and text feature sequences are saved for use in line 143.

Could you suggest a way to deal with this issue, or share the inference code for X-VLM with your dataset?

Sincerely, Jaeseok Byun

cwj1412 commented 1 year ago

Thanks for your interest in our work!

The X-VLM code first computes all image features and text features (lines 98-105), then concatenates them separately (lines 107-108) and performs a matrix multiplication to obtain the whole image-text similarity matrix (line 110). As you mentioned, for MSCOCO-FG the whole similarity matrix is very large, which makes it hard to compute and store.

To minimize the changes to the original X-VLM code, our solution is to do the matrix calculation in two steps. First, we repeatedly concatenate only a few features, do the matrix multiplication on them, and save the partial result. Then, at the end, we concatenate all the partial results to obtain the whole similarity matrix. For example, when calculating the text-to-image (t2i) similarity matrix (25000*31244), we involve only 1000 text features in each calculation and obtain a small part of the similarity matrix (1000*31244). This significantly reduces the computational resources needed at each step.
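A minimal sketch of this chunked calculation, assuming the text and image features are already available as 2-D arrays with one row per item. It uses NumPy for brevity rather than the PyTorch tensors in X-VLM's Retrieval.py, and the function name `chunked_similarity`, the feature dimension, and the toy sizes are illustrative only, not taken from either codebase.

```python
import numpy as np


def chunked_similarity(text_feats, image_feats, chunk_size=1000):
    """Compute text_feats @ image_feats.T one block of text rows at a time.

    Hypothetical helper: only `chunk_size` text features take part in
    each matrix multiplication, and the partial results are concatenated
    at the end to form the whole similarity matrix.
    """
    blocks = []
    for start in range(0, text_feats.shape[0], chunk_size):
        chunk = text_feats[start:start + chunk_size]   # (<= chunk_size, dim)
        blocks.append(chunk @ image_feats.T)           # (<= chunk_size, n_images)
    return np.concatenate(blocks, axis=0)              # (n_texts, n_images)


# Toy sizes standing in for the 25000 texts and 31244 images of the t2i case.
rng = np.random.default_rng(0)
text_feats = rng.standard_normal((25, 8)).astype(np.float32)
image_feats = rng.standard_normal((31, 8)).astype(np.float32)

sim_t2i = chunked_similarity(text_feats, image_feats, chunk_size=10)
```

For the real sizes, the whole t2i matrix has 25000 * 31244 = 781,100,000 entries, while each 1000 * 31244 block has 31,244,000. If the entries are stored as 32-bit floats (an assumption), that is roughly 3.1 GB for the whole matrix versus about 125 MB per block. The finished blocks can be moved to CPU memory or disk before the next one is computed.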

Hope this method can help you solve this issue. :)