Open jaeseokbyun opened 1 year ago
Thanks for your interest in our work!
According to the X-VLM code, it first computes all image features and text features (lines 98-105), then concatenates each set (lines 107-108) and performs a matrix multiplication to obtain the full image-text similarity matrix (line 110). As you mentioned, for MSCOCO-FG this similarity matrix becomes very large, which makes it hard to compute and store.
To minimize the cost of rewriting the original X-VLM code, our solution is to do the matrix computation in chunks. Each time we select only a small subset of features, concatenate them, and multiply them against the features on the other side, saving the partial result; at the end we concatenate all the partial results to obtain the whole similarity matrix. For example, when computing the text-to-image (t2i) similarity matrix (25000 x 31244), we involve only 1000 text features per step, producing one 1000 x 31244 slice of the matrix at a time. This significantly reduces the memory and compute requirements of the machine.
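The chunking idea above can be sketched as follows. This is a minimal illustration, not the actual X-VLM code: the function name `chunked_similarity` and the toy shapes are my own, and I use NumPy for brevity where the real code would use PyTorch tensors on GPU, but the pattern (slice the text features, multiply each slice against all image features, concatenate the slices) is the same.

```python
import numpy as np

def chunked_similarity(text_feats, image_feats, chunk_size=1000):
    """Compute the full text-to-image similarity matrix in chunks.

    Instead of one huge (num_texts x num_images) matmul, we process
    `chunk_size` text features at a time and concatenate the partial
    results. This bounds peak memory without changing the output.
    """
    parts = []
    for start in range(0, text_feats.shape[0], chunk_size):
        chunk = text_feats[start:start + chunk_size]  # (chunk, dim)
        parts.append(chunk @ image_feats.T)           # (chunk, num_images)
    return np.concatenate(parts, axis=0)              # (num_texts, num_images)

# Toy check with downscaled sizes: the chunked result matches
# the one-shot matrix multiplication exactly.
texts = np.random.rand(2500, 64).astype(np.float32)
images = np.random.rand(3124, 64).astype(np.float32)
sims = chunked_similarity(texts, images, chunk_size=1000)
```

For the real 25000 x 31244 case, each 1000 x 31244 slice is small enough to fit in GPU memory, and the slices can even be moved to CPU before the final concatenation if needed.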
Hope this method can help you solve this issue. :)
First of all, thanks for sharing great work!
We tried to use your benchmark dataset, so we downloaded the JSON and image files. However, since the image pool is very large (31244 images) compared to the original dataset (5000), we cannot reproduce re-ranking-based methods such as X-VLM, which require saving the full text and image feature sequences. (We used 4 A100 GPUs for evaluation.)
In the X-VLM code (https://github.com/zengyan-97/X-VLM/blob/f69044712ed840be013ec55c864f1bc3ada0b34c/Retrieval.py), as can be seen in lines 98-108, the whole image and text feature sequences are saved for use at line 143.
Could you give any suggestions for dealing with this issue? Or could you share the inference code for X-VLM on your dataset?
Sincerely, Jaeseok Byun