altndrr / vic

Code implementation of our NeurIPS 2023 paper: Vocabulary-free Image Classification
https://alessandroconti.me/papers/2306.00917
MIT License

about the construction method of cc12m retrieval files #8

Closed SJLeo closed 9 months ago

SJLeo commented 1 year ago

Thank you for your outstanding work. I am now preparing to adapt the method to the ResNet-50 variant of CLIP, but I do not know how to build the cc12m retrieval files (e.g., the text.index file and the metadata directory).

altndrr commented 1 year ago

Thank you for showing interest in our work. The creation of the faiss indices can be quite complex and involves multiple steps. Here is a brief overview of the process:

  1. To begin with, you need to download the cc12m dataset using img2dataset. You can find the instructions to download the dataset here.

  2. The next step is to use the clip-retrieval library to save the image and text features of the cc12m dataset on disk. You can refer to this section of the clip-retrieval README for an example.

  3. Once you have the embeddings, you can create the faiss index using the clip-retrieval library. You can find more information about this in this section.

  4. You also need to convert the cc12m metadata to the arrow format. The command to do this is available here.

  5. Finally, you have to modify the indices.json file in this repository to add an entry for the newly created index.
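Assuming the command-line interfaces described in the img2dataset and clip-retrieval READMEs, steps 1-3 look roughly like the sketch below. The flag names follow those READMEs at the time of writing and may change between versions, so please double-check them against the linked documentation:

```shell
# 1. Download cc12m (a tsv with url/caption columns) with img2dataset.
img2dataset --url_list cc12m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" \
    --output_format webdataset --output_folder cc12m

# 2. Compute image and text embeddings with clip-retrieval.
clip-retrieval inference \
    --input_dataset cc12m \
    --output_folder cc12m_embeddings \
    --enable_metadata True

# 3. Build the faiss index from the saved embeddings.
clip-retrieval index \
    --embeddings_folder cc12m_embeddings \
    --index_folder cc12m_index
```

These commands download and process the full dataset, so they can take a long time and a lot of disk space.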

Please note that our method only requires the text of the cc12m dataset, so it is possible to skip or simplify some of the steps above. For instance, you can write a script that takes the cc12m.tsv file as input and directly embeds all the captions with your preferred CLIP model, which avoids downloading the cc12m images entirely.

Another option, since the metadata for the cc12m dataset is already available in this repository, is to write a script that reads the metadata and saves the CLIP RN50 textual embeddings to disk. After that, you can simply follow step 3 and step 5 to finalize the creation of the faiss index.

We recommend installing the img2dataset and clip-retrieval libraries in a separate Python environment, as they depend on older PyTorch versions than our repository.
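For instance, with a plain venv (the environment name below is arbitrary; conda works just as well):

```shell
python -m venv retrieval-env
source retrieval-env/bin/activate
pip install img2dataset clip-retrieval
```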

Please let us know if you face any issues while processing the index.

SJLeo commented 1 year ago

Thank you for your patience and detailed answer. I will give it a try.

altndrr commented 1 year ago

If you only need the CC12M index for RN50, I can share the version we used for one of our ablations. You can find it here. After downloading and extracting it, follow step 5 to make it visible to the retrieval server:

{
+   "RN50_CC12M": "./artifacts/models/databases/cc12m/rn50/",
    "ViT-L-14_CC12M": "./artifacts/models/databases/cc12m/vit-l-14/",
    "ViT-L-14_ENGLISH_WORDS": "./artifacts/models/databases/english_words/vit-l-14/",
    "ViT-L-14_PMD_TOP5": "./artifacts/models/databases/pmd_top5/vit-l-14/",
    "ViT-L-14_WORDNET": "./artifacts/models/databases/wordnet/vit-l-14/"
}

You can use the command below to test it with our method.

python src/train.py experiment=method/cased data=caltech101 ++model.model_name=RN50 ++model.vocabulary.retrieval_client.index_name=RN50_CC12M

Let me know if you experience any issues.