facebookresearch / CiT

Code for the paper titled "CiT: Curation in Training for Efficient Vision-Language Data".

the core idea #3

Open huang-xx opened 1 year ago

huang-xx commented 1 year ago

The core idea of your paper is to select, from noisy datasets, training data that is similar to the metadata.

One of my questions is: when you use ImageNet's labels as metadata, isn't this just selecting data belonging to specific ImageNet categories from the noisy dataset to participate in the training process?

So what this paper does is select data close to the ImageNet distribution from the noisy dataset, train the model on it, and then compare the results on ImageNet against other models such as CLIP, and on that basis claim your method trains faster and performs better?

howardhsu commented 1 year ago

Thanks for your interest.

The curation in this paper differs from CLIP in two ways: (1) CLIP's WIT-400M is built from a much larger set of metadata (queries), including WordNet (so it covers the IN labels), whereas CiT so far uses smaller metadata sets for efficiency (see Table 7: not just IN-1K, but also IN-21K, 26 tasks combined, etc.). (2) WIT-400M is built offline, whereas CiT performs online, model-based curation at the semantic level (Table 8). Note that CiT is not substring matching or hard class assignment, and the metadata is NOT used for training. This is why Table 7 shows that even IN-1K metadata generalizes to other tasks.
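To make the distinction concrete, here is a minimal sketch of what "online, model-based curation at the semantic level" could look like, as opposed to substring matching: keep the image-text pairs whose text embedding is close to some metadata embedding. The function name, array shapes, and threshold below are illustrative assumptions for this sketch, not CiT's actual API.

```python
import numpy as np

def curate(text_embs, meta_embs, threshold=0.3):
    """Return indices of texts whose maximum cosine similarity to
    any metadata embedding meets the threshold.

    Hypothetical helper for illustration; CiT's real curation is
    defined in the paper/repo, not here.
    """
    # Normalize rows to unit length so dot products are cosine similarities.
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    m = meta_embs / np.linalg.norm(meta_embs, axis=1, keepdims=True)
    sims = t @ m.T                      # shape: (num_texts, num_metadata)
    keep = sims.max(axis=1) >= threshold
    return np.nonzero(keep)[0]

# Toy example: 3 text embeddings, 2 metadata embeddings in 4-d space.
texts = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0, 0.0]])
meta = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
print(curate(texts, meta, threshold=0.5))  # → [0 2]
```

Because the match is on embedding similarity rather than exact class strings, a caption like "a photo of my golden retriever" can be curated by dog-related metadata even though no ImageNet label appears verbatim in it, and the metadata itself never enters the training loss.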

Hope this answered your question.