facebookresearch / vissl

VISSL is FAIR's library of extensible, modular and scalable components for SOTA Self-Supervised Learning with images.
https://vissl.ai
MIT License

Question on CLIP evaluation #485

Open ysharma1126 opened 2 years ago

ysharma1126 commented 2 years ago

Here, I attach the config YAML for the VISSL implementation of a linear probe evaluation from the CLIP benchmark. RandomResizedCrop and RandomHorizontalFlip are specified in DATA.TRAIN.TRANSFORMS, resulting in a discrepancy between DATA.TRAIN.TRANSFORMS and DATA.TEST.TRANSFORMS. However, in the linear probe evaluation provided in the CLIP code release, the preprocessing for train and test are the same, and are (on the whole) equivalent to what's specified in the config as DATA.TEST.TRANSFORMS, not DATA.TRAIN.TRANSFORMS.
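For concreteness, the shape of the discrepancy looks roughly like this (a sketch in the VISSL config style; the transform names are the ones discussed here, while the sizes and the omitted ToTensor/Normalize entries are illustrative rather than copied from the actual config):

```yaml
DATA:
  TRAIN:
    TRANSFORMS:
      - name: RandomResizedCrop
        size: 224
      - name: RandomHorizontalFlip
      # ... ToTensor / Normalize omitted
  TEST:
    TRANSFORMS:
      - name: Resize
        size: 256
      - name: CenterCrop
        size: 224
      # ... ToTensor / Normalize omitted
```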

If this discrepancy was intentional, could the reasoning be clarified for users?

iseessel commented 2 years ago

CC: @QuentinDuval

QuentinDuval commented 2 years ago

Hi @ysharma1126,

First of all, thank you for your interest in VISSL :)

Indeed, you are perfectly right: we don't use exactly the same transformations as the CLIP paper, but instead the same transformations we usually use for linear evaluations (which include RandomResizedCrop), mostly for the sake of consistency across benchmarks.

So we chose consistency whenever possible, knowing that there is not a single way to evaluate models. The only exception we make to RandomResizedCrop is for datasets where cropping would actually be harmful, such as CLEVR/Count: the task consists of counting objects, so random cropping clearly breaks it because we need to see all of them.

However, you are free to change those augmentations to reproduce the CLIP benchmark protocol, or to create a set of configurations reproducing the VTAB protocol / CLIP protocol / etc.
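As a rough illustration of what such a change could look like (a sketch only, not a config shipped with VISSL; the sizes are illustrative), one could mirror the deterministic test-time transforms on the train side to approximate the CLIP protocol:

```yaml
DATA:
  TRAIN:
    TRANSFORMS:
      - name: Resize
        size: 224
      - name: CenterCrop
        size: 224
      # ... ToTensor / Normalize, kept identical to DATA.TEST.TRANSFORMS
```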

I hope this helps :)

Quentin

ysharma1126 commented 2 years ago

Thanks! Since we are discussing the configs, a few follow-up questions:

  1. Should the user always assume the transform settings were a deliberate choice? For example, this config for iWILDS-Cam only uses RandomHorizontalFlip during training.
  2. On the topic of configs, can you confirm whether the following config replicates the linear evaluation procedure from the MoCo paper? I find certain aspects confusing: some indicators (the file name, the sync batch norm config) imply the config was meant for 8 GPUs, yet it actually uses 1 GPU (see the sketch below). While I understand that the number of processes shouldn't affect the experimental results as long as the batch size is set correctly, such incongruities give me pause as a user about whether all empirical details of the reproduction are aligned with the original paper.
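For reference, the kind of incongruity I mean looks roughly like this (key names follow the usual VISSL config layout as I understand it; the values are only illustrative of the config in question, not copied from it):

```yaml
# file name suggests an 8-GPU setup, yet:
DISTRIBUTED:
  NUM_NODES: 1
  NUM_PROC_PER_NODE: 1          # single process / single GPU
MODEL:
  SYNC_BN_CONFIG:
    CONVERT_BN_TO_SYNC_BN: True # sync BN only has an effect with >1 process
```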

Overall, I think it would be helpful if comments could be added to each config specifying (a) the reference, if any, that the config is meant to reproduce, and (b) any intentional discrepancies with that reference, like the CLIP transform difference discussed above. That being said, I entirely understand if this wouldn't be worth the time cost.

ysharma1126 commented 2 years ago

One more note on configs: the config for Faster R-CNN on VOC07+12 using the MoCo v2 setting differs from the config in the MoCo v2 repo w.r.t. SOLVER.WARMUP_ITERS. Was this intentional?
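Concretely, the parameter in question sits in the detectron2-style solver section of those configs (the value below is a placeholder, not the actual number from either config):

```yaml
SOLVER:
  WARMUP_ITERS: 100   # illustrative; the two configs disagree on this value
```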