Question about CLIP crop baseline

SunzeY commented 11 months ago

Hi, sorry to bother you, but I still have trouble reaching 40 AP on LVIS with the CLIP baseline. I build each input image by padding the shorter edge. The images shown are the inputs before the standard CLIP transformation (resize to 224 and normalize with the CLIP mean and std). CLIP standard transform:

Compose(
    ToTensor()
    Resize(size=(224, 224), interpolation=bicubic, max_size=None, antialias=None)
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)

Input images (annotation ids 1-5 in LVIS_v1_val, from left to right)
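For reference, padding the shorter edge of a crop to a square can be sketched as below. This is only an illustration, not code from this repository: the function name `pad_to_square`, the centered placement, and the default fill value (the OpenAI CLIP channel means scaled to 0-255) are all assumptions.

```python
import numpy as np

# Assumption: OpenAI CLIP channel means scaled to the 0-255 range.
CLIP_PIXEL_MEAN = (0.48145466 * 255, 0.4578275 * 255, 0.40821073 * 255)


def pad_to_square(crop, fill=CLIP_PIXEL_MEAN):
    """Pad an HxWx3 uint8 crop along its shorter edge so it becomes square.

    The crop is centered; the border is filled with `fill` (one value per channel).
    """
    h, w = crop.shape[:2]
    side = max(h, w)
    out = np.empty((side, side, 3), dtype=np.uint8)
    out[...] = np.round(fill).astype(np.uint8)  # broadcast the fill color
    top = (side - h) // 2
    left = (side - w) // 2
    out[top:top + h, left:left + w] = crop  # paste the original crop
    return out
```

The padded array can then be passed through the Compose transform above, so the 224x224 resize no longer distorts the aspect ratio of the box.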

I use "a photo of a {}" as the text prompt, with [x['name'] for x in lvis.cats.values()] as the class names. But I only get 25.4 AP with the standard LVIS API.

Is there anything important that I am missing? Or would it be possible to share your code for the CLIP baseline?

By the way, I found an ICLR 2024 submission whose CLIP baseline has an overall AP similar to yours but the opposite APr, APc, and APf results.

PhyscalX commented 11 months ago

Hi @SunzeY, do you assign each box to multiple categories to match the 300-candidate limit? (See eval_cls.py.) We can only reproduce an overall AP similar to RegionSpot's by using the process in eval_cls.py. By the way, you should also pad the image with the pixel mean (see Fig. 7). We get better EVA-CLIP performance (+3 AP) with an ensemble of the following templates (this does not work for OpenAI CLIP):

templates = [
    'a photo of a {}.',
    'a photo of the {}.',
    'a bad photo of a {}.',
    'a bad photo of the {}.',
    'a good photo of a {}.',
    'a good photo of the {}.',
    'a photo of a small {}.',
    'a photo of the small {}.',
    'a photo of a large {}.',
    'a photo of the large {}.',
]
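A common way to ensemble such templates is to average the normalized text embeddings of all prompts for a class. The sketch below assumes that approach; it is not taken from eval_cls.py. `encode_text` is a hypothetical callable (a list of strings in, a 2-D array of embeddings out) standing in for whatever text encoder is used, and `build_class_embeddings` is an illustrative name.

```python
import numpy as np


def build_class_embeddings(class_names, templates, encode_text):
    """Return one L2-normalized embedding per class, averaged over templates.

    encode_text is a hypothetical callable: list[str] -> (len, dim) array.
    """
    rows = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        emb = np.asarray(encode_text(prompts), dtype=np.float32)
        # Normalize each prompt embedding before averaging.
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        mean = emb.mean(axis=0)
        # Re-normalize so cosine similarity with image features stays valid.
        rows.append(mean / np.linalg.norm(mean))
    return np.stack(rows)
```

The resulting matrix has one row per class and can be used in place of the single-prompt text embeddings when scoring each crop.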

SunzeY commented 11 months ago

I solved the problem by following your suggestion! I believe “assign each box to multiple categories” is quite a tricky way to get better AP on LVIS. Thank you very much for your reply.

PhyscalX commented 11 months ago

“assign each box to multiple categories” is the conventional procedure for open/closed-vocabulary detection. We think this procedure overestimates the power of region classification. Region description (caption/instruction/...) tasks are more challenging and useful.
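As a rough illustration of what “assign each box to multiple categories” can mean in practice, the sketch below ranks every (box, class) pair of an image by score and keeps the top 300, the per-image detection limit of the LVIS evaluation. This follows the usual detector post-processing and is an assumption about the procedure, not a copy of eval_cls.py; the function name `topk_box_class_pairs` is illustrative.

```python
import numpy as np


def topk_box_class_pairs(scores, max_dets=300):
    """Select the highest-scoring (box, class) pairs for one image.

    scores: (num_boxes, num_classes) array of per-class scores.
    Returns (box_indices, class_indices, pair_scores); a box may appear
    several times, once for each class it is assigned to.
    """
    scores = np.asarray(scores)
    num_classes = scores.shape[1]
    flat = scores.ravel()
    k = min(max_dets, flat.size)
    # Sort all pairs by descending score and keep the first k.
    order = np.argsort(-flat, kind="stable")[:k]
    return order // num_classes, order % num_classes, flat[order]


# Usage: two boxes, three classes, keep the three best pairs.
boxes, classes, kept = topk_box_class_pairs(
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]], max_dets=3)
```

In this toy call the first box is kept twice (for two different classes), which is how a ground-truth box with few competitors can fill many of the 300 candidate slots.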

SunzeY commented 11 months ago

Thank you for sharing!
