baaivision / tokenize-anything

[ECCV 2024] Tokenize Anything via Prompting
Apache License 2.0

Question about CLIP crop baseline #4

Closed SunzeY closed 9 months ago

SunzeY commented 9 months ago

Hi, sorry to bother you, but I still have trouble reaching 40 AP on LVIS with the CLIP baseline. I build each input image by padding the shorter edge. The images below are shown before the standard CLIP transform (resize to 224 and normalize with the mean and std listed in the transform). CLIP standard transform:

Compose(
    ToTensor()
    Resize(size=(224, 224), interpolation=bicubic, max_size=None, antialias=None)
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)

Input images (annotation ids 1-5 in LVIS_v1_val, from left to right): [image]
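For readers following along, the padding step could be sketched as below. This is a minimal sketch and not the code from either poster: `pad_to_square` and `PIXEL_MEAN` are hypothetical names, and the fill value is an assumption derived from the normalization mean in the transform above (the reply further down recommends padding with the pixel mean, but the thread does not state which mean).

```python
import numpy as np

# Assumption: per-channel RGB pixel mean on a 0-255 scale, taken from the
# normalization mean in the transform above. The repository may use another value.
PIXEL_MEAN = np.array([0.48145466, 0.4578275, 0.40821073]) * 255.0


def pad_to_square(crop, fill=PIXEL_MEAN):
    """Center an HxWx3 uint8 crop on a square canvas filled with `fill`."""
    h, w = crop.shape[:2]
    side = max(h, w)
    # Start from a canvas that holds only the fill color.
    canvas = np.empty((side, side, 3), dtype=np.uint8)
    canvas[...] = np.round(fill).astype(np.uint8)
    # Paste the crop in the middle, so only the shorter edge gets padded.
    top = (side - h) // 2
    left = (side - w) // 2
    canvas[top:top + h, left:left + w] = crop
    return canvas
```

The padded array can then be converted to a PIL image and passed through the transform above, so the resize to 224 x 224 does not distort the aspect ratio of the object.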

I use 'a photo of a {}' as the text prompt, with [x['name'] for x in lvis.cats.values()] as the class names, but I only get 25.4 AP using the standard LVIS API.

Is there anything important that I am missing? If possible, could you share your code for the CLIP baseline?

By the way, I found an ICLR-24 paper submission whose CLIP baseline reports an overall AP similar to yours, but with the opposite trend across APr, APc, and APf.

PhyscalX commented 9 months ago

Hi @SunzeY, do you assign each box to multiple categories to match the 300 candidates limit? (See eval_cls.py.) We can only reproduce an overall AP similar to RegionSpot by using the process in eval_cls.py. By the way, you should also pad the image with the pixel mean (see Fig. 7). We can achieve better EVA-CLIP performance (+3 AP) using an ensemble of the following templates (this does not work for OpenAI CLIP):

templates = [
    'a photo of a {}.',
    'a photo of the {}.',
    'a bad photo of a {}.',
    'a bad photo of the {}.',
    'a good photo of a {}.',
    'a good photo of the {}.',
    'a photo of a small {}.',
    'a photo of the small {}.',
    'a photo of a large {}.',
    'a photo of the large {}.',
]
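One common way to ensemble prompt templates is to average the normalized text embeddings of all templates for each class and renormalize the result. The sketch below illustrates that idea only; it is not the code from eval_cls.py. `fake_encode_text` is a hypothetical stand-in for a real CLIP or EVA-CLIP text encoder, and `build_classifier` is an illustrative name.

```python
import numpy as np


def fake_encode_text(prompts, dim=8):
    """Hypothetical stand-in for a CLIP text encoder: one vector per prompt."""
    rows = []
    for prompt in prompts:
        # Seed from the prompt bytes so the output is deterministic.
        rng = np.random.default_rng(sum(prompt.encode("utf-8")))
        rows.append(rng.standard_normal(dim))
    return np.stack(rows)


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def build_classifier(class_names, templates, encode_text=fake_encode_text):
    """Return a (num_classes, dim) matrix of ensembled, unit-norm text embeddings."""
    weights = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]
        emb = l2_normalize(encode_text(prompts))        # normalize each prompt embedding
        weights.append(l2_normalize(emb.mean(axis=0)))  # average, then renormalize
    return np.stack(weights)
```

With a real encoder passed as `encode_text`, the class scores for a region would be the cosine similarities between its normalized image embedding and the rows of this matrix.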

SunzeY commented 9 months ago

I solved the problem by following your suggestion! I believe “assign each box to multiple categories” is a rather subtle trick for getting better AP on LVIS. Thank you very much for your reply.

PhyscalX commented 9 months ago

“Assign each box to multiple categories” is the conventional procedure for open/closed-vocabulary detection. We consider that this procedure overestimates the power of region classification. Instead, region description (caption/instruction/...) tasks are more challenging and useful.
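To make the procedure concrete: instead of keeping only the best class per box, every (box, class) pair is ranked by score and the top pairs are kept up to the per-image candidate limit. The following is a minimal sketch under that reading, not the implementation in eval_cls.py; `top_candidates` and `max_dets` are hypothetical names.

```python
import numpy as np


def top_candidates(scores, max_dets=300):
    """Pick the highest-scoring (box, class) pairs for one image.

    scores: (num_boxes, num_classes) array of classification scores.
    Returns box indices, class indices, and scores, sorted by descending score.
    """
    num_boxes, num_classes = scores.shape
    flat = scores.ravel()
    k = min(max_dets, flat.size)
    # A stable sort keeps the original order among tied scores.
    order = np.argsort(-flat, kind="stable")[:k]
    # Recover (row, column) = (box, class) from the flat index.
    box_idx, class_idx = np.divmod(order, num_classes)
    return box_idx, class_idx, flat[order]
```

In this scheme a single ground-truth box can contribute several detections with different labels, which is why the resulting AP is higher than with one label per box.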

SunzeY commented 9 months ago

Thank you for sharing!