baaivision / tokenize-anything

[ECCV 2024] Tokenize Anything via Prompting
Apache License 2.0

Question about experiment setting. #3

Closed SunzeY closed 6 months ago

SunzeY commented 6 months ago

Awesome work, congratulations! I have some questions about the experiment settings.

  1. In zero-shot instance segmentation, you still use the ViTDet classification results. However, the TAP model can generate a semantic token and perform classification itself. Have you tried treating ViTDet as a pure object-proposal network and using TAP's classification results for this task?
  2. In zero-shot instance classification, crop->CLIP makes a strong baseline. I have tried this before but could not reach your AP. In my implementation, I center-crop a square region scaled up by 1.5x (see the sketch after this list). Are there any other tricks to improve the classification accuracy?
  3. Similar to 2, does your data-annotation process use any special tricks when cropping the image sent into CLIP? Did you use the SA ground-truth masks for background blurring or something else? (I see in Fig. 7 that you paste the masked object onto a solid-color background. I suspect this removes a lot of the context CLIP needs to produce a correct image feature. Does this mean a little fine-tuning of the CLIP model, e.g. MaskAdaptedCLIP or Alpha-CLIP, could extract better features for knowledge distillation?)
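For reference, here is a minimal sketch of my crop->CLIP baseline from question 2, assuming open_clip and PIL. The box format (x0, y0, x1, y1), the model choice, and all helper names are my own, not from the TAP codebase:

```python
# Minimal sketch of the crop->CLIP baseline in question 2.
# Assumes open_clip and PIL; all names here are illustrative.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai")
model.eval()

def square_crop(image: Image.Image, box, scale=1.5):
    """Center a square window on the box, scaled up by `scale`.
    PIL pads out-of-bounds regions with black."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half = max(x1 - x0, y1 - y0) * scale / 2
    return image.crop((int(cx - half), int(cy - half),
                       int(cx + half), int(cy + half)))

@torch.no_grad()
def clip_embed(image: Image.Image, box):
    crop = square_crop(image, box)
    feat = model.encode_image(preprocess(crop).unsqueeze(0))
    return feat / feat.norm(dim=-1, keepdim=True)
```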
PhyscalX commented 6 months ago

Hi, @SunzeY

  1. Recomputing the classes in eval_seg with the method from eval_cls gives a higher COCO mask AP and a similar LVIS mask AP (see the first sketch after this list). This suggests that TAP has already reached saturated classification performance for instance segmentation.
  2. We use ResizeLongestEdge + CenterPaste instead of ResizeShortestEdge + CenterCrop. Cropping removes some foreground context and leads to performance degradation (see the preprocessing sketch after this list).
  3. We generate the CLIP image embeddings using the processing in 2) with background removal (Fig. 7), as SAM does in Sec. D.5. We intuitively expect background context to produce smoother concept distributions. We consider that Alpha-CLIP with masks can generate more appropriate concept distributions, and we look forward to stronger Alpha-CLIP models, e.g. 1B/5B parameters trained on larger datasets. 😄
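For point 1, a rough illustration of what recomputing classes could look like, with ViTDet reduced to a pure proposal source and each proposal rescored by TAP's semantic token; `tap_predict` and `text_embeds` are hypothetical placeholders, not the actual eval_cls/eval_seg API:

```python
# Pseudocode sketch: rescore ViTDet proposals with TAP's semantic token.
# tap_predict and text_embeds are hypothetical placeholders.
def rescore_proposals(image, proposal_boxes, tap_model, text_embeds):
    results = []
    for box in proposal_boxes:              # ViTDet class scores are ignored
        mask, sem_token = tap_predict(tap_model, image, box)
        logits = sem_token @ text_embeds.T  # CLIP-style zero-shot classifier
        results.append((mask, int(logits.argmax()), float(logits.max())))
    return results
```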
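And a minimal sketch of the preprocessing in points 2 and 3, assuming numpy and PIL; the function names and the gray fill color are assumptions, not the exact TAP implementation:

```python
# Sketch of background removal (Fig. 7) + ResizeLongestEdge + CenterPaste.
# Assumes numpy and PIL; fill color is an assumption.
import numpy as np
from PIL import Image

def remove_background(image: Image.Image, mask: np.ndarray,
                      fill=(127, 127, 127)):
    """Blank out everything outside the (boolean HxW) ground-truth mask."""
    arr = np.array(image)
    arr[~mask] = fill
    return Image.fromarray(arr)

def resize_longest_paste(image: Image.Image, size=224,
                         fill=(127, 127, 127)):
    """ResizeLongestEdge + CenterPaste: scale so the longest edge fits,
    then paste centered onto a square canvas. Unlike CenterCrop, no
    foreground context is discarded."""
    w, h = image.size
    scale = size / max(w, h)
    resized = image.resize((round(w * scale), round(h * scale)),
                           Image.BICUBIC)
    canvas = Image.new("RGB", (size, size), fill)
    canvas.paste(resized, ((size - resized.width) // 2,
                           (size - resized.height) // 2))
    return canvas
```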
SunzeY commented 6 months ago

Thanks for the reply :) This is really wonderful work for the research community!