How to recognize 22,000 classes?

HarborYuan / ovsam

[ECCV 2024] The official code of paper "Open-Vocabulary SAM".

https://www.mmlab-ntu.com/project/ovsam


How to recognize 22,000 classes? #12

Closed David-19940718 closed 8 months ago

David-19940718 commented 8 months ago

Hi, thank you for your valuable contribution!

I appreciate your work on the ovsam model. In your paper, you mentioned that the model can currently segment and recognize around 22,000 classes. However, when I tested the example provided in the demo, it appears that only approximately 1,000 classes can be recognized. I noticed that the names field is defined in this file.

Could you please clarify whether my understanding is correct? If I have misunderstood, kindly point out the correct information. Thank you very much for your clarification.

HarborYuan commented 8 months ago

Hi @David-19940718 ,

Thanks for your interest in our work.

OVSAM is flexible in the number of categories it handles. The number of categories that can be recognized is mainly constrained by two factors: (1) the vocabulary size used during training, and (2) the vocabulary list provided during inference. During inference, you can specify a subset of the training vocabulary to focus on a subset of the trained categories. We have tried different training scales with different vocabulary sizes (please refer to page 10 of the paper). The demo checkpoint is trained on V3Det + LVIS. For the demo, we provide only the LVIS vocabulary list during inference, since the larger vocabulary contains some categories that are less easily recognized by humans. You can define your own vocabulary list based on your interests. We will also release more checkpoints and more training code in the future.
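To illustrate why the inference vocabulary list bounds what can be recognized, here is a minimal sketch of open-vocabulary classification restricted to a subset. It is not the OVSAM implementation: the function names, the `allowed` parameter, and the 3-dimensional toy embeddings are all assumptions made for illustration. The idea is only that a region embedding is matched against the text embeddings of whichever class names are supplied.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(region_embed, vocab_embeds, allowed=None):
    """Return the best-matching class name.

    vocab_embeds: dict mapping class name -> text embedding (toy values here).
    allowed: optional list of names restricting the inference vocabulary.
    """
    names = allowed if allowed is not None else list(vocab_embeds)
    scores = {name: cosine(region_embed, vocab_embeds[name]) for name in names}
    return max(scores, key=scores.get)


# Toy "training vocabulary" (hypothetical embeddings, not real CLIP outputs).
vocab = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "axolotl": [0.6, 0.0, 0.8],
}
query = [0.7, 0.1, 0.7]

full_vocab_label = classify(query, vocab)                       # "axolotl"
subset_label = classify(query, vocab, allowed=["cat", "dog"])   # "cat"
print(full_vocab_label, subset_label)
```

With the full toy vocabulary the query matches "axolotl"; restricting the list to "cat" and "dog" forces the closest remaining name, which mirrors how the demo only ever outputs LVIS names even though the checkpoint was trained on a larger vocabulary.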

Hope this can help and please let me know if you have any other questions.

hanoonaR commented 8 months ago

Hi @HarborYuan,

Thank you for sharing your great work. I have a related question. How can we modify the vocabulary at inference to a subset of the trained categories?

Should CLIP text embeddings of the corresponding class labels in the subset vocabulary be computed to replace the value of cls_embed in this line? If so, what text prompts are used to generate the embeddings, e.g., "A photo of a <category>" or "<category>"?

Thank you.

HarborYuan commented 8 months ago

Hi @hanoonaR ,

Thanks for your interest in our work. I have read several of your previous works and they are very enlightening.

To modify the vocabulary, you need to generate embeddings with the gen_cls tool. Currently, it supports generating a vocabulary set from the class names of mmdet datasets. For example, to generate embeddings from the COCO class names, you can run the following command:

bash tools/dist.sh gen_cls seg/configs/ovsam/ovsam_coco_rn50x16_point.py 8

The script will read the COCO label list from the dataset here and the language encoder here.

If you want a customized label list, you can define a dataset that inherits from the COCO dataset but uses your own class names.
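A rough sketch of what such a subclass could look like is given below. It assumes the mmdet 3.x convention of declaring class names in a `METAINFO` dictionary on the dataset class; whether gen_cls reads the names from exactly this attribute is an assumption, not something confirmed in this thread. The class name `MyVocabDataset` and the three labels are hypothetical. The `try`/`except` fallback is only there so the snippet runs without mmdet installed.

```python
try:
    # Real base class and registry from mmdet 3.x, if available.
    from mmdet.datasets import CocoDataset
    from mmdet.registry import DATASETS

    register = DATASETS.register_module()
except ImportError:
    # Stand-in so this sketch runs without mmdet (illustration only).
    class CocoDataset:
        METAINFO = {"classes": ("person", "bicycle", "car")}

    def register(cls):
        return cls


@register
class MyVocabDataset(CocoDataset):
    """COCO-style dataset with a custom label list (hypothetical names)."""

    METAINFO = {
        "classes": ("traffic cone", "forklift", "pallet"),
    }


print(MyVocabDataset.METAINFO["classes"])
```

A config would then presumably need to point its dataset `type` at the new class before gen_cls is run, in the same way the COCO config points at the COCO dataset.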

Hope this helps. Feel free to let me know if you have any other questions.