Hi @synsin0,
Yes, in the demo we use ov-seg (MaskFormer) only to produce mask proposals and leave its class predictions unused. The reason is, as you also noticed, that performance gets worse when we use them. We conjecture this is because the open-vocabulary classifier of ov-seg (MaskFormer) is trained on COCO-171, so it fits those 171 classes and cannot handle the more diverse cases in the demo.
If you want to use only the ov-seg (MaskFormer) class predictions (the "MaskFormer only" results in Table 5), you may want to set CLIP_ENSEMBLE to False, as in here.
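To make the effect of that flag concrete, here is a minimal sketch of how a switch like CLIP_ENSEMBLE could select between the two classifiers for a single mask proposal. It is not the repository's code: the function names, the `weight` parameter, and the geometric-mean fusion are assumptions made for illustration, and the scores are made up.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_class_scores(mask_probs, clip_probs, clip_ensemble=True, weight=1.0):
    """Hypothetical helper: pick or fuse per-class probabilities for one mask.

    mask_probs    -- class probabilities from the mask model's own classifier
    clip_probs    -- class probabilities from CLIP on the masked region
    clip_ensemble -- False: use mask_probs only; True: bring in CLIP
    weight        -- assumed fusion weight in [0, 1]; 1.0 means CLIP only
    """
    if not clip_ensemble:
        # "MaskFormer only": the CLIP scores are ignored entirely.
        return list(mask_probs)
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    # Weighted geometric mean of the two distributions (an assumption here).
    fused = [(m ** (1.0 - weight)) * (c ** weight)
             for m, c in zip(mask_probs, clip_probs)]
    total = sum(fused)
    return [f / total for f in fused]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Made-up scores for one proposal over three classes.
mask_probs = softmax([4.0, 1.0, 0.5])   # mask classifier prefers class 0
clip_probs = softmax([0.5, 3.0, 1.0])   # CLIP prefers class 1

label_mask_only = argmax(combine_class_scores(mask_probs, clip_probs, clip_ensemble=False))
label_clip_only = argmax(combine_class_scores(mask_probs, clip_probs, clip_ensemble=True, weight=1.0))
```

With these toy numbers the two settings pick different classes for the same mask, which is the kind of disagreement that can make the output look very different once CLIP is switched off.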
I'm closing this issue; feel free to reopen it if you have further questions.
Thanks for your great work. I see that in the demo config the masks come from ov-seg, but the classification depends entirely on CLIP (L486: # only clip model predictions are used). From Table 5, I understand that either feature (from ov-seg or from CLIP) can be used for classification. However, when I set clip_ensemble to False, the predicted image becomes completely wrong. Does ov-seg only produce mask proposals for the CLIP adapter in the demo? How can I use only the ov-seg features for mask classification?