RozDavid / LanguageGroundedSemseg

Implementation for ECCV 2022 paper Language-Grounded Indoor 3D Semantic Segmentation in the Wild
The result of clip only #8

Closed by yhyang-myron 1 year ago

yhyang-myron commented 1 year ago

Hi, I see there is a "CLIP only" result of 27.73 mIoU. How was this result obtained? Is it the result of the pretraining stage alone, or were the fine-tuning methods added as well? Thanks a lot!

RozDavid commented 1 year ago

Hey @yhyang-myron,

I understand your confusion; maybe I didn't phrase it properly in the paper. "CLIP only" refers to the pretraining stage, where we used the text features for anchoring, followed by the standard finetuning method with an unweighted cross-entropy loss. So the "only" part means we didn't use the class-balanced focal loss or the instance sampling for tail categories.
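
To make the loss difference concrete, here is a minimal pure-Python sketch for a single point. It is an illustration only, not the repository's code. The function names, the effective-number weighting `(1 - beta) / (1 - beta^n_c)`, and the `beta` and `gamma` defaults are assumptions, and the formulation used in the paper may differ.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Unweighted cross entropy for one point: -log p_target."""
    return -math.log(softmax(logits)[target])

def class_balanced_focal(logits, target, class_counts, beta=0.999, gamma=2.0):
    """Focal loss scaled by an effective-number class weight (assumed form)."""
    p_t = softmax(logits)[target]
    # Rare classes (small count) receive a larger weight.
    weight = (1.0 - beta) / (1.0 - beta ** class_counts[target])
    # The focal term down-weights points that are already classified well.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)
```

In this sketch, setting `gamma=0.0` with a class count of 1 gives a weight of 1, so the second function reduces to the plain cross entropy that the "CLIP only" setting uses.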

Hope this clears it up, but let me know if you have any more questions!

Cheers, David

yhyang-myron commented 1 year ago

I see, thank you!