Closed · SuleBai closed this issue 7 months ago
Hi, thanks for your great work!
While looking through the code, I found that in the inference phase the CLIP image encoder is forwarded twice. Is this a bug, or is there a reason it is forwarded twice?
https://github.com/KU-CVLAB/CAT-Seg/blob/3062d4abda7884f35ff8650784c882b225783978/cat_seg/cat_seg_model.py#L202
https://github.com/KU-CVLAB/CAT-Seg/blob/3062d4abda7884f35ff8650784c882b225783978/cat_seg/cat_seg_model.py#L205
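To illustrate what I mean, here is a rough sketch of the pattern, with placeholder names rather than the actual CAT-Seg identifiers, assuming a CLIP-style model with an `encode_image` method:

```python
import torch

@torch.no_grad()
def extract_features_redundant(clip_model, images):
    # The pattern being asked about: the image encoder runs
    # twice on the same batch during inference.
    feats_a = clip_model.encode_image(images)  # first forward pass
    feats_b = clip_model.encode_image(images)  # second, redundant forward pass
    return feats_a, feats_b

@torch.no_grad()
def extract_features_cached(clip_model, images):
    # The obvious fix: forward once and reuse the features
    # everywhere they are needed.
    feats = clip_model.encode_image(images)
    return feats, feats
```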
Besides, am I right that the main difference between the CVPR version and the earlier arXiv version is that you removed the additional Swin backbone and managed to fine-tune the CLIP text encoder?

Yes, this is a bug, and we have fixed it; it seems we missed it while refactoring our code. You are also right about the main differences. One further difference is the fine-tuning methodology: we selectively fine-tune a few layers within the attention layers rather than the full module.
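A minimal sketch of what this kind of selective fine-tuning can look like in PyTorch; the keyword-based selection and parameter names are illustrative assumptions, not the exact layers CAT-Seg tunes:

```python
import torch.nn as nn

def freeze_except(model: nn.Module, keywords=("attn",)):
    """Freeze every parameter except those whose name contains one of
    the given keywords.

    The keyword choice is an illustrative assumption; which parameters
    are matched depends on the CLIP implementation's naming (e.g.
    nn.MultiheadAttention packs Q/K/V into a single in_proj_weight).
    """
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keywords)
    # Hand only the trainable parameters to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch (hypothetical model and hyperparameters):
# trainable = freeze_except(clip_model, keywords=("attn",))
# optimizer = torch.optim.AdamW(trainable, lr=2e-6)
```

The usual motivation for updating only a small subset of attention parameters is to keep most of the pretrained CLIP weights intact, preserving its open-vocabulary alignment while still adapting the encoder to the segmentation task.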