Closed. Hoantrbl closed this issue 2 months ago.
There might be some confusion here; sorry for not making this clearer in the paper. There are 3 settings involved here:
For 1 and 2, I used ViT-L/14, while for 3, I used ViT-B/16, which is what E-CLIP was using back then. I haven't followed the recent works in this line, so I'm not sure what EventBind uses. But the results you showed in Table 6 definitely use ViT-B/16.
Thanks a lot for your prompt response! I think this must be a mistake in EventBind (ECCV 2024). In its main experiments, they claim that you use ViT-L/14 for the fine-tuning experiments.
I dug out some wandb logs. The results in the red box are the 20-shot fine-tuning results using ViT-B-16 (as you can see in the run names), which achieve ~38.28 accuracy. The run in the blue box uses the larger ViT-L/14, which achieves much higher accuracy than ViT-B/16.
Yeah, could be. I don't have results that fine-tune ViT-L/14 on NIN though, so I'm not sure how EventCLIP would perform.
Okay! I have also raised an issue to ask the authors of EventBind. Thanks again for your helpful response!
Of course. Feel free to reopen if you have further questions.
Thanks for your solid work. I'm a little confused about which backbone is used.
In your paper, under "Our Implementation Details", you say that you use the ViT-L/14 image encoder. Is this only for the zero-shot setting?
However, for the fine-tuning experiments, you say you use ViT-B/16.
Is it correct that ViT-L/14 is used for zero-shot and ViT-B/16 for fine-tuning?
This matters because the recent EventBind paper, which reports surpassing your work's performance, states that you use ViT-L/14 for fine-tuning.
You can refer to https://github.com/jiazhou-garland/EventBind/issues/4#issuecomment-2306309917. I have also asked the authors of EventBind the same question.