Wuziyi616 / EventCLIP

Code release for paper EventCLIP: Adapting CLIP for Event-based Object Recognition
MIT License

Question about the backbone #5

Closed Hoantrbl closed 2 months ago

Hoantrbl commented 2 months ago

Thanks for your solid work. I'm a little confused about the usage of the backbone.

In your paper, you state that you utilized the ViT-L/14 image encoder in the "Our Implementation Details" section. Is this only for the zero-shot setting?

[screenshot: implementation details section of the EventCLIP paper]

However, for the fine-tuning experiments, you say you utilized ViT-B/16.

[screenshot: fine-tuning experiment description in the EventCLIP paper]

So is it correct that you used ViT-L/14 for zero-shot and ViT-B/16 for fine-tuning?

This is important because the latest EventBind paper, which reports surpassing your performance, states that you used ViT-L/14 for fine-tuning.

[screenshot: Table 6 of the EventBind paper]

You can refer to https://github.com/jiazhou-garland/EventBind/issues/4#issuecomment-2306309917, where I have asked the authors of EventBind the same question.

Wuziyi616 commented 2 months ago

There might be some confusion here; sorry for not making it clearer in the paper. There are 3 settings involved:

  1. Zero-shot classification, where we don't train anything and just run model inference
  2. Few-shot classification, where we only train the added feature adapters
  3. Fine-tuning, where we train both the feature adapter and the image encoder (i.e., the backbone)

For 1 and 2, I was using ViT-L/14, while for 3 I used ViT-B/16, which is what E-CLIP was using back then. I haven't followed the recent works in this line, so I'm not sure what EventBind uses. But the results you showed in Table 6 are definitely from ViT-B/16.
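To make the mapping concrete, here's a minimal sketch using OpenAI's `clip` package (the adapter below is a hypothetical stand-in for EventCLIP's actual feature adapter module, not our released code):

```python
import clip   # pip install git+https://github.com/openai/CLIP.git
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Settings 1 & 2 (zero-shot / few-shot) use the larger ViT-L/14 backbone.
model, preprocess = clip.load("ViT-L/14", device=device)

# Setting 1: zero-shot -- nothing is trained, the whole model stays frozen.
for p in model.parameters():
    p.requires_grad = False

# Setting 2: few-shot -- CLIP stays frozen, only an added feature adapter
# is trained (hypothetical adapter; ViT-L/14 image embeddings are 768-d).
adapter = torch.nn.Sequential(
    torch.nn.Linear(768, 768), torch.nn.ReLU(), torch.nn.Linear(768, 768)
).to(device)
trainable_params = list(adapter.parameters())

# Setting 3: fine-tuning -- switch to ViT-B/16 (matching E-CLIP) and
# unfreeze the image encoder too (ViT-B/16 embeddings are 512-d).
model, preprocess = clip.load("ViT-B/16", device=device)
for p in model.visual.parameters():
    p.requires_grad = True
adapter = torch.nn.Linear(512, 512).to(device)
trainable_params = list(adapter.parameters()) + list(model.visual.parameters())
```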

Hoantrbl commented 2 months ago

Thanks a lot for your quick response! I think this must be an error in EventBind (ECCV 2024). In its main experiments, they claimed that you used ViT-L/14 for the fine-tuning experiments.

[screenshot: EventBind's claim that EventCLIP fine-tunes ViT-L/14]
Wuziyi616 commented 2 months ago

I tried to dig out some wandb logs. The results in the red box are the 20-shot fine-tuning results using ViT-B/16 (as you can see in the run names); they achieve ~38.28% accuracy. The run in the blue box uses the larger ViT-L/14, which achieves much higher accuracy than ViT-B/16.

[screenshot: wandb logs, with the ViT-B/16 20-shot fine-tuning runs boxed in red and the ViT-L/14 run boxed in blue]
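(If anyone wants to do this kind of lookup themselves, here's a hedged sketch using the public wandb API; the project path and metric key are hypothetical placeholders, not the actual run config:)

```python
import wandb  # pip install wandb

api = wandb.Api()
# Hypothetical project path and summary key -- substitute your own.
for run in api.runs("wuziyi616/eventclip"):
    # Filter runs by the backbone name embedded in the run name.
    if "ViT-B-16" in run.name or "ViT-L-14" in run.name:
        print(run.name, run.summary.get("test_acc"))
```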

Wuziyi616 commented 2 months ago

> Thanks a lot for your quick response! I think this must be an error in EventBind (ECCV 2024). In its main experiments, they claimed that you used ViT-L/14 for the fine-tuning experiments.

Yeah, could be. I don't have results fine-tuning ViT-L/14 on NIN though, so I'm not sure how EventCLIP would perform.

Hoantrbl commented 2 months ago

> > Thanks a lot for your quick response! I think this must be an error in EventBind (ECCV 2024). In its main experiments, they claimed that you used ViT-L/14 for the fine-tuning experiments.
>
> Yeah, could be. I don't have results fine-tuning ViT-L/14 on NIN though, so I'm not sure how EventCLIP would perform.

Okay! I have also raised an issue asking the authors of EventBind. Thanks again for your positive response!

Wuziyi616 commented 2 months ago

Of course. Feel free to reopen if you have further questions.