GeWu-Lab / BML_TPAMI2024

The repo for "On-the-fly Modulation for Balanced Multimodal Learning", T-PAMI 2024

A Question on UCF101 Accuracy #1

Open hubaak opened 2 weeks ago

hubaak commented 2 weeks ago

In the paper, an ImageNet-pretrained ResNet18 achieves a score of 77.2 with only the RGB modality. However, there is no code for UCF101 in the repo. I tried to train a ResNet18 according to the settings in the paper, and its accuracy was 0.43 with (batch_size, lr, epoch) = (32, 1e-3, 800), so I'm confused by such a large performance gap. Could you provide some implementation details or the code for UCF101? BTW, a 3D ResNet18 with a lot of tricks scores 74.1 in https://arxiv.org/pdf/2103.05905v2, so I think it's a little weird for a ResNet18 with only the RGB modality to reach that performance so easily.

echo0409 commented 1 week ago

Thank you for the question.

Here are our settings: batch size = 64, lr = 1e-4, scheduler = step_LR, step = 40, decay_ratio = 0.1, optimizer = sgd, weight_decay = 1e-4
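For anyone reproducing these settings, the learning-rate schedule they imply can be sketched as below. This is only an illustration under assumptions: it takes "step = 40" to mean a step size of 40 epochs and "decay_ratio" to be the multiplicative factor, which matches how `torch.optim.lr_scheduler.StepLR` computes its rate. The function name `step_lr` is hypothetical and not from the repo, and SGD momentum is not stated in the thread, so it is left out.

```python
def step_lr(epoch, base_lr=1e-4, step_size=40, gamma=0.1):
    """Learning rate at a given epoch under step decay.

    Assumption: the rate is multiplied by `gamma` once every
    `step_size` epochs, starting from `base_lr` at epoch 0.
    """
    return base_lr * gamma ** (epoch // step_size)


# Print the rate just before and just after each decay boundary.
for epoch in (0, 39, 40, 79, 80):
    print(epoch, step_lr(epoch))
```

Under this reading the rate stays at 1e-4 for the first 40 epochs and then shrinks by a factor of 10 every 40 epochs, which is far lower than the 1e-3 used in the original attempt.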

We use an ImageNet-pretrained ResNet18 as the backbone. For the RGB modality, we evenly pick 3 frames from each sample. For the optical flow modality, we stack the horizontal component u and the vertical component v as [u, v, u] to form one three-channel frame, and select 3 frames in total.
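The frame selection and flow stacking described above could look roughly like the sketch below. It is a guess at the preprocessing, not the authors' code: "evenly pick" is assumed to mean taking the midpoint of each of 3 equal segments, the channel-first layout is assumed because it is the usual PyTorch convention, and the helper names `evenly_spaced_indices` and `stack_flow` are hypothetical.

```python
import numpy as np


def evenly_spaced_indices(num_frames, k=3):
    """Pick k frame indices spread evenly over a clip.

    Assumption: the clip is split into k equal segments and the
    midpoint of each segment is taken.
    """
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def stack_flow(u, v):
    """Stack flow components as [u, v, u] into one 3-channel frame.

    u, v: 2-D arrays of shape (H, W) holding the horizontal and
    vertical flow components. Returns an array of shape (3, H, W).
    """
    return np.stack([u, v, u], axis=0)


# Usage on dummy data: a 30-frame clip with 4x5 flow fields.
indices = evenly_spaced_indices(30)
u = np.zeros((4, 5), dtype=np.float32)
v = np.ones((4, 5), dtype=np.float32)
flow_frame = stack_flow(u, v)
print(indices, flow_frame.shape)
```

Repeating u in the third channel gives the flow input the same three-channel shape as an RGB frame, so the same ImageNet-pretrained first convolution can be reused without modification.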

hubaak commented 1 week ago

Thanks a lot for providing your settings! I'll try again with them.