GenjiB / LAVISH

Vision Transformers are Parameter-Efficient Audio-Visual Learners

Can't reproduce the AVE accuracy reported in the paper with vit_base (75.3%) #17

Open Lecooo opened 1 year ago

Lecooo commented 1 year ago

Hi, we used the config below to train the AVE task on a 3090, with the processed data you provided, but the accuracy we got is 73.31%.

python3 /code/AVE/main_trans.py --Adapter_downsample=8 --batch_size=4 --early_stop=5 --epochs=50 --is_audio_adapter_p1=1 --is_audio_adapter_p2=1 --is_audio_adapter_p3=0 --is_before_layernorm=1 --is_bn=1 --is_fusion_before=1 --is_gate=1 --is_post_layernorm=1 --is_vit_ln=0 --lr=5e-06 --lr_mlp=4e-06 --mode=train --num_conv_group=2 --num_tokens=2 --num_workers=8 --is_multimodal=1 --vis_encoder_type=vit

Also, the code at https://github.com/GenjiB/LAVISH/blob/97722b0424e8dd44659f447fe8731c675fa98da8/AVE/nets/net_trans.py#L435 is not used in forward_swin; running it there causes a shape-mismatch error.
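For context, here is a minimal sketch of the kind of guard that avoids such a mismatch; the function name and shape assumptions are hypothetical and not the repository's actual code:

```python
import torch

# Hypothetical illustration (not the repository's code): only apply the
# ViT-style token concatenation when the visual features are actually a
# (batch, tokens, channels) sequence; otherwise skip it, since fusing a
# differently shaped Swin feature map this way raises a shape error.
def maybe_prepend_audio_tokens(audio_tokens: torch.Tensor,
                               visual_feats: torch.Tensor,
                               vis_encoder_type: str) -> torch.Tensor:
    if vis_encoder_type == "vit":
        return torch.cat((audio_tokens, visual_feats), dim=1)
    return visual_feats
```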

GenjiB commented 1 year ago

Thanks for pointing that out. Can you try these hyper-parameters? I used different parameters for ViT and Swin:

--batch_size=2 --early_stop=5 --epochs=50 --is_audio_adapter_p1=1 --is_audio_adapter_p2=1 --is_audio_adapter_p3=0 --is_before_layernorm=0 --is_bn=0 --is_fusion_before=1 --is_gate=1 --is_post_layernorm=0 --is_vit_ln=1 --lr=3e-05 --lr_mlp=6e-06 --mode=train --model=MMIL_Net --num_conv_group=4 --num_tokens=8

Lecooo commented 1 year ago

Thanks for your reply. I tried the hyper-parameters you provided, and the accuracy reached 75.2%. Interestingly, the total parameter count under this setting is 105.5M, which is less than the 107.2M reported in your paper.
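For anyone comparing against the paper's 107.2M, a minimal sketch of how such a parameter count is typically measured in PyTorch (the model construction below is just a placeholder, not the repository's API):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Usage (how the network is built is assumed here):
# model = build_model(args)
# total, trainable = count_parameters(model)
# print(f"total: {total / 1e6:.1f}M, trainable: {trainable / 1e6:.1f}M")
```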

praveena2j commented 6 months ago

@Lecooo May I know how many epochs it took to achieve this result? Thanks.