GenjiB / LAVISH

Vision Transformers are Parameter-Efficient Audio-Visual Learners
Weird results using swin-L, could not get 81.1 of AVE #25

Open Blissy-32 opened 2 months ago

Blissy-32 commented 2 months ago

I have tried three times. The result gets close to 79.4 after 3 or 4 epochs, then declines gradually even though the loss keeps declining at the same time, and I don't know why. Here is my config:

python3 main_trans.py --Adapter_downsample=8 --batch_size=2 --early_stop=5 --epochs=50 --is_audio_adapter_p1=1 --is_audio_adapter_p2=1 --is_audio_adapter_p3=0 --is_before_layernorm=1 --is_bn=1 --is_fusion_before=1 --is_gate=1 --is_post_layernorm=1 --is_vit_ln=0 --lr=5e-05 --lr_mlp=4e-06 --mode=train --num_conv_group=2 --num_tokens=2 --num_workers=16 -- --is_multimodal=1 --vis_encoder_type=swin

Also, --model_save_dir does not work.
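When accuracy peaks early and then declines while the training loss keeps falling, it may help to report the best epoch rather than the last one. Below is a minimal sketch of patience-based early stopping that tracks the best validation accuracy. The function name and the accuracy values are hypothetical and are not taken from main_trans.py; how the repository's own --early_stop flag behaves is not confirmed here.

```python
from typing import Iterable, Optional, Tuple


def best_epoch_with_early_stop(
    val_accs: Iterable[float], patience: int
) -> Tuple[Optional[int], float]:
    """Return (best_epoch, best_acc), stopping after `patience` epochs without improvement.

    `val_accs` yields one validation accuracy per epoch (epochs are 1-indexed).
    This is a hypothetical helper for illustration, not code from the LAVISH repo.
    """
    best_acc = float("-inf")
    best_epoch: Optional[int] = None
    stale = 0  # consecutive epochs without a new best accuracy

    for epoch, acc in enumerate(val_accs, start=1):
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0
            # In a real training loop, the checkpoint would be saved here.
        else:
            stale += 1
            if stale >= patience:
                break  # stop training; the best checkpoint is already saved

    return best_epoch, best_acc


# Made-up accuracies that peak at epoch 3 and then decline.
history = [0.50, 0.70, 0.80, 0.75, 0.74, 0.73]
print(best_epoch_with_early_stop(history, patience=2))
```

With this pattern the reported number is the peak validation accuracy, and later epochs that overfit do not overwrite the saved checkpoint.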

HHH123333 commented 3 weeks ago

Have you solved it yet? The best result I got was 79.23%, and adjusting parameters such as batch_size did not improve it.

GenjiB commented 3 weeks ago

Can you try the v2 script? I improved the reproducibility in that version. I guess the dataset is a bit small, so randomness may impact the results.
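To reduce run-to-run variance when comparing configurations, the random seeds can be fixed at the start of training. The sketch below is a generic pattern, not the code used in the v2 script; `seed_everything` is a hypothetical helper name, and NumPy and PyTorch are seeded only if they are installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the common random number generators (hypothetical helper, not from the repo)."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        # Prefer deterministic cuDNN kernels over auto-tuned ones.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(0)
first = [random.random() for _ in range(3)]
seed_everything(0)
second = [random.random() for _ in range(3)]
print(first == second)  # the two sequences match after reseeding
```

Even with fixed seeds, some GPU operations and multi-worker data loading can remain nondeterministic, so averaging over several seeds gives a fairer comparison on a small dataset.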