Open A91A981E opened 2 days ago
Thanks for your outstanding work. But I have a few questions about your code.

- In Fig. 4 of your paper, the LSTM encoders (i.e. `AtCAF.uni_acoustic_enc` and `AtCAF.uni_visual_enc`) take the data as inputs for unimodal embedding, but in `model.py` this operation comes after the fusion operation. Is that correct? And what are `AtCAF.uni_vision_encoder` and `AtCAF.uni_audio_encoder` used for? They are not shown in your paper.
- What are the hyperparameters set in your experiments? I have run a grid search over all the values you provide in Sec. 4.4 on the ur_funny dataset, but the highest accuracy was `0.7062`. The namespace is:

```
Namespace(f='', dataset='ur_funny', data_path='datasets', iid_setting=False, ood_setting=False, seven_class=False, npy_path='npy_folder', dropout_a=0.1, dropout_v=0.1, dropout_prj=0.1, use_kmean=True, audio_kmean_size=50, text_kmean_size=50, vision_kmean_size=50, whether_debias_unimodal=True, whether_debias_audio=False, whether_debias_text=True, whether_debias_vision=False, audio_debias_layers=3, vision_debias_layers=3, text_debias_layers=1, attn_dropout_debias=0.1, audio_mlp_hidden_size=32, vision_mlp_hidden_size=32, text_mlp_hidden_size=32, whether_use_counterfactual=True, counterfactual_attention_type='uniform', num_layers_counterfactual_attention=2, model_dim_self=30, num_heads_self=5, num_layers_self=1, attn_dropout_self=0.1, model_dim_cross=30, num_heads_cross=5, num_layers_cross=2, attn_dropout_cross=0.1, relu_dropout=0.1, res_dropout=0.1, attn_mask=True, embed_dropout=0.25, vonly=True, aonly=True, lonly=True, multiseed=False, contrast=True, add_va=False, n_layer=1, cpc_layers=1, d_vh=32, d_ah=32, d_vout=32, d_aout=32, bidirectional=False, d_prjh=64, pretrain_emb=768, mem_size=1, mmilb_mid_activation='ReLU', mmilb_last_activation='Tanh', cpc_activation='Tanh', batch_size=64, clip=1.0, lr_main=0.001, lr_bert=5e-05, lr_mmilb=0.001, alpha=0.0, beta=0.0, eta=0.4, weight_decay_main=0.0001, weight_decay_bert=1e-06, weight_decay_club=0.0001, optim='Adam', num_epochs=40, when=20, patience=5, update_batch=1, log_interval=100, seed=1111, n_train=7614, n_valid=980, n_test=994, word2id=None, d_tin=768, d_vin=371, d_ain=81, data='ur_funny', n_class=2, criterion='CrossEntropyLoss')
```
The best epoch was 3, with the following evaluation results:

```
Confusion Matrix (pos/neg):
[[363 141]
 [151 339]]

Classification Report (pos/neg):
              precision    recall  f1-score   support
           0    0.70623   0.72024   0.71316       504
           1    0.70625   0.69184   0.69897       490
    accuracy                        0.70624       994
   macro avg    0.70624   0.70604   0.70607       994
weighted avg    0.70624   0.70624   0.70617       994

Accuracy (pos/neg): 0.7062374245472837
```
Oh, I have changed `pos_y` and `neg_y` in `modules.encoders.MMILB.forward` to:

```python
pos_y = y[labels.squeeze() == 1]
neg_y = y[labels.squeeze() == 0]
```
Would it be possible for you to provide the detailed hyperparameters?
Thank you for your attention to our work.
1. You are right. The LSTM output in the unimodal branch is not used for the fusion result, and there are some discrepancies with the text on this point. In the implementation, the output of the LSTM is used not for modal fusion but for the auxiliary tasks. `AtCAF.uni_audio_encoder` and `AtCAF.uni_vision_encoder` are two transformers that map the audio/vision embeddings during the feature extraction procedure.
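Very roughly, the split between the two branches looks like the sketch below (a PyTorch illustration only, not the actual code in `model.py`; all module and variable names here are placeholders):

```python
import torch
import torch.nn as nn

# Illustrative sketch only -- not the repository's model.py. The transformer
# encoder feeds the fusion path, while the LSTM branch only produces a
# unimodal summary consumed by the auxiliary objectives.
class UnimodalBranches(nn.Module):
    def __init__(self, in_dim=81, d_model=30, d_hidden=16):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        self.self_attn_enc = nn.TransformerEncoder(            # fusion branch
            nn.TransformerEncoderLayer(d_model, nhead=5, batch_first=True),
            num_layers=1,
        )
        self.lstm_enc = nn.LSTM(in_dim, d_hidden, batch_first=True)  # auxiliary branch

    def forward(self, x):                                # x: (batch, seq, in_dim)
        fused_input = self.self_attn_enc(self.proj(x))   # goes on to cross-modal fusion
        _, (h, _) = self.lstm_enc(x)
        aux_summary = h[-1]                              # used only by auxiliary losses
        return fused_input, aux_summary

branches = UnimodalBranches()
fused_input, aux_summary = branches(torch.randn(4, 20, 81))
print(fused_input.shape, aux_summary.shape)   # torch.Size([4, 20, 30]) torch.Size([4, 16])
```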
2. Here is the hyper-parameter setting for the ur_funny dataset:

```python
{'f': '', 'dataset': 'ur_funny', 'data_path': 'datasets', 'iid_setting': False, 'ood_setting': False, 'seven_class': False, 'npy_path': 'npy_folder', 'dropout_a': 0.1, 'dropout_v': 0.1, 'dropout_prj': 0.1, 'use_kmean': True, 'kmean_size': 50, 'audio_kmean_size': 50, 'text_kmean_size': 50, 'vision_kmean_size': 50, 'whether_debias_unimodal': True, 'whether_debias_audio': False, 'whether_debias_text': True, 'whether_debias_vision': False, 'audio_debias_layers': 3, 'vision_debias_layers': 3, 'text_debias_layers': 3, 'attn_dropout_debias': 0.1, 'audio_mlp_hidden_size': 32, 'vision_mlp_hidden_size': 16, 'text_mlp_hidden_size': 128, 'whether_use_counterfactual': True, 'whether_use_counterfactual_ta': True, 'whether_use_counterfactual_tv': True, 'counterfactual_attention_type': 'random', 'num_layers_counterfactual_attention': 2, 'model_dim_self': 30, 'num_heads_self': 5, 'num_layers_self': 3, 'attn_dropout_self': 0.1, 'model_dim_cross': 30, 'num_heads_cross': 5, 'num_layers_cross': 6, 'attn_dropout_cross': 0.0, 'relu_dropout': 0.1, 'res_dropout': 0.0, 'attn_mask': True, 'embed_dropout': 0.1, 'vonly': True, 'aonly': True, 'lonly': True, 'multiseed': False, 'contrast': True, 'add_va': True, 'n_layer': 1, 'cpc_layers': 1, 'd_vh': 16, 'd_ah': 16, 'd_vout': 16, 'd_aout': 16, 'bidirectional': False, 'd_prjh': 128, 'pretrain_emb': 768, 'mem_size': 3, 'mmilb_mid_activation': 'ReLU', 'mmilb_last_activation': 'Tanh', 'cpc_activation': 'Tanh', 'batch_size': 64, 'clip': 7.0, 'lr_main': 8.057482200029674e-06, 'lr_bert': 2.97165693318459e-06, 'lr_mmilb': 0.00014583572596242226, 'alpha': 0.025, 'beta': 0.4, 'eta': 0.1, 'weight_decay_main': 0.0001, 'weight_decay_bert': 0.0001, 'weight_decay_club': 0.0001, 'optim': 'Adam', 'num_epochs': 40, 'when': 20, 'patience': 2, 'update_batch': 1, 'log_interval': 100, 'seed': 1111, 'n_train': 7614, 'n_valid': 980, 'n_test': 994, 'word2id': None, 'd_tin': 768, 'd_vin': 371, 'd_ain': 81, 'data': 'ur_funny', 'n_class': 2, 'criterion': 'CrossEntropyLoss'}
```
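If it helps for reproducing the run, one way to apply these values is to turn the dict into an `argparse.Namespace` before launching training. This is only a sketch with a few keys repeated for brevity, and the commented-out trainer call is a hypothetical placeholder for however you normally start a run:

```python
from argparse import Namespace

# Sketch only: rebuild the hyper-parameter namespace from the dict above
# (only a subset of keys is repeated here for brevity).
hp = {'dataset': 'ur_funny', 'batch_size': 64, 'clip': 7.0,
      'lr_main': 8.057482200029674e-06, 'lr_bert': 2.97165693318459e-06,
      'lr_mmilb': 0.00014583572596242226, 'alpha': 0.025, 'beta': 0.4, 'eta': 0.1}
args = Namespace(**hp)

print(args.lr_main)           # 8.057482200029674e-06
# solver = Solver(args, ...)  # hypothetical: pass `args` to your usual training entry point
```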
We recommend adjusting the learning rate or other parameters as needed, since there may be slight differences in training behaviour between package versions. Once again, thank you for your attention to our work, and I wish you success in your research.
So the actual unimodal encoders for the visual and acoustic modalities are self-attention modules? I get it. But I am still confused about what the auxiliary tasks are trying to optimize. Here is my understanding: for the visual and acoustic modalities, the LSTM takes the original data as input, and it seems to be a branch parallel to the SelfAttn unimodal encoder plus the fusion module. Then the CPC loss is calculated, trying to preserve the SelfAttn unimodal information within the fused feature by reconstructing `x_pred` and comparing it with the LSTM unimodal feature. Is there any assumption about the features from SelfAttn and the LSTM respectively, or what exactly are these auxiliary tasks trying to optimize?
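To make sure I am reading it right, here is my understanding written as a minimal CPC-style sketch (purely illustrative, not your code; `fusion_feat`, `lstm_feat` and the projection are placeholders I made up):

```python
import torch
import torch.nn as nn

# Illustrative only: a contrastive-predictive-coding style score, the way I
# understand the auxiliary objective. All names are placeholders, not AtCAF code.
batch, d_fused, d_uni = 64, 128, 16
fusion_feat = torch.randn(batch, d_fused)    # fused multimodal representation
lstm_feat = torch.randn(batch, d_uni)        # LSTM unimodal summary

proj = nn.Linear(d_fused, d_uni)             # predict the unimodal feature from the fused one
x_pred = proj(fusion_feat)                   # "reconstructed" unimodal feature

# InfoNCE: the matching (fused, unimodal) pair should score higher than
# mismatched pairs from the same batch.
scores = x_pred @ lstm_feat.t()              # (batch, batch) similarity matrix
labels = torch.arange(batch)
nce_loss = nn.functional.cross_entropy(scores, labels)
print(nce_loss.item())
```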
Besides, I set these values according to your reply, but the result is `0.6670`. Are there still any places to modify? For example, in `solver.Solver.train_and_eval.train`, `batch_data` is expected to provide 14 values, while only 13 are returned by `data_loader.get_loader.collate_fn`.
Your understanding is correct. For the auxiliary task, we do not introduce anything new; you can refer to MMIM, since our code is also modified from the MMIM codebase. You can find the answers in that paper and its repository.

Actually, we find that the XLNet text encoder is very sensitive to the specific dataset, so you had better widen the grid-search space and make more attempts; see the rough sketch after this reply. Once again, thank you for your attention to our work, and I wish you success in your research.
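For example, a wider sweep over the learning rates could look something like this (illustrative only; `run_training` is a placeholder, not a function from this repository):

```python
import itertools

# Illustrative grid-search sketch. `run_training` is hypothetical: substitute
# your own call that trains once with the given learning rates and returns
# the test accuracy.
def run_training(lr_main, lr_bert):
    return 0.0  # placeholder

lr_main_grid = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 1e-3]
lr_bert_grid = [1e-6, 3e-6, 1e-5, 5e-5]

best_cfg, best_acc = None, -1.0
for lr_main, lr_bert in itertools.product(lr_main_grid, lr_bert_grid):
    acc = run_training(lr_main=lr_main, lr_bert=lr_bert)
    if acc > best_acc:
        best_cfg, best_acc = (lr_main, lr_bert), acc

print("best:", best_cfg, best_acc)
```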
Thanks for your reply.