Open A91A981E opened 2 days ago
Thanks for your outstanding work. But I have a few questions about your code.

- In Fig. 4 of your paper, the LSTM encoders (i.e. `AtCAF.uni_acoustic_enc` and `AtCAF.uni_visual_enc`) take the data as inputs for unimodal embedding, but in `model.py` this operation comes after the fusion operation. Is that correct? And what are `AtCAF.uni_vision_encoder` and `AtCAF.uni_audio_encoder` used for? They are not shown in your paper.
- What are the hyperparameters set in your experiments? I have run a grid search over all the values you provide in Sec. 4.4 on the ur_funny dataset, but the highest accuracy was `0.7062`. The namespace is:

```
Namespace(f='', dataset='ur_funny', data_path='datasets', iid_setting=False, ood_setting=False, seven_class=False, npy_path='npy_folder', dropout_a=0.1, dropout_v=0.1, dropout_prj=0.1, use_kmean=True, audio_kmean_size=50, text_kmean_size=50, vision_kmean_size=50, whether_debias_unimodal=True, whether_debias_audio=False, whether_debias_text=True, whether_debias_vision=False, audio_debias_layers=3, vision_debias_layers=3, text_debias_layers=1, attn_dropout_debias=0.1, audio_mlp_hidden_size=32, vision_mlp_hidden_size=32, text_mlp_hidden_size=32, whether_use_counterfactual=True, counterfactual_attention_type='uniform', num_layers_counterfactual_attention=2, model_dim_self=30, num_heads_self=5, num_layers_self=1, attn_dropout_self=0.1, model_dim_cross=30, num_heads_cross=5, num_layers_cross=2, attn_dropout_cross=0.1, relu_dropout=0.1, res_dropout=0.1, attn_mask=True, embed_dropout=0.25, vonly=True, aonly=True, lonly=True, multiseed=False, contrast=True, add_va=False, n_layer=1, cpc_layers=1, d_vh=32, d_ah=32, d_vout=32, d_aout=32, bidirectional=False, d_prjh=64, pretrain_emb=768, mem_size=1, mmilb_mid_activation='ReLU', mmilb_last_activation='Tanh', cpc_activation='Tanh', batch_size=64, clip=1.0, lr_main=0.001, lr_bert=5e-05, lr_mmilb=0.001, alpha=0.0, beta=0.0, eta=0.4, weight_decay_main=0.0001, weight_decay_bert=1e-06, weight_decay_club=0.0001, optim='Adam', num_epochs=40, when=20, patience=5, update_batch=1, log_interval=100, seed=1111, n_train=7614, n_valid=980, n_test=994, word2id=None, d_tin=768, d_vin=371, d_ain=81, data='ur_funny', n_class=2, criterion='CrossEntropyLoss')
```
The best epoch was 3, with the following evaluation results:

```
Confusion Matrix (pos/neg):
[[363 141]
 [151 339]]

Classification Report (pos/neg):
              precision    recall  f1-score   support
           0    0.70623   0.72024   0.71316       504
           1    0.70625   0.69184   0.69897       490
    accuracy                        0.70624       994
   macro avg    0.70624   0.70604   0.70607       994
weighted avg    0.70624   0.70624   0.70617       994

Accuracy (pos/neg): 0.7062374245472837
```
Oh, I have changed `pos_y` and `neg_y` in `modules.encoders.MMILB.forward` to:

```python
pos_y = y[labels.squeeze() == 1]
neg_y = y[labels.squeeze() == 0]
```
Would it be possible for you to provide the detailed hyperparameters?
Thank you for your attention to our work.
1. You are right. The LSTM output in the unimodal branch is not used for the fusion result, and there are some discrepancies with the text on this point. In the implementation, the output of the LSTM is used not for modal fusion but for the auxiliary tasks. `AtCAF.uni_audio_encoder` and `AtCAF.uni_vision_encoder` are two transformers that map the audio/vision embeddings during the feature extraction procedure.
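Very roughly, the split between the two branches looks like the sketch below (a PyTorch illustration only, not the actual code in `model.py`; all module and variable names here are placeholders):

```python
import torch
import torch.nn as nn

# Illustrative sketch only -- not the repository's model.py. The transformer
# encoder feeds the fusion path, while the LSTM branch only produces a
# unimodal summary consumed by the auxiliary objectives.
class UnimodalBranches(nn.Module):
    def __init__(self, in_dim=81, d_model=30, d_hidden=16):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        self.self_attn_enc = nn.TransformerEncoder(            # fusion branch
            nn.TransformerEncoderLayer(d_model, nhead=5, batch_first=True),
            num_layers=1,
        )
        self.lstm_enc = nn.LSTM(in_dim, d_hidden, batch_first=True)  # auxiliary branch

    def forward(self, x):                                # x: (batch, seq, in_dim)
        fused_input = self.self_attn_enc(self.proj(x))   # goes on to cross-modal fusion
        _, (h, _) = self.lstm_enc(x)
        aux_summary = h[-1]                              # used only by auxiliary losses
        return fused_input, aux_summary

branches = UnimodalBranches()
fused_input, aux_summary = branches(torch.randn(4, 20, 81))
print(fused_input.shape, aux_summary.shape)   # torch.Size([4, 20, 30]) torch.Size([4, 16])
```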
2. Here is the hyper-parameter setting for the ur_funny dataset:

```python
{'f': '', 'dataset': 'ur_funny', 'data_path': 'datasets', 'iid_setting': False, 'ood_setting': False, 'seven_class': False, 'npy_path': 'npy_folder', 'dropout_a': 0.1, 'dropout_v': 0.1, 'dropout_prj': 0.1, 'use_kmean': True, 'kmean_size': 50, 'audio_kmean_size': 50, 'text_kmean_size': 50, 'vision_kmean_size': 50, 'whether_debias_unimodal': True, 'whether_debias_audio': False, 'whether_debias_text': True, 'whether_debias_vision': False, 'audio_debias_layers': 3, 'vision_debias_layers': 3, 'text_debias_layers': 3, 'attn_dropout_debias': 0.1, 'audio_mlp_hidden_size': 32, 'vision_mlp_hidden_size': 16, 'text_mlp_hidden_size': 128, 'whether_use_counterfactual': True, 'whether_use_counterfactual_ta': True, 'whether_use_counterfactual_tv': True, 'counterfactual_attention_type': 'random', 'num_layers_counterfactual_attention': 2, 'model_dim_self': 30, 'num_heads_self': 5, 'num_layers_self': 3, 'attn_dropout_self': 0.1, 'model_dim_cross': 30, 'num_heads_cross': 5, 'num_layers_cross': 6, 'attn_dropout_cross': 0.0, 'relu_dropout': 0.1, 'res_dropout': 0.0, 'attn_mask': True, 'embed_dropout': 0.1, 'vonly': True, 'aonly': True, 'lonly': True, 'multiseed': False, 'contrast': True, 'add_va': True, 'n_layer': 1, 'cpc_layers': 1, 'd_vh': 16, 'd_ah': 16, 'd_vout': 16, 'd_aout': 16, 'bidirectional': False, 'd_prjh': 128, 'pretrain_emb': 768, 'mem_size': 3, 'mmilb_mid_activation': 'ReLU', 'mmilb_last_activation': 'Tanh', 'cpc_activation': 'Tanh', 'batch_size': 64, 'clip': 7.0, 'lr_main': 8.057482200029674e-06, 'lr_bert': 2.97165693318459e-06, 'lr_mmilb': 0.00014583572596242226, 'alpha': 0.025, 'beta': 0.4, 'eta': 0.1, 'weight_decay_main': 0.0001, 'weight_decay_bert': 0.0001, 'weight_decay_club': 0.0001, 'optim': 'Adam', 'num_epochs': 40, 'when': 20, 'patience': 2, 'update_batch': 1, 'log_interval': 100, 'seed': 1111, 'n_train': 7614, 'n_valid': 980, 'n_test': 994, 'word2id': None, 'd_tin': 768, 'd_vin': 371, 'd_ain': 81, 'data': 'ur_funny', 'n_class': 2, 'criterion': 'CrossEntropyLoss'}
```
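If it helps for reproducing the run, one way to apply these values is to turn the dict into an `argparse.Namespace` before launching training. This is only a sketch with a few keys repeated for brevity, and the commented-out trainer call is a hypothetical placeholder for however you normally start a run:

```python
from argparse import Namespace

# Sketch only: rebuild the hyper-parameter namespace from the dict above
# (only a subset of keys is repeated here for brevity).
hp = {'dataset': 'ur_funny', 'batch_size': 64, 'clip': 7.0,
      'lr_main': 8.057482200029674e-06, 'lr_bert': 2.97165693318459e-06,
      'lr_mmilb': 0.00014583572596242226, 'alpha': 0.025, 'beta': 0.4, 'eta': 0.1}
args = Namespace(**hp)

print(args.lr_main)           # 8.057482200029674e-06
# solver = Solver(args, ...)  # hypothetical: pass `args` to your usual training entry point
```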
We recommend adjusting the learning rate or other parameters as needed, since there may be slight differences in training behaviour between package versions. Once again, thank you for your attention to our work, and I wish you success in your research.
So the actual unimodal encoders for the visual and acoustic modalities are self-attention modules? I get it. But I am still confused about what the auxiliary tasks are trying to optimize. Here is my understanding: for the visual and acoustic modalities, the LSTM takes the original data as input, and it seems to be a branch parallel to the SelfAttn unimodal encoder plus the fusion module. Then the CPC loss is calculated, trying to preserve the SelfAttn unimodal information within the fused feature by reconstructing `x_pred` and comparing it with the LSTM unimodal feature. Is there any assumption about the features from SelfAttn and the LSTM respectively, or what exactly are these auxiliary tasks trying to optimize?
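To make sure I am reading it right, here is my understanding written as a minimal CPC-style sketch (purely illustrative, not your code; `fusion_feat`, `lstm_feat` and the projection are placeholders I made up):

```python
import torch
import torch.nn as nn

# Illustrative only: a contrastive-predictive-coding style score, the way I
# understand the auxiliary objective. All names are placeholders, not AtCAF code.
batch, d_fused, d_uni = 64, 128, 16
fusion_feat = torch.randn(batch, d_fused)    # fused multimodal representation
lstm_feat = torch.randn(batch, d_uni)        # LSTM unimodal summary

proj = nn.Linear(d_fused, d_uni)             # predict the unimodal feature from the fused one
x_pred = proj(fusion_feat)                   # "reconstructed" unimodal feature

# InfoNCE: the matching (fused, unimodal) pair should score higher than
# mismatched pairs from the same batch.
scores = x_pred @ lstm_feat.t()              # (batch, batch) similarity matrix
labels = torch.arange(batch)
nce_loss = nn.functional.cross_entropy(scores, labels)
print(nce_loss.item())
```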
Besides, I set these values according to your reply, but the result is `0.6670`. Are there still any places to modify? For example, in `solver.Solver.train_and_eval.train`, `batch_data` is expected to provide 14 values, while only 13 are returned by `data_loader.get_loader.collate_fn`.
Your understanding is correct. For the auxiliary task, we do not introduce anything new; you can refer to MMIM, since our code is also modified from the MMIM codebase. You can find the answers in that paper and its repository.

Actually, we find that the XLNet text encoder is very sensitive to the specific dataset, so you had better widen the grid-search space and make more attempts; see the rough sketch after this reply. Once again, thank you for your attention to our work, and I wish you success in your research.
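For example, a wider sweep over the learning rates could look something like this (illustrative only; `run_training` is a placeholder, not a function from this repository):

```python
import itertools

# Illustrative grid-search sketch. `run_training` is hypothetical: substitute
# your own call that trains once with the given learning rates and returns
# the test accuracy.
def run_training(lr_main, lr_bert):
    return 0.0  # placeholder

lr_main_grid = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 1e-3]
lr_bert_grid = [1e-6, 3e-6, 1e-5, 5e-5]

best_cfg, best_acc = None, -1.0
for lr_main, lr_bert in itertools.product(lr_main_grid, lr_bert_grid):
    acc = run_training(lr_main=lr_main, lr_bert=lr_bert)
    if acc > best_acc:
        best_cfg, best_acc = (lr_main, lr_bert), acc

print("best:", best_cfg, best_acc)
```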
Thanks for your reply.