WasifurRahman / BERT_multimodal_transformer


baseline using only visual or acoustic features. #13

Closed AiliAili closed 3 years ago

AiliAili commented 3 years ago

Thank you very much for sharing this great work.

I am wondering whether you tried baselines based solely on visual or acoustic features. I adopted the TFN structure to get a baseline over visual features, but the results are no better than random guessing.

Could you please point out possible issues with my usage of the visual or acoustic features?

Thank you.

RE-N-Y commented 3 years ago

There is a wide range of reasons why multimodal models perform poorly. From personal experience, in multimodal models the pre-trained NLP model does the majority of the heavy lifting, while the visual/acoustic modalities provide "complementary" information not present in the NLP model.

I do not believe we tried baselines with visual/acoustic features alone, but I suspect a few possible issues.

  1. Feature engineering

In MOSI/MOSEI, the visual/acoustic features often contain highly correlated columns and all-zero columns, so it may be worthwhile to put time into preprocessing and dimensionality reduction (see the sketch after this list). The repository provides the MOSI/MOSEI datasets with our preprocessing applied to ensure consistency, so it might be worth checking out.

  2. Simple models

Before attempting TFN, did you try classical models such as SVM, XGBoost, or linear regression? I'd first check the performance of these algorithms to see whether the poor results come from the model or from the quality of the visual/acoustic features (a baseline sketch follows this list).

To elaborate on TFN, I believe its authors reported baselines on the acoustic/visual modalities, and even the SVM baseline performed significantly better than random guessing. So my recommendation is to run an SVM baseline first to rule out issues with the modeling code.
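To make point 1 concrete, here is a minimal preprocessing sketch, assuming the visual or acoustic features arrive as a `(num_samples, num_features)` NumPy array; the variance threshold and PCA dimensionality are illustrative choices, not values from this repository:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

def preprocess_features(X, n_components=20):
    """Drop (near-)constant columns, standardize, and reduce dimensionality.

    X: (num_samples, num_features) array of visual or acoustic features.
    n_components: target dimensionality (illustrative; tune per dataset).
    """
    # Remove zero/near-zero variance columns, which are common in
    # raw visual/acoustic feature dumps.
    X = VarianceThreshold(threshold=1e-6).fit_transform(X)

    # Standardize so PCA is not dominated by large-scale features.
    X = StandardScaler().fit_transform(X)

    # Decorrelate and compress the remaining features with PCA.
    n_components = min(n_components, X.shape[1])
    return PCA(n_components=n_components).fit_transform(X)
```

In practice you would fit the threshold, scaler, and PCA on the training split only and apply them to the test split, to avoid leakage.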
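And for point 2, a sketch of the kind of classical baseline suggested above, assuming per-frame features and continuous sentiment labels as in MOSI; mean-pooling over time and the SVR hyperparameters are assumptions for illustration, not settings from the paper:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

def svm_baseline(X, y):
    """X: (num_samples, seq_len, num_features) per-frame features;
    y: (num_samples,) continuous sentiment scores (e.g. in [-3, 3])."""
    # Mean-pool over the time axis to get one vector per utterance
    # (a deliberately simple aggregation for a sanity-check baseline).
    X_pooled = X.mean(axis=1)

    X_train, X_test, y_train, y_test = train_test_split(
        X_pooled, y, test_size=0.2, random_state=42)

    model = SVR(kernel="rbf", C=1.0)  # illustrative hyperparameters
    model.fit(X_train, y_train)

    preds = model.predict(X_test)
    print("MAE:", mean_absolute_error(y_test, preds))
    # If this MAE is no better than predicting the mean label, the
    # problem likely lies in the features, not the downstream model.
```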

A long reply, but I hope it helps!

AiliAili commented 3 years ago

Thanks for your suggestions. I will try them out and check why this happens.

Cheers.