This is the official implementation of Provable Dynamic Fusion for Low-Quality Multimodal Data (ICML 2023) by Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou and Xi Peng.
pip install -r requirements.txt
Text-Image Classification:
Step 1: Download food101 and MVSA_Single and put them in the folder datasets.
Step 2: Prepare the train/dev/test split jsonl files. We follow the MMBT settings and provide them in the corresponding folders.
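Each split is a jsonl file in the MMBT style, one JSON object per line. As a minimal sketch of how such a file can be loaded (the field names "text", "img" and "label" are assumptions here and may differ from the actual files):

```python
import json

def load_jsonl_split(path):
    """Load an MMBT-style split file: one JSON object per line.

    The field names ("text", "img", "label") are assumed for
    illustration and may differ from the actual split files.
    """
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            examples.append(json.loads(line))
    return examples
```

A loader like this returns a plain list of dicts that a Dataset class can index into.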
Step 3 (optional): If you want to use GloVe embeddings for the BoW model, download glove.840B.300d.txt and put it in the folder datasets/glove_embeds. For the BERT model, download bert-base-uncased (Google Drive Link) and put it in the root folder bert-base-uncased/.
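Each line of glove.840B.300d.txt is a token followed by 300 floats; since some tokens in the 840B vocabulary contain spaces, splitting from the right is the safe way to parse a line. A hedged sketch (the repo's own loader may differ):

```python
def parse_glove_line(line, dim=300):
    """Parse one line of a GloVe text file: a token followed by `dim` floats.

    Tokens in glove.840B.300d.txt can themselves contain spaces, so we
    split from the right to keep exactly `dim` values as the vector.
    This is an illustrative parser, not the repo's exact loader.
    """
    parts = line.rstrip("\n").rsplit(" ", dim)
    token, values = parts[0], parts[1:]
    return token, [float(v) for v in values]

# Example with a toy 3-dimensional line.
token, vec = parse_glove_line("hello 0.1 0.2 0.3", dim=3)
print(token, vec)  # hello [0.1, 0.2, 0.3]
```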
RGBD Scene Recognition:
Step 1: Download NYUD2 and SUNRGBD and put them in the folder datasets.
Feel free to use Baidu Netdisk for food101, MVSA_Single, NYUD2 and SUNRGBD.
We provide the trained models at Baidu Netdisk.
Pretrained bert model at Baidu Netdisk.
We use the official PyTorch pretrained resnet18 in the RGB-D classification tasks, which can be downloaded from this link.
Note: shell scripts for reference are provided in the folder shells.
To run our method on benchmark datasets:
python train_qmf.py --batch_sz 16 --gradient_accumulation_steps 40 \
--savedir ./saved/$task --name $name --data_path ./datasets/ \
--task $task --task_type $task_type --model $model --num_image_embeds 3 \
--freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
To run tmc:
python train_tmc.py --batch_sz 16 --gradient_accumulation_steps 40 \
--savedir ./saved/$task --name $name --data_path ./datasets/ \
--task $task --task_type $task_type --model $model --num_image_embeds 3 \
--freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
To run Others:
python train.py --batch_sz 16 --gradient_accumulation_steps 40 \
--savedir ./saved/$task --name $name --data_path ./datasets/ \
--task $task --task_type $task_type --model $model --num_image_embeds 3 \
--freeze_txt 5 --freeze_img 3 --patience 5 --dropout 0.1 --lr 5e-05 --warmup 0.1 --max_epochs 100 --seed $i --df true --noise 0.0
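The --noise flag in the commands above sets the severity of test-time corruption used to probe robustness. As a rough, stdlib-only illustration of the idea (this is not the repo's exact corruption code), zero-mean Gaussian noise at a given level can be mixed into a feature vector like this:

```python
import random

def add_gaussian_noise(features, noise_level, seed=None):
    """Return a copy of `features` with zero-mean Gaussian noise added.

    `noise_level` plays the role of the --noise argument: 0.0 leaves
    the input unchanged. Illustrative sketch only; the repo may apply
    corruption differently (e.g. to raw images or text tokens).
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_level) for x in features]

clean = [1.0, 2.0, 3.0]
print(add_gaussian_noise(clean, 0.0))  # noise level 0 keeps features intact
```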
If our QMF or the idea of dynamic multimodal fusion is helpful in your research, please consider citing our paper:
@inproceedings{zhang2023provable,
title={Provable Dynamic Fusion for Low-Quality Multimodal Data},
author={Zhang, Qingyang and Wu, Haitao and Zhang, Changqing and Hu, Qinghua and Fu, Huazhu and Zhou, Joey Tianyi and Peng, Xi},
booktitle={International Conference on Machine Learning},
year={2023}
}
The code is inspired by TMC: Trusted Multi-View Classification and Confidence-Aware Learning for Deep Neural Networks.
There are many interesting works related to this paper.
For any additional questions, feel free to email qingyangzhang@tju.edu.cn.