Documentation | Colab Notebook | Blog Post
A toolkit for incorporating multimodal data on top of text data for classification and regression tasks. It uses HuggingFace transformers as the base model for text features. The toolkit adds a combining module that takes the outputs of the transformer in addition to categorical and numerical features to produce rich multimodal features for downstream classification/regression layers. Given a pretrained transformer, the parameters of the combining module and transformer are trained based on the supervised task. For a brief literature review, check out the accompanying blog post on Georgian's Impact Blog.
The code was developed in Python 3.7 with PyTorch and Transformers 4.26.1.
The multimodal-specific code is in the multimodal_transformers folder.
pip install multimodal-transformers
The following Hugging Face Transformers are supported to handle tabular data. See the documentation here.
This repository also includes two Kaggle datasets that contain text data and rich tabular features.
To quickly see these models in action on one of the above datasets with a preset configuration, run:
$ python main.py ./datasets/Melbourne_Airbnb_Open_Data/train_config.json
Or, if you prefer command-line arguments, run:
$ python main.py \
--output_dir=./logs/test \
--task=classification \
--combine_feat_method=individual_mlps_on_cat_and_numerical_feats_then_concat \
--do_train \
--model_name_or_path=distilbert-base-uncased \
--data_path=./datasets/Womens_Clothing_E-Commerce_Reviews \
--column_info_path=./datasets/Womens_Clothing_E-Commerce_Reviews/column_info.json
main.py expects a JSON file detailing which columns in a dataset contain text, categorical, or numerical input features. It also expects a path to the folder where the data is stored as train.csv and test.csv (and, if given, val.csv). For more details on the arguments, see multimodal_exp_args.py.
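As a rough illustration of the two inputs described above, the sketch below builds a toy data folder and checks that every column named in a column-info file appears in train.csv. The key names (text_cols, cat_cols, num_cols, label_col) and all column names are assumptions made for this example; consult the column_info.json files shipped under ./datasets for the schema main.py actually expects.

```python
import csv
import json
import tempfile
from pathlib import Path

# Assumed key names and toy column names, for illustration only.
column_info = {
    "text_cols": ["title", "review"],
    "cat_cols": ["department"],
    "num_cols": ["age", "rating"],
    "label_col": "recommended",
}


def missing_columns(data_dir: Path, info: dict) -> list:
    """Return columns named in `info` that are absent from train.csv's header."""
    with open(data_dir / "train.csv", newline="") as f:
        header = set(next(csv.reader(f)))
    wanted = info["text_cols"] + info["cat_cols"] + info["num_cols"] + [info["label_col"]]
    return [col for col in wanted if col not in header]


# Build a tiny toy dataset folder: train.csv and test.csv (val.csv is optional).
data_dir = Path(tempfile.mkdtemp())
for split in ("train.csv", "test.csv"):
    with open(data_dir / split, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "review", "department", "age", "rating", "recommended"])
        writer.writerow(["Great", "Fits well", "Dresses", 34, 5, 1])
(data_dir / "column_info.json").write_text(json.dumps(column_info, indent=2))

print(missing_columns(data_dir, column_info))  # prints []
```

Running a check like this before training can catch a misspelled column name early.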
To see the modules come together in a notebook, open the Colab Notebook linked at the top of this page.
combine feat method | description | requires both cat and num features |
---|---|---|
text_only | Uses just the text columns as processed by a HuggingFace transformer before final classifier layer(s). Essentially equivalent to HuggingFace's ForSequenceClassification models | False |
concat | Concatenate transformer output, numerical feats, and categorical feats all at once before final classifier layer(s) | False |
mlp_on_categorical_then_concat | MLP on categorical feats then concat transformer output, numerical feats, and processed categorical feats before final classifier layer(s) | False (Requires cat feats) |
individual_mlps_on_cat_and_numerical_feats_then_concat | Separate MLPs on categorical feats and numerical feats then concatenation of transformer output, with processed numerical feats, and processed categorical feats before final classifier layer(s). | False |
mlp_on_concatenated_cat_and_numerical_feats_then_concat | MLP on concatenated categorical and numerical feat then concatenated with transformer output before final classifier layer(s) | True |
attention_on_cat_and_numerical_feats | Attention based summation of transformer outputs, numerical feats, and categorical feats queried by transformer outputs before final classifier layer(s). | False |
gating_on_cat_and_num_feats_then_sum | Gated summation of transformer outputs, numerical feats, and categorical feats before final classifier layer(s). Inspired by Integrating Multimodal Information in Large Pretrained Transformers which performs the mechanism for each token. | False |
weighted_feature_sum_on_transformer_cat_and_numerical_feats | Learnable weighted feature-wise sum of transformer outputs, numerical feats and categorical feats for each feature dimension before final classifier layer(s) | False |
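
To make the table more concrete, here is a minimal PyTorch sketch of the idea behind individual_mlps_on_cat_and_numerical_feats_then_concat: separate MLPs process the categorical and numerical features, and their outputs are concatenated with the transformer output before a final classifier layer. This is not the toolkit's own code. The class name, layer sizes, and random stand-in inputs are all assumptions chosen for illustration.

```python
import torch
from torch import nn


class IndividualMlpsThenConcat(nn.Module):
    """Illustrative combiner: MLP per tabular modality, then concatenation."""

    def __init__(self, text_dim, cat_dim, num_dim, hidden_dim=32, num_labels=2):
        super().__init__()
        # One small MLP for each tabular modality (hypothetical sizes).
        self.cat_mlp = nn.Sequential(nn.Linear(cat_dim, hidden_dim), nn.ReLU())
        self.num_mlp = nn.Sequential(nn.Linear(num_dim, hidden_dim), nn.ReLU())
        # The classifier sees text features plus both processed tabular features.
        self.classifier = nn.Linear(text_dim + 2 * hidden_dim, num_labels)

    def forward(self, text_feats, cat_feats, num_feats):
        combined = torch.cat(
            [text_feats, self.cat_mlp(cat_feats), self.num_mlp(num_feats)], dim=1
        )
        return self.classifier(combined)


torch.manual_seed(0)
batch_size = 4
text_feats = torch.randn(batch_size, 768)  # stand-in for the transformer output
cat_feats = torch.randn(batch_size, 10)    # stand-in for encoded categorical features
num_feats = torch.randn(batch_size, 3)     # stand-in for numerical features

model = IndividualMlpsThenConcat(text_dim=768, cat_dim=10, num_dim=3)
logits = model(text_feats, cat_feats, num_feats)
print(tuple(logits.shape))  # prints (4, 2)
```

Dropping the two MLPs and concatenating the raw features directly would correspond to the simpler concat row of the table.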
In practice, a strong baseline is to take the categorical and numerical features as they are, tokenize them, and concatenate them to the text columns as extra text sentences. To do that here, specify all the categorical and numerical columns as text columns and set combine_feat_method to text_only. For example, for each of the included sample datasets in ./datasets, in train_config.json change combine_feat_method to text_only and column_info_path to ./datasets/{dataset}/column_info_all_text.json.
In the experiments below, this baseline corresponds to Combine Feat Method being unimodal.
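The configuration change described above can be sketched in a few lines of Python. The helper name make_all_text_config and the toy config dictionary are hypothetical; only the two keys and the column_info_all_text.json path pattern come from the text above, and a real train_config.json will contain additional settings.

```python
import json


def make_all_text_config(config: dict, dataset: str) -> dict:
    """Return a copy of `config` switched to the all-text baseline."""
    updated = dict(config)  # leave the caller's config untouched
    updated["combine_feat_method"] = "text_only"
    updated["column_info_path"] = f"./datasets/{dataset}/column_info_all_text.json"
    return updated


# Toy stand-in for the contents of a train_config.json file.
config = {
    "combine_feat_method": "gating_on_cat_and_num_feats_then_sum",
    "column_info_path": "./datasets/Melbourne_Airbnb_Open_Data/column_info.json",
}

baseline = make_all_text_config(config, "Melbourne_Airbnb_Open_Data")
print(json.dumps(baseline, indent=2))
```

To apply this to a real file, load it with json.load, pass the resulting dictionary through the helper, and write the result to a new config file so the original preset stays intact.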
The following tables show results on the included datasets' respective test sets, obtained by running main.py. Parameters not specified are left at their defaults.
Specific training parameters can be seen in datasets/Womens_Clothing_E-Commerce_Reviews/train_config.json.
There are 2 text columns, 3 categorical columns, and 3 numerical columns.
Model | Combine Feat Method | F1 | PR AUC |
---|---|---|---|
Bert Base Uncased | text_only | 0.957 | 0.992 |
Bert Base Uncased | unimodal | 0.968 | 0.995 |
Bert Base Uncased | concat | 0.958 | 0.992 |
Bert Base Uncased | individual_mlps_on_cat_and_numerical_feats_then_concat | 0.959 | 0.992 |
Bert Base Uncased | attention_on_cat_and_numerical_feats | 0.959 | 0.992 |
Bert Base Uncased | gating_on_cat_and_num_feats_then_sum | 0.961 | 0.994 |
Bert Base Uncased | weighted_feature_sum_on_transformer_cat_and_numerical_feats | 0.962 | 0.994 |
Specific training parameters can be seen in datasets/Melbourne_Airbnb_Open_Data/train_config.json.
There are 3 text columns, 74 categorical columns, and 15 numerical columns.
Model | Combine Feat Method | MAE | RMSE |
---|---|---|---|
Bert Base Multilingual Uncased | text_only | 82.74 | 254.0 |
Bert Base Multilingual Uncased | unimodal | 79.34 | 245.2 |
Bert Base Uncased | concat | 65.68 | 239.3 |
Bert Base Multilingual Uncased | individual_mlps_on_cat_and_numerical_feats_then_concat | 66.73 | 237.3 |
Bert Base Multilingual Uncased | attention_on_cat_and_numerical_feats | 74.72 | 246.3 |
Bert Base Multilingual Uncased | gating_on_cat_and_num_feats_then_sum | 66.64 | 237.8 |
Bert Base Multilingual Uncased | weighted_feature_sum_on_transformer_cat_and_numerical_feats | 71.19 | 245.2 |
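
For readers less familiar with the two regression metrics in this table, the sketch below computes MAE and RMSE on made-up numbers. The values are toy data, not drawn from the dataset. It illustrates that RMSE weights large errors more heavily, so a single badly mispredicted target can push RMSE well above MAE.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy targets and predictions: three small errors and one large one.
y_true = [100.0, 150.0, 200.0, 1000.0]
y_pred = [110.0, 140.0, 190.0, 400.0]

print(f"MAE:  {mae(y_true, y_pred):.1f}")   # prints MAE:  157.5
print(f"RMSE: {rmse(y_true, y_pred):.1f}")  # prints RMSE: 300.1
```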
Specific training parameters can be seen in datasets/PetFindermy_Adoption_Prediction
There are 2 text columns, 14 categorical columns, and 5 numerical columns.
Model | Combine Feat Method | F1_macro | F1_micro |
---|---|---|---|
Bert Base Multilingual Uncased | text_only | 0.088 | 0.281 |
Bert Base Multilingual Uncased | unimodal | 0.089 | 0.283 |
Bert Base Uncased | concat | 0.199 | 0.362 |
Bert Base Multilingual Uncased | individual_mlps_on_cat_and_numerical_feats_then_concat | 0.244 | 0.352 |
Bert Base Multilingual Uncased | attention_on_cat_and_numerical_feats | 0.254 | 0.375 |
Bert Base Multilingual Uncased | gating_on_cat_and_num_feats_then_sum | 0.275 | 0.375 |
Bert Base Multilingual Uncased | weighted_feature_sum_on_transformer_cat_and_numerical_feats | 0.266 | 0.380 |
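
The table reports both macro- and micro-averaged F1. As a reminder of how they differ, the sketch below computes both in plain Python on toy labels (not the dataset's). Macro F1 averages the per-class scores equally, while micro F1 pools the counts across classes, so a classifier that mostly predicts the majority class can score much lower on macro F1 than on micro F1.

```python
def f1_macro_micro(y_true, y_pred):
    """Return (macro F1, micro F1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
        tp_total, fp_total, fn_total = tp_total + tp, fp_total + fp, fn_total + fn
    macro = sum(per_class) / len(per_class)
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    return macro, micro


# Toy example: the classifier always predicts the majority class 0.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0]

macro, micro = f1_macro_micro(y_true, y_pred)
print(f"macro F1: {macro:.3f}")  # prints macro F1: 0.267
print(f"micro F1: {micro:.3f}")  # prints micro F1: 0.667
```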
We now have a paper you can cite for the Multimodal-Toolkit.
@inproceedings{gu-budhkar-2021-package,
title = "A Package for Learning on Tabular and Text Data with Transformers",
author = "Gu, Ken and
Budhkar, Akshay",
booktitle = "Proceedings of the Third Workshop on Multimodal Artificial Intelligence",
month = jun,
year = "2021",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.maiworkshop-1.10",
doi = "10.18653/v1/2021.maiworkshop-1.10",
pages = "69--73",
}