Muennighoff / vilio

🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle
https://arxiv.org/abs/2012.07788
MIT License
88 stars 29 forks source link
ernie-vil hateful-memes lxmert oscar transformers uniter vision-and-language vision-transformer visualbert


🥶VILIO🥶


Build GitHub release Transformers Documentation Contributor Covenant

State-of-the-art Visio-Linguistic Models 🥶

## Updates ### 06/2021 - Hateful Memes CSV Files - The CSV files that were used for the scores in the vilio paper are now available here ### 06/2021 - Inference on any meme - Thanks to the initiative by katrinc, here are two notebooks for using Vilio to perform pure inference on any meme you want :) - Just adapt the example input dataset / input model to use a different meme / pretrained model🥶 - GPU: https://www.kaggle.com/muennighoff/vilioexample-nb - CPU: https://www.kaggle.com/muennighoff/vilioexample-nb-cpu ## Ordering Vilio aims to replicate the organization of huggingface's transformer repo at: https://github.com/huggingface/transformers - /bash Shell files to reproduce hateful memes results - /data By default, directory for loading in data & saving checkpoints - /ernie-vil Ernie-vil sub-repository written in PaddlePaddle - /fts_lmdb Scripts for handling .lmdb extracted features - /fts_tsv Scripts for handling .tsv extracted features - /notebooks Jupyter Notebooks for demonstration & reproducibility - /py-bottm-up-attention Sub-repository for tsv feature extraction forked & adapted from [here](https://github.com/airsplay/py-bottom-up-attention) - src/vilio All implemented models (also see below for a quick overview of models) - /utils Pandas & ensembling scripts for data handling - entry.py files Scripts used to access the models and apply model-specific data preparation - pretrain.py files Same purpose as entry files, but for pre-training; Point of entry for pre-training - hm.py Training code for the hateful memes challenge; Main point of entry - param.py Args for running hm.py ## Usage Follow SCORE_REPRO.md for reproducing performance on the Hateful Memes Task.
Follow GETTING_STARTED.md for using the framework for your own task.
See the paper at: https://arxiv.org/abs/2012.07788 ## Architectures 🥶 Vilio currently provides the following architectures with the outlined language transformers: 1. **[E - ERNIE-VIL](https://arxiv.org/abs/2006.16934)** [ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph](https://arxiv.org/abs/2006.16934) - [ERNIE: Enhanced Language Representation with Informative Entities](https://arxiv.org/abs/1905.07129) 1. **[D - DeVLBERT](https://arxiv.org/abs/2008.06884)** [DeVLBert: Learning Deconfounded Visio-Linguistic Representations](https://arxiv.org/abs/2008.06884) - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805) 1. **[O - OSCAR](https://arxiv.org/abs/2004.06165)** [Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks](https://arxiv.org/abs/2004.06165) - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805) 1. **[U - UNITER](https://arxiv.org/abs/1909.11740)** [UNITER: UNiversal Image-TExt Representation Learning](https://arxiv.org/abs/1909.11740) - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805) - [RoBERTa: Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) 1. **[V - VisualBERT](https://arxiv.org/abs/1908.03557)** [VisualBERT: A Simple and Performant Baseline for Vision and Language](https://arxiv.org/abs/1908.03557) - [ALBERT: A Lite BERT](https://arxiv.org/abs/1909.11942) - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805) - [RoBERTa: Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) 1. **[X - LXMERT](https://arxiv.org/abs/1908.07490)** [LXMERT: Learning Cross-Modality Encoder Representations from Transformers](https://arxiv.org/abs/1908.07490) - [ALBERT: A Lite BERT](https://arxiv.org/abs/1909.11942) - [BERT: Bidirectional Transformers](https://arxiv.org/abs/1810.04805) - [RoBERTa: Robustly Optimized BERT Pretraining Approach](https://arxiv.org/abs/1907.11692) ## To-do's - [ ] Clean-up import statements, python paths & find a better way to integrate transformers (Right now all import statements only work if in main folder) - [ ] Enable loading and running models just via import statements (and not having to clone the repo) - [ ] Find a way to better include ERNIE-VIL in this repo (PaddlePaddle to Torch?) - [ ] Move tokenization in entry files to model-specific tokenization similar to transformers ## Attributions The code heavily borrows from the following repositories, thanks for their great work: - https://github.com/huggingface/transformers - https://github.com/facebookresearch/mmf - https://github.com/airsplay/lxmert ## Citation ```bibtex @article{muennighoff2020vilio, title={Vilio: State-of-the-art visio-linguistic models applied to hateful memes}, author={Muennighoff, Niklas}, journal={arXiv preprint arXiv:2012.07788}, year={2020} } ```