PyTorch code for the EMNLP 2020 paper "Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision" (Hao Tan and Mohit Bansal).
Outline
Note: I recommend to focus on "Wiki103" first and ingore the code blocks related to "English Wikipedia". "Eng Wiki" might take too long to complete.
pip install -r requirements.txt
Require python 3.6 + (to support huggingface transformers).
In this module (corresponding to Sec 3.2 of the paper), we want to learn a token-image matching model from sentence-image aligned data (i.e., image captioning data). The model "contextually" measures the relevance between tokens (i.e., words) and images. The terminology "contextual" emphasize the nature that the sentences (the context) are considered when measuring the token-image relevance score.
Download MS COCO images:
# MS COCO (Train 13G, Valid 6G)
mkdir -p data/mscoco
wget http://images.cocodataset.org/zips/train2014.zip -P data/mscoco
wget http://images.cocodataset.org/zips/val2014.zip -P data/mscoco
unzip data/mscoco/train2014.zip -d data/mscoco/images/ && rm data/mscoco/train2014.zip
unzip data/mscoco/val2014.zip -d data/mscoco/images/ && rm data/mscoco/val2014.zip
If you already have COCO image on disk. Save them as
data
|-- mscoco
|-- images
|-- train2014
|-- COCO_train2014_000000000009.jpg
|-- COCO_train2014_000000000025.jpg
|-- ......
|-- val2014
|-- COCO_val2014_000000000042.jpg
|-- ......
Download captions (split following the LXMERT project):
mkdir -p data/lxmert
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
wget https://nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/
The model is trained on MS COCO with pairwise hinge loss (details in Sec. 3.2 of the paper).
Running Commands:
# Run the cross-modal matching model with single-machine multi-processing distributed training
# "0,1" indicates using the GPUs 0 and 1.
# "bert_resnext" is the name of this snapshot and would be saved at snap/xmatching/bert_resnext
# "--visn resnext101_32x8d" is the vision backbone
# "--lang bert" is the langaugae backbone
# Speed: 20 min ~ 30 min / 1 Epoch, 20 Epochs by default.
bash scripts/run_xmatching.bash 0,1 bert_resnext --visn resnext101_32x8d --lang bert
The options --visn
and --lang
specify the architecture of the encoder.
Tested options
--visn $VISN_MODEL
VISN_MODEL={resnet18, resnet34, resnet50, resnet101, resnet152,
wide_resnet50_2, wide_resnet101_2, resnext101_32x8d (default), ...}
--lang $LANG_MODEL
LANG_MODEL={bert, roberta, xlnet, bert-large, ...}
For visual backbones, the models in torchvision are mostly supported. You might need to handle the last FC layer, because it is written differently in different backbones. The language backbones are initialized from huggingface transformers.
We found that the results with XLNet is pretty low but have not identified the reason. Results of other backbones are similar.
The vokenization is a bridge between the cross-modality (words-and-image) matching models (xmatching) and visually-supervised lagnauge models (vlm). The final goal is to convert the language tokens to related images (we called them vokens). These vokens enable the visual supervision of the language model. We mainly provide pr-eprocessing tools (i.e., feature extraction, tokenization, and vokenization) and evaluation tools of previous cross-modal matching models here. Here is a diagram of these processes and we next discuss them one-by-one:
Extracting Image Features-----> Benchmakring the Matching Models (Optional) --> Vokenization
Downloading Language Data --> Tokenization -->-->--/
We provide scripts to get the datasets "wiki103" and "wiki". We would note them as "XX-cased" or "XX-uncased" where the suffix "cased" / "uncased" only indicates the property of the raw text.
bash data/wiki103/get_data_cased.sh
bash data/wiki/get_data_cased.bash en
Note: For RoBERTa, it requires an untokenized version of wiki (o.w. the results would be much lower), so please use the following command:
bash data/wiki/get_data_cased_untokenized.bash en
Note: I recommend to focus on "Wiki103" first and ingore the code blocks related to "English Wikipedia". "Eng Wiki" might take too long to complete.
We next tokenize the language corpus. It would locally save three files: "$dataset_name.$tokenizer_name", "$dataset_name.$tokenizer_name.hdf5", and "$dataset_name.$tokenizer_name.line". Taking the wiki103 dataset and BERT tokenizer as an example, we convert the training file into
data
|-- wiki103-cased
|-- wiki.train.raw.bert-base-uncased
|-- wiki.train.raw.bert-base-uncased.hdf5
|-- wiki.train.raw.bert-base-uncased.line
The txt file wiki.train.raw.bert-base-uncased
saves the tokens and each line in this file is the tokens of a line
in the original file,
The hdf5 file wiki.train.raw.bert-base-uncased.hdf5
stores all the tokens continuously and use
wiki.train.raw.bert-base-uncased.line
to index the starting token index of each line.
The ".line" file has L+1
lines where L
is the number of lines in the original files.
Each line has a range "line[i]" to "line[i+1]" in the hdf5 file.
Commands:
bash tokenization/tokenize_wiki103_bert.bash
bash tokenization/tokenize_wiki_bert.bash
The image pre-processing extracts the image features to build the keys in the vokenization retrieval process.
Since MS COCO images are used in training the cross-modal matching model as in xmatching. We will use the Visual Genome images as candidate vokens for retrievel. We here download the images first.
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip -P data/vg/
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip -P data/vg/
unzip data/vg/images.zip -d data/vg/images && rm data/vg/images.zip
unzip data/vg/images2.zip -d data/vg/images && rm data/vg/images2.zip
cd data/vg/images
mv VG_100K/* .
mv VG_100K_2/* .
rm -rf VG_100K VG_100K_2
cd ../../../
If you already have Visual Genome image on disk. Save them as
data
|-- vg
|-- images
|-- 1000.jpg
|-- 1001.jpg
|-- ......
We first build a list of universal image indexes with
vokenization/create_image_ids.py.
It is used to unify the image ids in different experiments
thus the feature array stored in hdf5 could be universally indexed.
The image ids are saved under a shared path LOCAL_DIR
(default to data/vokenization
)
defined in vokenization/common.py.
The image ids are saved under data/vokenization/images
with format {IMAGE_SET}_ids.txt
.
We will make sure that all the experiments agree with this meta info,
so that we would not get different indexing in different retrieval experiments.
Note: The ids created by create_image_ids.py are only the order of the images. The actual images in the dictionary are provided by
extract_keys.bash
, thus is corresponding to the_paths.txt
, because theextract_keys
will filter all broken images and non-existing images.
Commands:
# Step 1, Build image orders.
python vokenization/create_image_ids.py
Extract image features regarding the list built above, using code
vokenization/extract_vision_keys.py.
The code will first read the image ids saved in data/vokenization/images/{IMAGE_SET}_ids.txt
and locate the images.
The features will be saved under snap/xmatching/bert_resnext/keys/{IMAGE_SET}.hdf5
.
It finishes within 1 hour.
Commands:
# Step 2, Extract features.
# bash scripts/extract_keys.bash $GPU_ID $MODEL_NAME
bash scripts/extract_keys.bash 0 bert_resnext
Before evaluating, please make sure that
extracting_image_features
andtokenization
are completed.
We benchmark the performance of cross-modal matching models from large scale. The evaluation includes two different metrics: diversity and the retrieval performance.
Diversity (in vokenization/evaluate_diversity.py) ensures that the same token type is mapped to diverse images regarding its context (i.e., the sentence). Retrieval (in vokenization/evaluate_retrieval.py) measures the correspondence of the token and the retrieved images.
We gather these two utils into one script and the command here:
bash scripts/xmatching_benchmark.bash 0 bert_resnext
After all these steps, we could start to vokenize the language corpus.
It would load the tokens saved in dataset_name.tokenizer_name.hdf5
and uses the line-split information in dataset_name.tokenzier_name.line
.
The code is optimized and could be continued by just rerunning it.
The vokens will be saved in snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.hdf5
by default.
The file snap/xmatching/bert_resnext/vokens/wiki.train.raw.vg_nococo.ids
contains the universal image ids
for each voken,
e.g., the image id vg_nococo/8
corresponds to 8-th feature
saved in snap/xmatching/bert_resnext/keys/vg_nococo.hdf5
.
Note:
--tokenizer-name
must be provided in the script.
Commands
# Note: mp is the abbreviation for "multi-processing"
# bash scripts/mpvokenize_wiki103.bash $USE_GPUS $SNAP_NAME
bash scripts/mpvokenize_wiki103.bash 0,1,2,3 bert_resnext
# bash scripts/mpvokenize_wiki.bash $USE_GPUS $SNAP_NAME
bash scripts/mpvokenize_wiki.bash 0,1,2,3 bert_resnext
The script will call vokenization/vokenize_corpus_mp.py to vokenize a corpus. The vokenziation happens in vokenization/vokenization.py and it use vokenization/indexing.py to do nearest neighbor search (based on faiss).
As discussed in Sec. 2 of the paper, we use previous generated vokens to pre-train the model with visual supervision.
After the vokenization process of wiki103, we could run the model with command:
# bash scripts/small_vlm_wiki103_glue.bash $GPUs $SNAP_NAME
bash scripts/small_vlm_wiki103.bash 0,1,2,3 wiki103_bert_small
It will call
vlm/run_vlm_distributed.py
and run a BERT-6Layers-512Hiddens model on wiki103
dataset with the support of voken supervisions.
The snapshot will be saved to snap/vlm/wiki103_bert_small
.
We recommend to run this Wiki103 experiment first since it will finish
in a reasonable time (20 hours).
The pure BERT pre-training option is also available later
for comparisons.
Note: defautly, the mixed-precision training is not used. To support the mixed precision pre-training, please install the nvidia/apex library with command:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
After that, you could bring back the option --fp16
and --fp16_opt_level O2
in
the script scripts/small_vlm_wiki103.bash
.
I recommend to use --fp16_opt_level O2
.
Although the option O2 might be unstable,
it saves a lot memory:
the max per-gpu-batch-size is 32 with O1 but 64 with O2.
After the vokenization process of wiki103, we could run the model with command:
# bash scripts/base_vlm_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_vlm_wiki.bash 0,1,2,3 wiki_bert_base
It will run a BERT-12Layers-768Hiddens (same as BERT_BASE) model on the English Wikipedia
dataset with the support of voken supervisions.
The snapshot will be saved to snap/vlm/wiki_bert_base
.
It takes around 3-5 days on 4 Titan V / GTX 2080
and around 5-7 days to finish in 4 Titan Pascal/T4 cards.
(This estimation is accurate since I inevitably run experiments on all these servers...).
Titan V / 2080 / T4 have native support of mixed precision training (triggered by --fp16
option and need
installing apex).
The speed would be much faster.
Titan Pascal would also save some memory with the --fp16
option.
We defautly use the GLUE benchmark
(e.g., SST,
MRPC,
QQP,
MNLI,
QNLI,)
as downstreaming tasks.
Other tasks could be evaluated following the setup here
by changing the option --model_name_or_path
to the correct snapshot path snap/bert/wiki103
.
This downloaindg scrip is copied from huggingface transformers project. Since the transformers is still under dense development, the change of APIs might affect the code. I have upgraded the code compaticability to transformers==3.3.
wget https://raw.githubusercontent.com/huggingface/transformers/master/utils/download_glue_data.py
python download_glue_data.py --data_dir data/glue --tasks all
The pre-trained snapshots are evaluated by fine-tuning them on the GLUE benchmark. The code are modified from the huggingface transformers.
Running GLUE evaluation for snapshots from different epochs:
# bash scripts/run_glue_epochs.bash $GPUS #SNAP_PATH --snaps $NUM_OF_SNAPS
bash scripts/run_glue_epochs.bash 0,1,2,3 snap/vlm/wiki103_bert_small --snaps 7
It will assess 7 snaps using all 0,1,2,3 GPUs.
Setting snaps=-1
will assess all checkpoints.
If you just want to evaluate the last (usually the best) snapshot, please use:
bash scripts/run_glue_epochs.bash 0 snap/vlm/wiki103_bert_small --snaps 1
For all results saved under snap/
(whatever the dir names),
running the folloing command will print out all the results.
python vlm/show_glue_results_epochs.py
It will print results like
snap/vlm/test_finetune/glueepoch_checkpoint-epoch0019
RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI MNLI-MM GLUE
54.51 84.72 87.18 52.32 90.02 88.36 87.16 81.92 82.57 78.75
snap/vlm/bert_6L_512H_wiki103_sharedheadctr_noshuffle/glueepoch_checkpoint-epoch0029
RTE MRPC STS-B CoLA SST-2 QNLI QQP MNLI MNLI-MM GLUE
58.12 82.76 84.45 26.74 89.56 84.40 86.52 77.56 77.99 74.23
We also provide pure language-model pre-training as baselines.
# bash scripts/small_wiki103.bash $GPUs $SNAP_NAME
bash scripts/small_wiki103.bash 0,1,2,3 bert_small
It will call
vlm/run_lm_distributed.py
and run a BERT-6Layers-512Hiddens model on wiki103
dataset with the masked language model only.
The snapshot will be saved to snap/bert/wiki103_bert_small
.
Or you could directly using the script small_wiki103_glue.bash
to
enable GLUE evaluation after finishing pre-training.
bash scripts/small_wiki103_glue.bash 0,1,2,3 bert_small
Command:
# bash scripts/base_wiki.bash $GPUs $SNAP_NAME
bash scripts/base_wiki.bash 0,1,2,3 bert_wiki
With GLUE evaluation:
bash scripts/base_wiki_glue.bash 0,1,2,3 bert_wiki
Wiki103 (100M tokens)
mkdir -p data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.test.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.train.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki103-cased/wiki.valid.raw.bert-base-uncased.hdf5 -P data/wiki103-cased
Wiki (2800 M tokens)
mkdir -p data/wiki-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.test.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.train.raw.bert-base-uncased.hdf5 -P data/wiki-cased
wget https://nlp.cs.unc.edu/data/vokenization/wiki-cased/en.valid.raw.bert-base-uncased.hdf5 -P data/wiki-cased
If you find our project useful, please cite this paper:
@inproceedings{tan2020vokenization,
title={Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision},
author={Tan, Hao and Bansal, Mohit},
booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
year={2020}
}
I thank the support from Bloomberg Data Science Ph.D. Fellowship. We thank the reviewers and Yixin Nie and Jie Lei for their helpful discussions. Part of the code are built based on huggingface transformers and facebook xlm and faiss.
4K3.