This is the source code that accompanies Multilingual Image Description with Neural Sequence Models. You can use it to train multilingual multimodal language models for image description.
Download a pre-processed version of the IAPRTC-12 dataset for English and German from Dropbox. Unzip the archives into `iaprtc12_eng` and `iaprtc12_ger`, respectively.
Run `python util/makejson.py --path iaprtc12_eng` followed by `python util/jsonmat2h5.py --path iaprtc12_eng` to create the `dataset.h5` file expected by GroundedTranslation. Repeat this process, replacing `eng` with `ger`, to create the German `dataset.h5` file.
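To sanity-check the conversion, you can walk the HDF5 hierarchy with `h5py`. This is only an inspection sketch; it makes no assumptions about the group or field names that `jsonmat2h5.py` writes:

```python
import h5py

# Print every group and dataset in the generated file so you can
# verify the splits and fields before training.
with h5py.File("iaprtc12_eng/dataset.h5", "r") as f:
    f.visititems(lambda name, obj: print(name, obj))
```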
Run

```
THEANO_FLAGS=floatX=float32,device=gpu0 python train.py --dataset iaprtc12_eng --hidden_size=256 --fixed_seed --run_string=fixed_seed-eng256mlm
```

to train an English Vision-to-Language one-layer LSTM. Training takes 500s/epoch on a Tesla K20X.
By default, this uses the `--optimiser=adam` optimiser, `--batch_size=100` instances per minibatch, `--big_batch=10000` instances per big batch, and `--l2reg=1e-8` weight regularisation. The hidden units have `--hidden_size=256` dimensions, with a dropout parameter of `--dropin=0.5` and an `--unk=3` threshold for pruning the word vocabulary.
This model should report a maximum BLEU4 of 15.21 (PPLX 6.898) on the val split, using a fixed seed of 1234.
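For reference, a minimal sketch of what fixed seeding typically looks like in a numpy/Theano training script (hypothetical; the repository's `--fixed_seed` flag is the authoritative mechanism):

```python
import random
import numpy as np

# Hypothetical illustration of fixed seeding for reproducible runs;
# the actual handling lives behind train.py's --fixed_seed flag.
SEED = 1234
random.seed(SEED)     # Python-level RNG (e.g. data shuffling)
np.random.seed(SEED)  # numpy RNG (e.g. weight initialisation)
```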
Run

```
THEANO_FLAGS=floatX=float32,device=gpu0 python train.py --dataset iaprtc12_ger --hidden_size=256 --fixed_seed --run_string=fixed_seed-ger256mlm
```

to train a German Vision-to-Language one-layer LSTM. Training takes 500s/epoch on a Tesla K20X.
The defaults are the same as for the English model: `--optimiser=adam`, `--batch_size=100`, `--big_batch=10000`, `--l2reg=1e-8`, `--hidden_size=256`, `--dropin=0.5`, and `--unk=3`.
This model should report a maximum BLEU4 of 11.91 (PPLX 9.347) on the val split, using a fixed seed of 1234.
Run

```
THEANO_FLAGS=floatX=float32,device=gpu0 python extract_hidden_features.py --dataset=iaprtc12_eng --model_checkpoints=PATH_TO_MODEL_CHECKPOINTS --hidden_size=256 --h5_writeable
```

to extract the final hidden state representations from a saved model. The representations are stored in `dataset/dataset.h5` in the `gold-hidden_feats-vis_enc-256` field.
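To confirm the features were written, you can walk the file and count entries under that field. This is a sketch; the exact group hierarchy around the field is an assumption, so it searches by suffix rather than by a fixed path:

```python
import h5py

FIELD = "gold-hidden_feats-vis_enc-256"
hits = []

def visit(name, obj):
    # Collect every HDF5 dataset whose path ends with the storage field.
    if isinstance(obj, h5py.Dataset) and name.endswith(FIELD):
        hits.append(name)

with h5py.File("iaprtc12_eng/dataset.h5", "r") as f:
    f.visititems(visit)

print(f"found {len(hits)} '{FIELD}' entries")
```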
You can add `--use_predicted_tokens`, `--hidden_size`, and `--no_image` to change the label of the storage field, as shown in the list and sketch below. Specifically, `--hidden_size` can only be varied with an appropriately trained model; `--no_image` can only be used with a model trained over word inputs alone; and `--use_predicted_tokens` only makes sense with an MLM.
* `--hidden_size=512` -> `gold-hidden_feats-vis_enc-512` (multimodal hidden features with 512 dimensions)
* `--use_predicted_tokens` -> `predicted-hidden_feats-vis_enc-256` (hidden features from predicted descriptions)
* `--no_image` -> `gold-hidden_feats-mt_enc-256` (LM-only hidden features)
If you want to train a German model with transferred features from English, run

```
THEANO_FLAGS=floatX=float32,device=gpu0 python train.py --dataset iaprtc12_ger --hidden_size=256 --fixed_seed --source_vectors=iaprtc12_eng --source_type=gold --source_enc=vis_enc --run_string=fixed_seed-eng256mlm-ger256mlm
```

to train a German one-layer LSTM conditioned on the English features.
The defaults are again `--optimiser=adam`, `--batch_size=100`, `--big_batch=10000`, `--l2reg=1e-8`, `--hidden_size=256`, `--dropin=0.5`, and `--unk=3`.
This model should report a maximum BLEU4 of 14.79 (PPLX 9.525) on the val split, using a fixed seed of 1234. This represents a 2.88 BLEU point improvement over the German monolingual baseline.
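Conceptually, `--source_vectors` conditions the target-language model on the source model's hidden features alongside the image. The snippet below is a loose illustration only, not the repository's implementation; the concatenation is just one plausible way to combine the inputs, and the dimensionalities are stand-ins:

```python
import numpy as np

# Illustrative only: combine an image feature vector with the
# source-language hidden features (gold-hidden_feats-vis_enc-256)
# into a single conditioning vector for the target-language LSTM.
img_feat = np.random.rand(4096).astype("float32")  # stand-in CNN image feature
src_feat = np.random.rand(256).astype("float32")   # stand-in source hidden state
conditioning = np.concatenate([img_feat, src_feat])
print(conditioning.shape)  # (4352,)
```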
In the other direction, train an English model with transferred German features:

```
THEANO_FLAGS=floatX=float32,device=gpu0 python train.py --dataset iaprtc12_eng --hidden_size=256 --fixed_seed --source_vectors=iaprtc12_ger --source_type=gold --source_enc=vis_enc --run_string=fixed_seed-ger256mlm-eng256mlm
```

This model should report a maximum BLEU4 of 19.78 (PPLX 6.148) on the val split, using a fixed seed of 1234. This represents a 4.57 BLEU point improvement over the English monolingual baseline.
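For quick reference, the expected val-split results with the fixed seed of 1234:

| Model | BLEU4 | PPLX |
| --- | --- | --- |
| English monolingual | 15.21 | 6.898 |
| German monolingual | 11.91 | 9.347 |
| German + English features | 14.79 | 9.525 |
| English + German features | 19.78 | 6.148 |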