Corrupted vocab index when running the transformer script

eric-haibin-lin commented 5 years ago

Description

Running the transformer training script with the default command produces a corrupted-index warning. The warning is misleading because users cannot tell whether the script still works, and it should be fixed.

Error Message

2019-10-21 04:21:20,809 - root - Namespace(average_checkpoint=False, average_start=5, batch_size=2700, beam_size=4, bleu='13a', bucket_ratio=0.0, bucket_scheme='exp', dataset='WMT2014BPE', dropout=0.1, epochs=30, epsilon=0.1, full=False, gpus='0,1,2,3,4,5,6,7', hidden_size=2048, log_interval=1, lp_alpha=0.6, lp_k=5, lr=2.0, magnitude=3.0, num_accumulated=16, num_averages=5, num_buckets=20, num_heads=8, num_layers=6, num_units=512, optimizer='adam', save_dir='transformer_en_de_u512', scaled=True, src_lang='en', src_max_len=-1, test_batch_size=256, tgt_lang='de', tgt_max_len=-1, warmup_steps=4000.0)
/home/ubuntu/benchmark/gluon-nlp/src/gluonnlp/vocab/vocab.py:582: UserWarning: Detected a corrupted index in the deserialize vocabulary. For versions before GluonNLP v0.7 the index is corrupted by specifying the same token for different special purposes, for example eos_token == padding_token. Deserializing the vocabulary nevertheless.
  'Detected a corrupted index in the deserialize vocabulary. '

To Reproduce

MXNET_GPU_MEM_POOL_TYPE=Round python train_transformer.py --dataset WMT2014BPE --src_lang en --tgt_lang de --batch_size 2700 --optimizer adam --num_accumulated 16 --lr 2.0 --warmup_steps 4000 --save_dir transformer_en_de_u512 --epochs 30 --gpus 0,1,2,3,4,5,6,7 --scaled --average_start 5 --num_buckets 20 --bucket_scheme exp --bleu 13a --log_interval 1
All Logs will be saved to transformer_en_de_u512/train_transformer.log

Steps to reproduce

Run the command above.

Environment

gluonnlp commit = 76ca4d7

leezu commented 5 years ago

If you run assert 1 == src_vocab.token_to_idx[src_vocab.idx_to_token[1]], you will get an AssertionError. The loaded vocabulary does not satisfy the invariants of a gluonnlp.Vocab, which is why the warning is printed.
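To illustrate the round-trip property that this assert tests, here is a minimal sketch using plain Python lists and dicts rather than a real gluonnlp.Vocab. The helper name find_inconsistent_indices and the toy vocabulary are assumptions made for illustration; they are not the actual contents of the WMT2014BPE vocabulary.

```python
def find_inconsistent_indices(idx_to_token, token_to_idx):
    """Return every index i where token_to_idx[idx_to_token[i]] != i."""
    return [
        i for i, token in enumerate(idx_to_token)
        if token_to_idx.get(token) != i
    ]


# Toy vocabulary (made up): the same token occupies two slots,
# similar to using one token for two special purposes.
idx_to_token = ["<unk>", "<eos>", "<eos>", "hello"]

# When the reverse map is built, the later duplicate overwrites the earlier one.
token_to_idx = {token: i for i, token in enumerate(idx_to_token)}

bad = find_inconsistent_indices(idx_to_token, token_to_idx)
print(bad)  # indices that fail the round trip
```

In this toy case index 1 fails the round trip, because the token stored at index 1 maps back to index 2. An empty result would mean every index survives the round trip.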

eric-haibin-lin commented 5 years ago

@szhengac shall we change the downloaded vocab?

szhengac commented 5 years ago

I remember that the vocab in wmt14en-de does not specify the eos and pad tokens.

leezu commented 5 years ago

Changing the downloaded vocab will change the index mapping, as there is currently one invalid token at the beginning of idx_to_token. The embedding weights in the model files would therefore also need to be updated.
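A rough sketch of why the weights are affected, again in plain Python with made-up data: dropping a leading entry shifts every later token down by one, so each embedding row has to move along with its token. The function name remap_embedding_rows, the "<invalid>" placeholder, and the 2-dimensional weights are all hypothetical, and the sketch assumes tokens in the old list are unique.

```python
def remap_embedding_rows(old_idx_to_token, new_idx_to_token, old_weights):
    """Reorder embedding rows so that row i belongs to new_idx_to_token[i].

    Assumes every token in old_idx_to_token is unique.
    """
    old_token_to_idx = {token: i for i, token in enumerate(old_idx_to_token)}
    return [list(old_weights[old_token_to_idx[token]]) for token in new_idx_to_token]


# Toy data (made up): one invalid entry at the start of the old vocabulary.
old_idx_to_token = ["<invalid>", "<unk>", "hello", "world"]
old_weights = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Dropping the invalid entry shifts every remaining token down by one index.
new_idx_to_token = old_idx_to_token[1:]
new_weights = remap_embedding_rows(old_idx_to_token, new_idx_to_token, old_weights)

print(old_idx_to_token.index("hello"), new_idx_to_token.index("hello"))
print(new_weights)
```

Without such a remapping, a model loaded with the new vocabulary would look up each token in the row that previously belonged to its neighbor.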
