awslabs / sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
https://awslabs.github.io/sockeye/
Apache License 2.0

Could not replicate results obtained with OpenNMT-py #16

Closed mattiadg closed 7 years ago

mattiadg commented 7 years ago

Hi all,

I've run a training on a private small dataset (~200K parallel sentences) trying to use the same hyperparameters for both OpenNMT-py and Sockeye, as far as it was possible.

After 50 epochs I stopped the training with OpenNMT-py, which had a validation perplexity of 3.03, while after 50 epochs the validation perplexity was more than 400 with Sockeye. The training continued until epoch 137, where the validation perplexity was still more than 200.
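As a reminder of what these numbers measure: validation perplexity is just the exponential of the average per-token cross-entropy on the validation set. A minimal sketch with made-up tensors (not either toolkit's actual code):

```python
# Illustrative only: perplexity = exp(mean per-token cross-entropy).
import math
import torch
import torch.nn.functional as F

num_tokens, vocab = 1000, 32000                   # made-up validation size and vocab
logits = torch.randn(num_tokens, vocab)           # model scores for each target token
targets = torch.randint(0, vocab, (num_tokens,))  # reference target tokens

mean_nll = F.cross_entropy(logits, targets, reduction="mean")
print(math.exp(mean_nll.item()))                  # huge for random scores; small for a trained model
```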

The following are the training commands:

OpenNMT-py

python $OpenNMTpy/train.py -data data/train.pt -save_model models/output -brnn -batch_size 120 -epochs 50 -start_epoch=1 -optim sgd -learning_rate 1 -learning_rate_decay 0.9 -start_decay_at 9 -gpus 0 -dropout 0.3 -brnn_merge sum

Sockeye

python -m sockeye.train --source data/train.en --target data/train.it --validation-source data/dev.en --validation-target data/dev.it --output models --device-ids 1 --rnn-num-layers 2 --rnn-num-hidden 500 --num-embed 500 --max-seq-len 50 --batch-size 120 --dropout 0.3 --optimizer sgd --initial-learning-rate 1.0 --learning-rate-reduce-factor 0.9 --clip-gradient 5 --attention-type 'dot' --use-fused-rnn --normalize-loss

Something to point out:

I'll try with a publicly available dataset to provide you with more information.

UPDATE: I'm trying with IWSLT2016 En-Fr and the same training commands, but the situation is exactly the same.

tdomhan commented 7 years ago

Hi! Generally, a learning rate of 1.0 is incredibly high and I'm not surprised that the model diverges. I'm not exactly sure how OpenNMT-py normalizes the loss, which is very important to know when trying to compare the two setups. By default Sockeye does not normalize the loss, which means that the magnitude of the loss scales with the number of target words in a batch. For this reason we would generally choose a much smaller learning rate (see the default value of 0.0003). When --normalize-loss is turned on, the loss becomes independent of the batch size and one can use a larger learning rate. Effectively, a normalized loss with the learning rate multiplied by the number of words in a batch should be equivalent to not normalizing the loss.

Note also that, from what I know, MXNet clips gradients based on the absolute value rather than based on the norm. The norm clipping in PyTorch, on the other hand, rescales the gradient to make sure its norm is bounded, which in turn could also allow for higher learning rates.

So, long story short: because of these implementation details you might need different hyperparameters for the two systems. Specifically, a lower learning rate would probably make sense.
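To make the normalization point concrete, here is a minimal PyTorch sketch (illustrative only, not Sockeye or OpenNMT-py internals; tensor sizes are made up) showing that a token-normalized loss with the learning rate scaled up by the number of target tokens produces the same SGD update as the summed, unnormalized loss:

```python
# Illustrative only: normalized loss + scaled learning rate == unnormalized loss + small learning rate.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_tokens, vocab = 120, 32                       # made-up target-token count and vocab size
logits = torch.randn(num_tokens, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (num_tokens,))

# Unnormalized loss: summed over all target tokens in the batch.
grad_sum = torch.autograd.grad(F.cross_entropy(logits, targets, reduction="sum"), logits)[0]
# Normalized loss: averaged over target tokens.
grad_mean = torch.autograd.grad(F.cross_entropy(logits, targets, reduction="mean"), logits)[0]

lr = 0.0003
update_unnormalized = lr * grad_sum               # summed loss, small learning rate
update_normalized = (lr * num_tokens) * grad_mean # mean loss, learning rate scaled by token count
print(torch.allclose(update_unnormalized, update_normalized))  # True
```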

mattiadg commented 7 years ago

Ok, once I realized I could not replicate the same settings, I started a training run with Adam and an initial learning rate of 0.003. The perplexity is now going down, still more slowly than in the other case, but at least it seems reasonable.

I'll update this thread at the end of the training.

Thanks.

mattiadg commented 7 years ago

The training with these parameters has finished, but Sockeye's output is about 2 BLEU points below OpenNMT-py (35.73 vs 37.84).

For completeness, this is the script I've run:

python -m sockeye.train --source $data/train.$src --target $data/train.$tgt --validation-source $data/dev.$src --validation-target $data/dev.$tgt --output $output_dir --device-ids 1 --rnn-num-layers 2 --rnn-num-hidden 500 --num-embed 500 --max-seq-len 50 --batch-size 120 --dropout 0.3 --optimizer adam --initial-learning-rate 0.0003 --learning-rate-reduce-factor 0.9 --clip-gradient 5 --attention-type 'dot' --use-fused-rnn

I should have dropped "--learning-rate-reduce-factor 0.9", but I hope it does not affect the result.

tdomhan commented 7 years ago

Does OpenNMT also use two layers by default? With a dataset as small as IWSLT it might be easier to converge with a smaller network. From what I can tell from the OpenNMT-py code, its attention mechanism corresponds to what is called 'mlp' in Sockeye. While 'dot' is usually faster, you also normally lose a bit in terms of BLEU.
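For anyone comparing the two, here is a rough sketch of the difference between the two score functions (shapes are made up; this is not the actual Sockeye or OpenNMT-py code): 'dot' scores each encoder state with a plain dot product against the decoder state, while 'mlp' (Bahdanau-style) projects both states, adds them, applies tanh, and scores with a learned vector.

```python
# Rough sketch of 'dot' vs 'mlp' attention scores (illustrative only).
import torch

hidden = 500                                      # matches --rnn-num-hidden above
decoder_state = torch.randn(1, hidden)            # current decoder hidden state
encoder_states = torch.randn(50, hidden)          # one state per source position

# 'dot' attention: plain dot product between decoder and encoder states.
dot_scores = encoder_states @ decoder_state.t()   # shape (50, 1)

# 'mlp' attention: project both, add, squash with tanh, score with a learned vector v.
W_dec = torch.randn(hidden, hidden)
W_enc = torch.randn(hidden, hidden)
v = torch.randn(hidden, 1)
mlp_scores = torch.tanh(decoder_state @ W_dec + encoder_states @ W_enc) @ v  # shape (50, 1)

# Either score vector is then softmax-normalized into attention weights
# and used to build the weighted source context.
weights = torch.softmax(mlp_scores, dim=0)
context = (weights * encoder_states).sum(dim=0)
```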

mattiadg commented 7 years ago

Ok, then I'll try with mlp attention.

The number of layers is two in both cases.

mattiadg commented 7 years ago

@tdomhan I've tried with mlp attention, but it is still 2 BLEU points below OpenNMT-py. Have you compared the performance on bigger datasets?

tdomhan commented 7 years ago

It's hard to figure out why that is, as there are probably some other subtle differences in how the models are implemented. I'll close this issue for now.