OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Transformer model sensitive to learning rate (reasonable defaults) #227

Closed davidecaroselli closed 6 years ago

davidecaroselli commented 6 years ago

Hello,

I was testing the Transformer model thanks to the support given in this thread: https://github.com/OpenNMT/OpenNMT-py/issues/177. Unfortunately I am not able to translate anything with the "Transformer" model. I took the time to build a super simple, fast test in order to reproduce the problem; the data to train the engine is available here: preprocessed_corpora.tar.gz

I'm running this script:

python preprocess.py \
    -train_src ~/preprocessed_corpora/europarl.train.en \
    -train_tgt ~/preprocessed_corpora/europarl.train.it \
    -valid_src ~/preprocessed_corpora/europarl.dev.en   \
    -valid_tgt ~/preprocessed_corpora/europarl.dev.it   \
    -save_data demo/data

python train.py -data demo/data -save_model demo/model -gpuid 0

python translate.py -gpu 0 -model demo/model_*_e13.pt   \
    -src ~/preprocessed_corpora/europarl.test.en        \
    -tgt ~/preprocessed_corpora/europarl.test.it        \
    -replace_unk -verbose -output translations.it

The output is as expected: the translations.it file has 30 lines of very poor translations. That's fine, the engine works, it just needs more data.

However if I replace the second instruction with:

python train.py -data demo/data -save_model demo/model -gpuid 0  \
    -encoder_type transformer -decoder_type transformer          \
    -word_vec_size 512 -rnn_size 512

I am able to complete the training, but when I try to translate the test set I get this error:

Loading model
Traceback (most recent call last):
  File "translate.py", line 151, in <module>
    main()
  File "translate.py", line 94, in main
    = translator.translate(batch, data)
  File "/home/ubuntu/workspace/OpenNMT-py/onmt/Translator.py", line 188, in translate
    pred, predScore, attn, goldScore = self.translateBatch(batch, data)
  File "/home/ubuntu/workspace/OpenNMT-py/onmt/Translator.py", line 137, in translateBatch
    self.model.decoder(inp, src, context, decStates)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/workspace/OpenNMT-py/onmt/Models.py", line 352, in forward
    if state.previous_input:
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 123, in __bool__
    torch.typename(self.data) + " is ambiguous")
RuntimeError: bool value of Variable objects containing non-empty torch.cuda.LongTensor is ambiguous

Thanks in advance for your help.
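
For reference, the RuntimeError above comes from using a multi-element tensor (wrapped in a Variable here) in a boolean context: "if state.previous_input:" only has a well-defined truth value when the tensor is empty or has a single element. Below is a minimal sketch of the ambiguity and of the kind of explicit None/emptiness check that avoids it; the names are illustrative and this is not the actual fix merged in #232:

import torch

prev_input = torch.LongTensor([3, 17, 42])   # stand-in for state.previous_input

# This is the pattern that raises the error above: a tensor with more than one
# element has no single boolean value, so "if prev_input:" is ambiguous.
# if prev_input:
#     ...

# Explicit check instead of relying on truthiness:
if prev_input is not None and prev_input.numel() > 0:
    print("previous input available:", prev_input.tolist())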

srush commented 6 years ago

I think this is possibly a bug. I will retrain and post the correct options for the transformer.

srush commented 6 years ago

Fix being merged as https://github.com/OpenNMT/OpenNMT-py/pull/232

Btw, the transformer is currently undocumented. We use the following to train a parsing model.

You definitely need many of these options for it to work at all.

python train.py -data data_parse/parse -save_model /tmp/parse4_             \
    -layers 4 -rnn_size 1024 -word_vec_size 1024 -batch_size 32 -epochs 40  \
    -gpuid 1 -report_every 100 -max_grad_norm 0 -optim adam                 \
    -encoder_type transformer -decoder_type transformer -position_encoding  \
    -dropout 0.2 -param_init 0 -warmup_steps 4000 -learning_rate 1          \
    -start_checkpoint_at 20 -decay_method noam

I haven't had a chance to document this yet, but if you want to contribute some docs, would be happy to add them.
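
For anyone copying these options: -decay_method noam with -learning_rate 1 and -warmup_steps 4000 means the effective learning rate follows the schedule from the "Attention Is All You Need" paper, with the -learning_rate value acting (as far as I can tell) as a constant multiplier. The rate ramps up linearly during warmup and then decays with the inverse square root of the step. A small standalone sketch of that schedule, not OpenNMT-py's own optimizer code:

def noam_rate(step, model_dim, warmup_steps, factor=1.0):
    # step counts optimizer updates, starting at 1.
    # Linear warmup for the first warmup_steps updates, then inverse-sqrt decay;
    # the peak is reached exactly at step == warmup_steps.
    return factor * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

With -rnn_size 1024 and -warmup_steps 4000 the rate climbs until roughly step 4000 and slowly falls afterwards, which is why the warmup value matters so much later in this thread.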

davidecaroselli commented 6 years ago

Thanks @srush ! I'm trying it right now and I'll let you know the outcome. If everything works as expected, I will be happy to contribute to the docs.

davidecaroselli commented 6 years ago

Hi @srush

I've got some updates. I successfully trained a model on an actual data set (2M words, ~20h of training):

python train.py -data demo/data -save_model demo/model -gpuid 0                                        \
    -encoder_type transformer -decoder_type transformer                                                \
    -layers 4 -rnn_size 1024  -word_vec_size 1024                                                      \
    -max_grad_norm 0 -optim adam -position_encoding -dropout 0.2 -param_init 0                         \
    -batch_size 32 -epochs 30 -report_every 100 -warmup_steps 4000 -learning_rate 1 -decay_method noam \
    -start_checkpoint_at 20 

The output of the training process is available here: training.txt.

From what I can see, it seems that everything went well: both training and validation perplexity decrease. The problem arises when I try to run a translation with this command:

python translate.py -gpu 0 -model demo/model_*_e30.pt \
    -src ~/data/test/corpus.en  \
    -tgt ~/data/test/corpus.it \
    -replace_unk -verbose -output demo/translations.it   

The system outputs only 1-5 words per line and the BLEU score is terrible. Here are some examples to show what I mean:

SENT 2318: The file size is not valid .
PRED 2318: .
PRED SCORE: -1.1778
GOLD 2318: Le dimensioni del file non sono valide .
GOLD SCORE: -22.9622

SENT 2322: We are sorry that something went wrong .
PRED 2322: .
PRED SCORE: -3.3032
GOLD 2322: Spiacenti , si è verificato un errore .
GOLD SCORE: -18.8555

SENT 2369: Select the type of document you 'd like to upload
PRED 2369: al
PRED SCORE: -1.6003
GOLD 2369: Seleziona il tipo di documento che vuoi caricare
GOLD SCORE: -34.0720

SENT 2383: Phone number ( mobile , work , home )
PRED 2383: ( )
PRED SCORE: -1.5090
GOLD 2383: Numero di telefono ( cellulare , lavoro , abitazione )
GOLD SCORE: -52.0793

What can I do to try to fix the problem?

srush commented 6 years ago

Oh, this looks terrible. Can you send me your model? I'll try to diagnose it.

Also, can you try running with -n_best 10? It might fix the short-sentence issue.

davidecaroselli commented 6 years ago

Unfortunately I cannot share this model, but I'm currently re-running a training with public data and it should be ready by tomorrow. I will test that and share all the data if the problem arises again.

BTW, how do you prefer to receive the model? It's a 2GB+ file and I don't know if GitHub will allow me to upload that.

srush commented 6 years ago

I am training my own, let me see if I can replicate the bug.

davidecaroselli commented 6 years ago

Hi @srush

I ran a training until the 24th epoch and I observe the same issue. Actually it's even worse: all translations are the same, "Ja ." (it's English to German). This time the training perplexity decreases, while the validation set perplexity gets higher and higher.

Here you can find all files (training and models): transformer-test.zip

thank you!

srush commented 6 years ago

Can you paste the log files? What perplexity are you seeing? When does the validation perplexity start getting higher? Are you running with dropout?

davidecaroselli commented 6 years ago

Hi @srush, in the zip file you'll find everything: the log file, the whole training folder (including the test and dev sets), the model at the 24th epoch, and the script I used to preprocess/train/translate (it's called onmt_transformer.sh).

davidecaroselli commented 6 years ago

Hi @srush

Any news? Is there something I can do to help debug it?

srush commented 6 years ago

Haven't had a chance to look yet. I will get to it.

srush commented 6 years ago

Hmm, all your options look correct, but you can tell after the first epoch that training is not happening. 200 ppl is terrible. I will try running it locally and see what is happening.

srush commented 6 years ago

Okay, I think I see the issue. Your learning rate is getting too high at batch 3000. I am trying another run with a model size of 512 and warmup_steps 16000. I will let you know how that goes.

srush commented 6 years ago

Seems this was the issue. I'm already below 100 ppl with your options, changing warmup_steps to 16000 and rnn_size and word_vec_size to 512.

Epoch  2,  1900/ 5879; acc:  35.29; ppl:  78.15; 3641 src tok/s; 3602 tgt tok/s;    304 s elapsed
Epoch  2,  2000/ 5879; acc:  35.45; ppl:  79.34; 3711 src tok/s; 3645 tgt tok/s;    320 s elapsed
Epoch  2,  2100/ 5879; acc:  35.21; ppl:  79.57; 3698 src tok/s; 3649 tgt tok/s;    336 s elapsed
Epoch  2,  2200/ 5879; acc:  35.28; ppl:  81.19; 3718 src tok/s; 3660 tgt tok/s;    352 s elapsed
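
To put numbers on "the learning rate is getting too high at batch 3000": evaluating the noam schedule sketched above at step 3000 (assuming, loosely, one optimizer update per reported batch) shows how much the changed settings lower the rate during warmup. These values come from the schedule formula, not from OpenNMT-py's logs:

# noam rate during warmup: model_dim ** -0.5 * step * warmup_steps ** -1.5
print(1024 ** -0.5 * 3000 * 4000 ** -1.5)    # ~3.7e-4  original run (rnn_size 1024, warmup_steps 4000)
print(512 ** -0.5 * 3000 * 16000 ** -1.5)    # ~6.6e-5  changed run  (rnn_size 512,  warmup_steps 16000)

The peak rate, reached at step == warmup_steps, also drops from roughly 4.9e-4 to roughly 3.5e-4.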

davidecaroselli commented 6 years ago

Hi @srush,

Unfortunately I'm hitting a new kind of bug when I try to translate with the transformer. The script is the same as before, but now, starting with commit 0de9c039367b9f8ff3adf30f9e07196d1e2c6016, I'm seeing this error:

While copying the parameter named decoder.embeddings.make_embedding.pe.pe, whose dimensions in the model are torch.Size([5000, 1, 512]) and whose dimensions in the checkpoint are torch.Size([5000, 1, 512]), ...
Traceback (most recent call last):
  File "translate.py", line 153, in <module>
    main()
  File "translate.py", line 77, in main
    translator = onmt.Translator(opt, dummy_opt.__dict__)
  File "/home/ubuntu/workspace/OpenNMT-py/onmt/Translator.py", line 28, in __init__
    opt, model_opt, self.fields, checkpoint)
  File "/home/ubuntu/workspace/OpenNMT-py/onmt/ModelConstructor.py", line 161, in make_base_model
    model.load_state_dict(checkpoint['model'])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 360, in load_state_dict
    own_state[name].copy_(param)
  File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 65, in __getattr__

    return object.__getattribute__(self, name)
AttributeError: 'Variable' object has no attribute 'copy_'

JianyuZhan commented 6 years ago

This "no attribute copy_" problem is the same as this one: https://github.com/OpenNMT/OpenNMT-py/issues/256. It was fixed in https://github.com/OpenNMT/OpenNMT-py/commit/dffd1311604cbc9396500418f326dfe4fc5a6f96.
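
For readers hitting the same trace: load_state_dict fails while trying to copy_ into the stored positional-encoding table (decoder.embeddings.make_embedding.pe.pe, shape [5000, 1, 512]). I haven't inspected the fix commit line by line, but the usual way to keep a fixed sinusoidal table compatible with checkpoint loading is to register it as a buffer of plain tensors rather than keeping it as a Variable attribute. A generic sketch of that pattern follows; it is an illustration, not the actual OpenNMT-py module:

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Fixed sinusoidal position table, stored via register_buffer so that
    state_dict() saves it and load_state_dict() restores it as a plain tensor."""
    def __init__(self, dim, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, dim)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, dim, 2).float() * -(math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(1))  # (max_len, 1, dim), matching the shape in the error

    def forward(self, emb):
        # emb: (seq_len, batch, dim); add the matching slice of the table.
        return emb + self.pe[:emb.size(0)]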