Closed davidecaroselli closed 6 years ago
I think this is possibly a bug. I will retrain and post the correct options for the transformer.
Fix being merged as https://github.com/OpenNMT/OpenNMT-py/pull/232
Btw, the transformer is currently undocumented. We use the following to train a parsing model; you definitely need many of these options for it to work at all.
python train.py -data data_parse/parse -save_model /tmp/parse4_ -gpuid 1 \
    -encoder_type transformer -decoder_type transformer \
    -layers 4 -rnn_size 1024 -word_vec_size 1024 \
    -max_grad_norm 0 -optim adam -position_encoding -dropout 0.2 -param_init 0 \
    -batch_size 32 -epochs 40 -report_every 100 -warmup_steps 4000 -learning_rate 1 \
    -start_checkpoint_at 20 -decay_method noam
I haven't had a chance to document this yet, but if you want to contribute some docs, would be happy to add them.
Thanks @srush ! I'm trying it right now and I'll let you know the outcome. If everything works as expected, I will be happy to contribute to the docs.
Hi @srush
I've got some updates. I successfully trained a model on an actual data set (2M words, ~20h of training):
python train.py -data demo/data -save_model demo/model -gpuid 0 \
-encoder_type transformer -decoder_type transformer \
-layers 4 -rnn_size 1024 -word_vec_size 1024 \
-max_grad_norm 0 -optim adam -position_encoding -dropout 0.2 -param_init 0 \
-batch_size 32 -epochs 30 -report_every 100 -warmup_steps 4000 -learning_rate 1 -decay_method noam \
-start_checkpoint_at 20
The output of the training process is available here: training.txt.
From what I can see, everything seems to have gone well: both validation and training perplexity decrease. The problem arises when I try to run a translation with the command:
python translate.py -gpu 0 -model demo/model_*_e30.pt \
-src ~/data/test/corpus.en \
-tgt ~/data/test/corpus.it \
-replace_unk -verbose -output demo/translations.it
The system outputs only 1-5 words per line, BLEU score is terrible. I'm reporting here some examples just to show you what I mean:
SENT 2318: The file size is not valid .
PRED 2318: .
PRED SCORE: -1.1778
GOLD 2318: Le dimensioni del file non sono valide .
GOLD SCORE: -22.9622
SENT 2322: We are sorry that something went wrong .
PRED 2322: .
PRED SCORE: -3.3032
GOLD 2322: Spiacenti , si è verificato un errore .
GOLD SCORE: -18.8555
SENT 2369: Select the type of document you 'd like to upload
PRED 2369: al
PRED SCORE: -1.6003
GOLD 2369: Seleziona il tipo di documento che vuoi caricare
GOLD SCORE: -34.0720
SENT 2383: Phone number ( mobile , work , home )
PRED 2383: ( )
PRED SCORE: -1.5090
GOLD 2383: Numero di telefono ( cellulare , lavoro , abitazione )
GOLD SCORE: -52.0793
What can I do to try to fix the problem?
Oh, this looks terrible. Can you send me your model? I'll try to diagnose.
Also, can you try running with --n_best 10? It might fix the short-sentences issue.
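The near-empty outputs above are consistent with the usual length bias of beam search: a hypothesis score is a sum of negative token log-probabilities, so every extra token makes the total more negative and very short hypotheses win by default. A minimal sketch, using the PRED/GOLD scores quoted earlier in this thread (token counts are approximate), shows the raw totals next to per-token averages, which remove part of that bias:

```python
# Illustrative sketch of beam-search length bias. The scores are the
# PRED/GOLD log-prob sums quoted above; token counts are approximate.
hypotheses = [
    ("PRED: '.'", -1.1778, 1),
    ("GOLD: 'Le dimensioni del file non sono valide .'", -22.9622, 8),
]

for label, total_logprob, n_tokens in hypotheses:
    # Raw score: sum of log-probs, penalizes every extra token.
    # Per-token score: dividing by length makes lengths comparable.
    print(f"{label}: raw={total_logprob:.4f}, "
          f"per-token={total_logprob / n_tokens:.4f}")
```

In this degenerate case even the per-token average still favors the one-token prediction, so the model itself is clearly broken, but the same arithmetic explains why an under-trained transformer collapses to the shortest output the beam allows.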
Unfortunately I cannot share this model, but I'm currently re-running a training with public data and it should be ready by tomorrow. I will test that and share all the data if the problem arises again.
BTW, how do you prefer to receive the model? It's a 2+ GB file and I don't know if GitHub will allow me to upload that.
I am training my own, let me see if I can replicate the bug.
Hi @srush
I ran a training until the 24th epoch and I observe the same issue. Actually it's even worse: all the translations are identical: "Ja ." (it's English to German). This time the training perplexity decreases, while the validation perplexity gets higher and higher.
Here you can find all files (training and models): transformer-test.zip
thank you!
Can you paste the log files? What perplexity are you seeing? When does the validation perplexity start getting higher? Are you running with dropout?
Hi @srush, in the zip file you'll find everything: the log file, the whole training folder (including test and dev sets), the model at the 24th epoch, and the script I used to preprocess/train/translate (onmt_transformer.sh).
Hi @srush
any news? Is there something I can do in order to help debugging it?
Haven't had a chance to look yet. I will get to it.
Hmm, all your options look correct, but you can tell after the first epoch that training is not happening: 200 ppl is terrible. I will try running it locally and see what is happening.
Okay, I think I see the issue: your learning rate is getting too high around batch 3000. I am trying another run with a size of 512 and warmup_steps 16000. I will let you know how that goes.
It seems this was the issue. I'm already below 100 ppl with your options after changing warmup_steps to 16000 and rnn_size and word_vec_size to 512.
Epoch 2, 1900/ 5879; acc: 35.29; ppl: 78.15; 3641 src tok/s; 3602 tgt tok/s; 304 s elapsed
Epoch 2, 2000/ 5879; acc: 35.45; ppl: 79.34; 3711 src tok/s; 3645 tgt tok/s; 320 s elapsed
Epoch 2, 2100/ 5879; acc: 35.21; ppl: 79.57; 3698 src tok/s; 3649 tgt tok/s; 336 s elapsed
Epoch 2, 2200/ 5879; acc: 35.28; ppl: 81.19; 3718 src tok/s; 3660 tgt tok/s; 352 s elapsed
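For reference, the noam decay selected by -decay_method noam follows the Transformer paper's schedule: the rate scales as d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), rising linearly during warmup and then decaying as the inverse square root of the step. The peak is reached exactly at step == warmup_steps, so growing either d_model or warmup_steps lowers it, which is why 512/16000 trains more stably here than 1024/4000. A quick sketch, assuming this formula (OpenNMT-py's implementation may differ by a constant factor):

```python
def noam_lr(step, d_model, warmup_steps, factor=1.0):
    # Noam schedule: linear warmup for `warmup_steps` steps,
    # then inverse-sqrt decay, both scaled by d_model ** -0.5.
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)

# Peak learning rate is hit at step == warmup_steps.
print(noam_lr(4000, d_model=1024, warmup_steps=4000))    # higher peak, reached sooner
print(noam_lr(16000, d_model=512, warmup_steps=16000))   # lower peak, reached later
```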
Hi @srush,
Unfortunately I'm having a new kind of bug when I try to translate with transformer. The script is always the same but now, starting with commit 0de9c039367b9f8ff3adf30f9e07196d1e2c6016, I'm seeing this error:
While copying the parameter named decoder.embeddings.make_embedding.pe.pe, whose dimensions in the model are torch.Size([5000, 1, 512]) and whose dimensions in the checkpoint are torch.Size([5000, 1, 512]), ...
Traceback (most recent call last):
File "translate.py", line 153, in <module>
main()
File "translate.py", line 77, in main
translator = onmt.Translator(opt, dummy_opt.__dict__)
File "/home/ubuntu/workspace/OpenNMT-py/onmt/Translator.py", line 28, in __init__
opt, model_opt, self.fields, checkpoint)
File "/home/ubuntu/workspace/OpenNMT-py/onmt/ModelConstructor.py", line 161, in make_base_model
model.load_state_dict(checkpoint['model'])
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 360, in load_state_dict
own_state[name].copy_(param)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 65, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Variable' object has no attribute 'copy_'
This "no attribute copy_" problem is the same as this one: https://github.com/OpenNMT/OpenNMT-py/issues/256. Fixed in https://github.com/OpenNMT/OpenNMT-py/commit/dffd1311604cbc9396500418f326dfe4fc5a6f96
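For anyone stuck on an older commit, one possible workaround (a sketch of the idea, not the actual fix in the commit linked above) is to unwrap any autograd Variables in the checkpoint's state dict back to plain tensors before calling load_state_dict, since Variables lack the in-place copy_ method it relies on. The helper name below is hypothetical:

```python
def unwrap_variables(state_dict):
    """Return a copy of state_dict with Variable-like values (anything
    exposing a .data attribute) replaced by that underlying tensor, so
    load_state_dict can copy_ into the model's parameters.
    Sketch only; the real fix landed upstream."""
    return {
        name: value.data if hasattr(value, "data") else value
        for name, value in state_dict.items()
    }

# Usage sketch (checkpoint loading as in onmt/ModelConstructor.py):
# checkpoint = torch.load(model_path, map_location=lambda storage, loc: storage)
# model.load_state_dict(unwrap_variables(checkpoint['model']))
```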
Hello,
I was testing the Transformer model thanks to the support given in this thread: https://github.com/OpenNMT/OpenNMT-py/issues/177. Unfortunately I am not able to translate anything with the Transformer model. I took the time to build a super simple, fast test to reproduce the problem; the data to train the engine is available here: preprocessed_corpora.tar.gz
I'm running this script:
The output is as expected: the translations.it file has 30 lines with very poor translations. That's OK, the engine works, it just needs more data. However, if I replace the second instruction with:
I am able to complete the training, but when I try to translate the test set I get this error:
Thanks in advance for your help.