EdinburghNLP / nematus

Open-Source Neural Machine Translation in Tensorflow
BSD 3-Clause "New" or "Revised" License

How to train with multiple GPUs without lowering the training rate? #59

Closed 520jefferson closed 6 years ago

520jefferson commented 6 years ago

I train the model on a K40 with GPUs 0-4, but the rate is only about 40 sents/s per card. That rate is very low compared to training with a single card.

batch_size=128 maxlen=50

rsennrich commented 6 years ago

The master branch of Nematus doesn't currently have multi-GPU support, but there is some experimental code that may be merged in soon.

520jefferson commented 6 years ago

@rsennrich When I train with Nematus, if I kill the training process by its GPU PID (with reload set to True) and then start training again, will the training rate be lower? I have a feeling that it is.

rsennrich commented 6 years ago

If you don't change the configuration, Nematus will continue training with the same learning rate. Even if you use an optimizer with adaptive learning rates, such as adam, Nematus stores the information that is necessary to continue with the same learning rates (in model.npz.gradinfo.npz ).
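A minimal restart sketch of what this looks like in practice (hypothetical model path; assuming a `--reload` command-line flag as in recent Nematus interfaces, with all other options kept identical to the first run):

```shell
# Restart training with the same configuration. With reload enabled,
# Nematus picks up the latest checkpoint (model.npz) and restores the
# optimizer's per-parameter state from model.npz.gradinfo.npz, so
# adaptive learning rates (e.g. Adam's moment estimates) continue
# exactly where they left off.
python nmt.py \
    --model model.npz \
    --reload \
    ...   # all remaining training options unchanged from the first run
```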

520jefferson commented 6 years ago

@rsennrich I have two questions. First, when decoding I use --device-list gpu0 gpu1 gpu2 gpu3 gpu4, but it actually uses only gpu0. Second, decoding with translate.py takes a very long time.

My decode settings are as follows:

```shell
export THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32
python translate.py \
    -k 11 \
    -p 1 \
    -n 1 \
    --models $MODEL/model.iter468000.npz \
    -i /home/mt-srcb/work/nmt/ai_raml/test/valid.en \
    -o /home/mt-srcb/work/nmt/ai_raml/test/ai.valid.out \
    --device-list gpu0 gpu1 gpu2 gpu3 gpu4
```

My training settings are as follows:

```shell
--layer_normalisation \
--tie_decoder_embeddings \
--enc_depth 4 \
--dec_depth 4 \
--dec_deep_context \
--enc_recurrence_transition_depth 2 \
--dec_base_recurrence_transition_depth 4 \
--dec_high_recurrence_transition_depth 2
```

bricksdont commented 6 years ago

Hi,

Use a number higher than 1 for the -p command-line option. Each of those processes will then bind to a different GPU device, provided that 1) you used --device-list to specify several devices and 2) several devices are available on your system.
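For example, a sketch based on your command from above, with -p raised to 5 so that each of the five processes can bind to one of the five listed devices (paths and model name are yours, unchanged):

```shell
export THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32
# -p 5 spawns five translation processes; with --device-list naming five
# GPUs, each process should bind to a different card.
python translate.py \
    -k 11 \
    -p 5 \
    -n 1 \
    --models $MODEL/model.iter468000.npz \
    -i /home/mt-srcb/work/nmt/ai_raml/test/valid.en \
    -o /home/mt-srcb/work/nmt/ai_raml/test/ai.valid.out \
    --device-list gpu0 gpu1 gpu2 gpu3 gpu4
```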

Side note, you could install and use the new gpuarray backend for theano, then use device=cuda in your flags. See https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29 and http://deeplearning.net/software/libgpuarray/installation.html.
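With the gpuarray backend installed, the flags change only in the device name (a sketch; under the new back end, devices are named cudaN rather than gpuN):

```shell
# Select any available GPU under the gpuarray back end:
export THEANO_FLAGS=mode=FAST_RUN,device=cuda,floatX=float32
# Or pin to a specific card:
export THEANO_FLAGS=mode=FAST_RUN,device=cuda0,floatX=float32
```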

Finally, for fast decoding, people have used Marian which is compatible with some Nematus models. Perhaps worthwhile for you to try, if performance is a priority: https://marian-nmt.github.io/.

520jefferson commented 6 years ago

@bricksdont
By Marian, do you mean the s2s model type could decode deep Nematus models?

bricksdont commented 6 years ago

I did not try this myself, but at least deep models are listed as a feature of S2S: https://marian-nmt.github.io/features/. But you seem to have some experience with Marian development yourself, did you try this already?

520jefferson commented 6 years ago

Thanks, I will try it. I have already validated Marian (the amun type) on my corpus: the results are nearly the same as Nematus, even a bit higher in the non-deep setting. Next I will try the s2s type.