Got stuck while training summarization model, both on brnn and transformer.

OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch

https://opennmt.net/

MIT License


Got stuck while training summarization model, both on brnn and transformer. #1487

Status: Closed (closed by JasonCopper 5 years ago)

JasonCopper commented 5 years ago

I want to train a summarization model with 4 GPUs (CUDA version 10.1), but OpenNMT gets stuck and does not report anything. Also, I cannot kill the process.

JasonCopper commented 5 years ago

No progress after printing "[2019-07-02 14:49:15,642 INFO] number of examples: 100000"

vince62s commented 5 years ago

You need to post your command line.

JasonCopper commented 5 years ago

Thanks. My command line is as follows. I am using multiple GPUs to train the model, but OpenNMT seems to get stuck in the multi-GPU setting. In the single-GPU setting, it trains for ~5000 steps and then gets stuck.

nohup python -u train.py -data data/processed/ -save_model model/v0 -layers 4 -rnn_size 512 -word_vec_size 512 -max_grad_norm 0 -optim adam -encoder_type transformer -decoder_type transformer -position_encoding -dropout 0.2 -param_init 0 -warmup_steps 8000 -learning_rate 2 -decay_method noam -label_smoothing 0.1 -adam_beta2 0.998 -batch_size 4096 -batch_type tokens -max_generator_batches 2 -train_steps 400000 -accum_count 4 -share_embeddings -copy_attn -param_init_glorot -world_size 3 -gpu_ranks 0,1,2 -report_every 100 > google.log & 2>&1

vince62s commented 5 years ago

When it gets stuck, if you open a window with "watch nvidia-smi", do you see the 3 or 4 GPUs at 100%? (Your command line suggests 3 GPUs, while the screenshot shows 4.)
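For anyone who wants to log the same check instead of watching it by eye, here is a minimal sketch that reads per-GPU utilization through nvidia-smi's CSV query mode. The function names and the 5% idle threshold are illustrative assumptions and are not part of OpenNMT-py.

```python
import subprocess


def parse_gpu_utilization(csv_text):
    """Parse lines such as '0, 97' into {gpu_index: utilization_percent}."""
    utilization = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue
        try:
            utilization[int(fields[0])] = int(fields[1])
        except ValueError:
            # Some devices report a non-numeric value; skip those lines.
            continue
    return utilization


def query_gpu_utilization():
    """Call nvidia-smi once and return {gpu_index: utilization_percent}."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(result.stdout)


def idle_gpus(utilization, threshold=5):
    """Return indices of GPUs below `threshold` percent (threshold is an assumed value)."""
    return sorted(index for index, used in utilization.items() if used < threshold)


# Usage on a machine with an NVIDIA driver installed:
#     print(idle_gpus(query_gpu_utilization()))
```

If every training GPU shows up as idle while the process is still alive, that points toward the job waiting on something (data loading or inter-process communication) rather than computing.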

JasonCopper commented 5 years ago

It does not reach 100%. The following screenshot shows a training process on a single GPU; it is stuck too. Training has made no progress since yesterday.

JasonCopper commented 5 years ago

Is there a deadlock in the summarization code?
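One generic way to investigate a suspected deadlock in a Python process is the standard-library faulthandler module, which can dump the stack of every thread on a signal or after a timeout. The sketch below is an assumption about how one might instrument a script such as train.py; the helper name install_hang_diagnostics is hypothetical and nothing here is part of OpenNMT-py.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(log_file=None, timeout_s=None):
    """Make a hung process report where each thread is stuck.

    log_file must be a real file object with a file descriptor.
    """
    if log_file is None:
        log_file = sys.stderr
    # Dump tracebacks on fatal errors such as segmentation faults.
    faulthandler.enable(file=log_file)
    # On Unix, `kill -USR1 <pid>` then prints all thread stacks without
    # stopping the process. faulthandler.register is unavailable on Windows.
    if hasattr(faulthandler, "register"):
        faulthandler.register(signal.SIGUSR1, file=log_file, all_threads=True)
    # Optionally dump stacks every timeout_s seconds.
    if timeout_s is not None:
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


# Usage: call install_hang_diagnostics() near the top of the training script.
```

Each worker process would need its own call, since the handlers are installed per process. The dumped stacks show which call each thread is blocked in, for example a queue read or a semaphore acquire.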

JasonCopper commented 5 years ago

But the model keeps updating. I am confused by the log.

1-800-BAD-CODE commented 5 years ago

I had a similar problem, and it was due to the combination of PyTorch 1.0 and the producer/consumer update to onmt (the docs still say 'should work with 1.0'... it will not).
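A quick way to rule this out is to compare the installed PyTorch version against a minimum before starting a long run. The sketch below is only illustrative: the thread says 1.0 does not work but does not say which release does, so ASSUMED_MINIMUM is a placeholder to replace with whatever the current OpenNMT-py requirements state.

```python
# Assumption: placeholder minimum, not confirmed anywhere in this thread.
ASSUMED_MINIMUM = "1.1.0"


def parse_version(version):
    """Turn strings like '1.0.1.post2' or '1.1.0+cu101' into (major, minor, patch)."""
    core = version.split("+")[0]
    parts = []
    for piece in core.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '1.1' compares equal to '1.1.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(version, minimum):
    """Return True if `version` is the same as or newer than `minimum`."""
    return parse_version(version) >= parse_version(minimum)


# Usage in the training environment:
#     import torch
#     print(torch.__version__, is_at_least(torch.__version__, ASSUMED_MINIMUM))
```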

JasonCopper commented 5 years ago

> I had a similar problem, and it was due to the combination of PyTorch 1.0 and the producer/consumer update to onmt (the docs still say 'should work with 1.0'... it will not).

Thanks for your info. So it works in both single-GPU and multi-GPU settings, right? It is just a logging issue?