Closed JasonCopper closed 5 years ago
No progress after printing "[2019-07-02 14:49:15,642 INFO] number of examples: 100000"
you need to post your command line.
Thanks. My command line is as follows. I use multiple GPUs to train a model, but it seems OpenNMT gets stuck in the multi-GPU setting. In the single-GPU setting, it trains for ~5000 steps and then gets stuck. nohup python -u train.py -data data/processed/ -save_model model/v0 -layers 4 -rnn_size 512 -word_vec_size 512 -max_grad_norm 0 -optim adam -encoder_type transformer -decoder_type transformer -position_encoding -dropout 0.2 -param_init 0 -warmup_steps 8000 -learning_rate 2 -decay_method noam -label_smoothing 0.1 -adam_beta2 0.998 -batch_size 4096 -batch_type tokens -max_generator_batches 2 -train_steps 400000 -accum_count 4 -share_embeddings -copy_attn -param_init_glorot -world_size 3 -gpu_ranks 0,1,2 -report_every 100 > google.log & 2>&1
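One shell detail worth noting (unrelated to the hang, but it affects the log): with "> google.log & 2>&1" the "2>&1" is executed as a separate command after the process is backgrounded, so stderr is not captured in google.log. The usual form puts the redirection before the ampersand:
nohup python -u train.py ... -report_every 100 > google.log 2>&1 &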
When it gets stuck, if you open a window with "watch nvidia-smi", do you see the 3 or 4 GPUs (your command line suggests 3 GPUs while the screenshot says 4) at 100%?
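For example, to refresh the GPU utilization readout every second (a plain watch/nvidia-smi invocation, nothing OpenNMT-specific):
watch -n 1 nvidia-smi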
It does not reach 100%. The following screenshot shows a training process with a single GPU; it is stuck as well. Training has made no progress since yesterday.
Is there a deadlock in the summarization code?
But the model keeps updating. I am confused by the log.
I had a similar problem and it was due to a combination of PyTorch 1.0 and the producer/consumer update to onmt (the docs still say it 'should work with 1.0'... it will not).
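If you want to verify which PyTorch version is installed (a quick check from the shell, assuming a standard install):
python -c "import torch; print(torch.__version__)"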
Thanks for your info. So it works in single- and multi-GPU settings, right? It's just a logging issue?
I want to train a summarization model with 4 GPUs (CUDA Version 10.1). OpenNMT got stuck and did not report anything. Also, I cannot kill the process.