NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Horovod training issue #311

Closed fminkin closed 5 years ago

fminkin commented 5 years ago

I'm trying to launch some STT models on an 8-GPU Horovod setup, but I'm getting an error at this line: https://github.com/NVIDIA/OpenSeq2Seq/blob/master/open_seq2seq/utils/funcs.py#L120

TF version: 1.12.0

  File "run.py", line 93, in <module>
    main()
  File "run.py", line 77, in main
    train(model[0], model[1], debug_port=args.debug_port)
  File "/place/home/f-minkin/nvidia/OpenSeq2Seq/open_seq2seq/utils/funcs.py", line 120, in train
    if load_model_dir or tf.train.latest_checkpoint(checkpoint_dir):
  File "/place/home/f-minkin/seq2seq/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_management.py", line 331, in latest_checkpoint
    ckpt = get_checkpoint_state(checkpoint_dir, latest_filename)
  File "/place/home/f-minkin/seq2seq/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_management.py", line 261, in get_checkpoint_state
    latest_filename)
  File "/place/home/f-minkin/seq2seq/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_management.py", line 55, in _GetCheckpointFilename
    return os.path.join(save_dir, latest_filename)
  File "/place/home/f-minkin/seq2seq/lib/python3.6/posixpath.py", line 78, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

Would changing the line to the following fix the issue?

    if load_model_dir or (checkpoint_dir and tf.train.latest_checkpoint(checkpoint_dir)):
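The guard proposed above can be sketched without TensorFlow. The function `latest_checkpoint_stub` below is a hypothetical stand-in for `tf.train.latest_checkpoint`: it only mimics the `os.path.join(save_dir, latest_filename)` call seen in the traceback and returns the state-file path rather than a real checkpoint prefix. `should_restore` is likewise an illustrative name, not a function from OpenSeq2Seq.

```python
import os
import tempfile


def latest_checkpoint_stub(checkpoint_dir):
    """Hypothetical stand-in for tf.train.latest_checkpoint.

    Like the call in the traceback, it joins the directory with the
    "checkpoint" state-file name, so a None directory raises TypeError.
    """
    state_file = os.path.join(checkpoint_dir, "checkpoint")
    return state_file if os.path.exists(state_file) else None


def should_restore(load_model_dir, checkpoint_dir):
    """Guarded condition: skip the lookup entirely when the dir is None."""
    return bool(
        load_model_dir
        or (checkpoint_dir and latest_checkpoint_stub(checkpoint_dir))
    )


if __name__ == "__main__":
    # Unguarded call with None reproduces the TypeError from the report.
    try:
        latest_checkpoint_stub(None)
    except TypeError as exc:
        print("unguarded:", exc)

    # Guarded call short-circuits before the lookup.
    print("guarded, no dir:", should_restore(None, None))

    with tempfile.TemporaryDirectory() as tmp:
        print("guarded, empty dir:", should_restore(None, tmp))
```

Because `and` short-circuits, the lookup never runs when `checkpoint_dir` is `None`, which is the case reported for this traceback.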

borisgin commented 5 years ago

Right, the model checkpoint was not found in the directory. You can change the directory name or copy the checkpoint there.

d2sys commented 5 years ago

I have the same issue: I'm starting training from scratch and get the same error.

fminkin commented 5 years ago

I'm launching training from scratch. The issue here is that None is passed to tf.train.latest_checkpoint() if the process is not the master. Single-GPU training works just fine.
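As a rough illustration of why only multi-GPU runs are affected: a common convention in Horovod training scripts is to give a checkpoint directory to rank 0 only, so that the other workers do not write checkpoints. The sketch below assumes that convention. The helper `checkpoint_dir_for_rank` and the path `/tmp/logs` are hypothetical, and this is not taken from the OpenSeq2Seq source.

```python
def checkpoint_dir_for_rank(rank, logdir):
    """Return the checkpoint directory for a worker (assumed convention).

    Only the master (rank 0) gets a directory; every other rank gets None.
    """
    return logdir if rank == 0 else None


if __name__ == "__main__":
    # With 8 workers, 7 of them would pass None to the checkpoint lookup.
    for rank in range(8):
        print(rank, checkpoint_dir_for_rank(rank, "/tmp/logs"))
```

Under this assumption, a single-GPU run has only rank 0, so the directory is always set and the lookup never sees `None`.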

d2sys commented 5 years ago

To fix it, change line 120 of utils/funcs.py to

    if load_model_dir is not None or checkpoint_dir is not None:

instead of

    if load_model_dir or tf.train.latest_checkpoint(checkpoint_dir):

After that it works fine!
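The two fixes suggested in this thread are not equivalent, which a small truth table makes visible. In this sketch the boolean `has_checkpoint` stands in for a non-empty result from `tf.train.latest_checkpoint`. The function names are illustrative only, and whether the difference matters depends on the code inside the branch, which is not shown in this thread.

```python
def cond_guarded(load_model_dir, checkpoint_dir, has_checkpoint):
    """Variant that still requires an existing checkpoint in the directory."""
    return bool(load_model_dir or (checkpoint_dir and has_checkpoint))


def cond_not_none(load_model_dir, checkpoint_dir):
    """Variant that only checks whether either directory is set."""
    return load_model_dir is not None or checkpoint_dir is not None


if __name__ == "__main__":
    cases = [
        # (description, load_model_dir, checkpoint_dir, has_checkpoint)
        ("non-master worker", None, None, False),
        ("master, empty dir", None, "/tmp/logs", False),
        ("master, checkpoint present", None, "/tmp/logs", True),
    ]
    for name, load_dir, ckpt_dir, has_ckpt in cases:
        print(
            name,
            "| guarded:", cond_guarded(load_dir, ckpt_dir, has_ckpt),
            "| not-None:", cond_not_none(load_dir, ckpt_dir),
        )
```

Both variants avoid the `TypeError` on non-master workers. They differ on the master when training from scratch: the not-None check is true for an empty checkpoint directory, while the guarded check is false.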

vsl9 commented 5 years ago

Please try the latest master branch.