facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

tqdm progress bar raises RuntimeError "cannot join current thread" #369

Closed sotetsuk closed 5 years ago

sotetsuk commented 5 years ago

Hi,

I tried the IWSLT'14 machine translation example on the PyTorch Docker image pytorch/pytorch:0.4.1-cuda9-cudnn7-devel, and I got the following error at the end of training:

Exception ignored in: <bound method tqdm.__del__ of | epoch 002:  33%|▎| 368/1131 [01:03<02:02,  6.22it/s, loss=7.721, nll_loss=6.908, ppl=120.10, wps=21642, ups=5.8, wpb=3492, bsz=147, num_updates=1500, lr=0.000187563, gnorm=0.960, clip=0%, oom=0, wall=248, train_wall=225]>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tqdm/_tqdm.py", line 931, in __del__
    self.close()
  File "/opt/conda/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1133, in close
    self._decr_instances(self)
  File "/opt/conda/lib/python3.6/site-packages/tqdm/_tqdm.py", line 496, in _decr_instances
    cls.monitor.exit()
  File "/opt/conda/lib/python3.6/site-packages/tqdm/_monitor.py", line 52, in exit
    self.join()
  File "/opt/conda/lib/python3.6/threading.py", line 1053, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

I used --max-update 1500 in the above example, but I got the same error when I ran the script with --max-update 50000. The error does not occur when I use --log-format json.
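
The last frames of the traceback show tqdm's monitor calling join() and Python's threading module rejecting it. As a minimal sketch of that mechanism, using only the standard library: a thread that tries to join itself gets exactly this RuntimeError. The Monitor class below is a hypothetical stand-in, not tqdm's actual implementation, and it assumes the failure happens because exit() ends up running on the monitor thread itself.

```python
import threading


class Monitor(threading.Thread):
    """Hypothetical stand-in for a background monitor thread (not tqdm's code)."""

    def __init__(self):
        super().__init__(daemon=True)
        self.error = None  # filled in if joining fails

    def exit(self):
        # Joining is fine from another thread, but not from this thread itself.
        self.join()

    def run(self):
        # Assumption: cleanup is triggered while running on the monitor thread,
        # so exit() is effectively called by the thread it tries to join.
        try:
            self.exit()
        except RuntimeError as exc:
            self.error = str(exc)


monitor = Monitor()
monitor.start()
monitor.join()  # joining from the main thread works normally
print(monitor.error)
```

If that assumption holds, it would also fit the observation that --log-format json avoids the error, since no tqdm progress bar is created in that mode.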

I think these issues/PRs are related:

  1. https://github.com/tqdm/tqdm/issues/613
  2. https://github.com/tqdm/tqdm/pull/641

In particular, the second PR may resolve this.

Thanks,

myleott commented 5 years ago

This seems like an issue for tqdm, not fairseq, right? Does it work if you update tqdm?
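
Before and after updating, it can help to confirm which tqdm version the environment actually has. A small sketch, assuming Python 3.8 or newer (for importlib.metadata); the helper name installed_version is made up for illustration:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the tqdm version in the current environment, or None if tqdm is missing.
print(installed_version("tqdm"))
```

Running this inside the Docker image before and after the upgrade shows whether the update took effect in the interpreter that runs training.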