NVlabs / LSGM

The Official PyTorch Implementation of "LSGM: Score-based Generative Modeling in Latent Space" (NeurIPS 2021)

RuntimeError: Address already in use #2

Closed pkulwj1994 closed 2 years ago

pkulwj1994 commented 2 years ago

I am trying to run train_vada.py in Colab, but I get the error in the title.

$ python train_vada.py

The full error message looks like this:

No Apex Available. Using PyTorch's native Adam. Install Apex for faster training.
Experiment dir : /tmp/nvae-diff/expr/exp
starting in debug mode
Traceback (most recent call last):
  File "train_vada.py", line 512, in <module>
    utils.init_processes(0, size, main, args)
  File "/content/util/utils.py", line 689, in init_processes
    dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=size)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

I have checked some common issues and found that this error is often caused by the torch distributed settings. How can I fix it? Thanks!

arash-vahdat commented 2 years ago

This happens when another process is using the port that we would use for multi-GPU communication. It can also be caused by a previous LSGM run that is still hanging.
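To confirm that the port really is occupied before killing anything, a small check along these lines may help. This is only a sketch: `port_in_use` is a hypothetical helper, not part of LSGM, and it assumes the rendezvous happens on the local machine (as it would in a single Colab session).

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if some process is already listening on host:port.

    Hypothetical helper for illustration; not part of the LSGM code base.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds,
        # i.e. when something is listening on that port.
        return s.connect_ex((host, port)) == 0
```

Call it with whatever port number is configured in util/utils.py; if it returns True before training starts, another process (possibly a stale run) holds the port.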

You can try pkill python to kill all running python processes that might be using the port (use it carefully if your machine is shared with other users). Or, you can simply change the port number at this line: https://github.com/NVlabs/LSGM/blob/5eae2f385c014f2250c3130152b6be711f6a3a5a/util/utils.py#L687
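If you would rather not hard-code a new port, one option is to ask the operating system for a free one and export it as `MASTER_PORT`, the environment variable that PyTorch's `env://` initialization reads. The sketch below assumes a single-node run; `find_free_port` and `set_master_port` are hypothetical helpers, not functions from this repository. If the linked line in util/utils.py assigns the port itself, it would overwrite this value, so the change would have to be made there.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is currently unused (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


def set_master_port(port=None) -> str:
    """Export MASTER_PORT for torch.distributed's env:// rendezvous.

    Picks a free port when none is given. Call this once, in the parent
    process, so that every rank sees the same value.
    """
    if port is None:
        port = find_free_port()
    os.environ["MASTER_PORT"] = str(port)
    return os.environ["MASTER_PORT"]
```

Note that the port is released before training starts, so another process could in principle grab it in between; for a single-user Colab session this is unlikely to matter.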