MDIL-SNU / SevenNet

SevenNet - a graph neural network interatomic potential package supporting efficient multi-GPU parallel molecular dynamics simulations.
https://pubs.acs.org/doi/10.1021/acs.jctc.4c00190
GNU General Public License v3.0

Multi-GPU training crash: Rendezvous Timeout Error #63

Closed: turbosonics closed this issue 3 months ago

turbosonics commented 3 months ago

Hello,

I compiled SevenNet in a virtual environment with CUDA 11.8 and PyTorch 2.3.0 on our local GPU cluster. Training with a single GPU runs well, but when I attempt multi-GPU training, the job crashes with the following error:

Traceback (most recent call last):
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    result = agent.run()
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 870, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 548, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1116, in next_rendezvous
    self._op_executor.run(
  File "/home/venv_sevennet_zeusgpu_cuda118_pytorch230/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 661, in run
    raise RendezvousTimeoutError()
torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError

How can I resolve this crash?

YutackPark commented 3 months ago

After reading the docs (https://pytorch.org/docs/stable/elastic/rendezvous.html#torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError), I think the error is caused by your local cluster or by PyTorch itself. Sorry, I cannot help you in more detail. Multi-GPU training in SevenNet is not an experimental or prototype feature; we have used it rigorously, even in multi-node, multi-GPU setups.
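
To narrow this down independently of SevenNet, one option is a minimal torch.distributed smoke test launched with the same torchrun command on the same nodes. This is only a sketch: the file name ddp_smoke_test.py and the launch line in the comments are illustrative, not part of SevenNet, and the rendezvous endpoint/port are placeholders you would replace with your own. If this script also raises RendezvousTimeoutError, the problem is in the cluster or PyTorch setup (unreachable or firewalled rendezvous port, wrong --rdzv_endpoint, worker nodes that never start) rather than in SevenNet.

# ddp_smoke_test.py -- minimal torch.distributed check, independent of SevenNet.
# Launch with the same torchrun invocation used for training, e.g. (illustrative):
#   torchrun --nnodes=2 --nproc_per_node=4 --rdzv_backend=c10d \
#            --rdzv_endpoint=<master-host>:29500 ddp_smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each worker
    torch.cuda.set_device(local_rank)

    # Uses the env:// initialization that torchrun provides (RANK, WORLD_SIZE, ...).
    dist.init_process_group(backend="nccl")

    # One all_reduce to verify NCCL communication actually works across ranks.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()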

I recommend debugging with a small model, small batch size, and small dataset. Note that SevenNet uses only one CPU core (one task) per GPU.
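
Before a full debug run, it can also help to check what torchrun and the scheduler actually hand each worker. The small sketch below only prints each rank's torchrun environment and CPU affinity; rank_env_check.py is a hypothetical file name, and os.sched_getaffinity is Linux-only.

# rank_env_check.py -- print, per rank, the environment torchrun sets and the
# CPU cores the scheduler bound to this process.
# Example launch (illustrative): torchrun --standalone --nproc_per_node=2 rank_env_check.py
import os

rank = os.environ.get("RANK", "?")              # global rank, set by torchrun
local_rank = os.environ.get("LOCAL_RANK", "?")  # GPU index on this node
world_size = os.environ.get("WORLD_SIZE", "?")
cores = sorted(os.sched_getaffinity(0))         # CPU cores this process may use (Linux)

print(f"rank {rank} (local {local_rank}) of {world_size}: CPU cores {cores}")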