NVlabs / VoxFormer

Official PyTorch implementation of VoxFormer [CVPR 2023 Highlight]

Stage 2 training issue: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) #57

Closed JinnnK closed 3 months ago

JinnnK commented 4 months ago

Hello,

Thank you for your continuous contributions to this excellent research.

I am writing to report an issue encountered while training stage 2: the run consistently fails with the error below after completing the first epoch. Despite various attempts to resolve it, including tuning samples_per_gpu and workers_per_gpu in the configuration and resetting data_root, there has been no progress.
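For reference, these fields live in the mmcv-style config used by this repo; a minimal sketch of the part I experimented with (the values and the data_root path are illustrative, the actual stage-2 config keys may differ):

```python
# Sketch of the relevant part of an mmcv-style config; other keys are omitted
# and the values shown are only illustrative.
data_root = './kitti/dataset/'   # placeholder path, not the exact one from the repo

data = dict(
    samples_per_gpu=1,   # per-GPU batch size
    workers_per_gpu=4,   # dataloader worker processes per GPU
)
```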

Initially, I suspected a VRAM shortage; however, the process only consumes about 15 GB of VRAM on a single GPU, so I believe the issue lies elsewhere.

Here are the specifications of my system:

Below is the error log:

[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 815/815, 5.1 task/s, elapsed: 159s, ETA:     0s
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 64877) of binary: /home/hri/.conda/envs/mm/bin/python
Traceback (most recent call last):
  File "/home/hri/.conda/envs/mm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/hri/.conda/envs/mm/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hri/.conda/envs/mm/lib/python3.8/site-packages/torch/distributed/run.py", line 723, in <module>
    main()
  File "/home/hri/.conda/envs/mm/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/hri/.conda/envs/mm/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/hri/.conda/envs/mm/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/hri/.conda/envs/mm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/hri/.conda/envs/mm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
./tools/train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-22_15:57:40
  host      : hri-System-Product-Name
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 64877)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 64877
======================================================
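For context, exitcode -9 matches the "Signal 9 (SIGKILL)" line in the root cause above; on Linux this usually means the kernel OOM killer terminated the process because host RAM ran out rather than VRAM. Below is a minimal monitoring sketch (assuming the psutil package is installed and a reasonably recent PyTorch with CUDA; it is not part of VoxFormer) that can be run in a separate terminal during training to tell host-RAM exhaustion apart from VRAM exhaustion:

```python
# Minimal memory watchdog; not part of VoxFormer. Assumes `pip install psutil`
# and PyTorch >= 1.11 for torch.cuda.mem_get_info. Run alongside training.
import time

import psutil
import torch


def log_memory(interval_s: float = 10.0) -> None:
    """Periodically print host RAM usage and per-GPU memory usage."""
    while True:
        vm = psutil.virtual_memory()
        print(f"host RAM: {vm.used / 1e9:.1f} / {vm.total / 1e9:.1f} GB ({vm.percent:.0f}%)")
        if torch.cuda.is_available():
            for i in range(torch.cuda.device_count()):
                free_b, total_b = torch.cuda.mem_get_info(i)
                print(f"GPU {i} VRAM: {(total_b - free_b) / 1e9:.1f} / {total_b / 1e9:.1f} GB")
        time.sleep(interval_s)


if __name__ == "__main__":
    log_memory()
```

If host RAM climbs toward the machine's total right before the crash, the OOM-killer explanation fits; on most systems `dmesg` (or the kernel log) should then show a corresponding "Out of memory: Killed process ..." entry for the training PID.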

I would appreciate any insights or suggestions you might have to help resolve this issue.

Thank you.

Samsara011 commented 1 month ago

Hello, I have also run into this problem recently. Do you have a solution? If so, could you share it?

JinnnK commented 1 month ago

@Samsara011

Hello, stage 2 requires at least 96 GB of memory, so you will need to upgrade your memory.

Samsara011 commented 1 month ago

@JinnnK Thanks for your reply! But why does the paper say it can be trained with 16 GB of video memory? Is it because four cards are used for training? That would not match the description in the paper. Looking forward to your reply!

JinnnK commented 1 month ago

@Samsara011 I mean 96 GB of system RAM, not VRAM. For VRAM, it requires less than 18 GB.
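For anyone hitting the same thing, a quick pre-flight sketch (assuming psutil is installed; the 96 GB and 18 GB figures are the ones reported in this thread, not official numbers from the paper) to check both budgets before launching stage 2:

```python
# Pre-flight check of host RAM vs per-GPU VRAM; assumes `pip install psutil`.
# The thresholds below come from this discussion, not from the VoxFormer paper.
import psutil
import torch

REQUIRED_RAM_GB = 96    # host RAM reported as needed for stage 2
REQUIRED_VRAM_GB = 18   # per-GPU VRAM reported as an upper bound for stage 2

ram_gb = psutil.virtual_memory().total / 1e9
print(f"host RAM: {ram_gb:.0f} GB (reported requirement: ~{REQUIRED_RAM_GB} GB)")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        vram_gb = torch.cuda.get_device_properties(i).total_memory / 1e9
        print(f"GPU {i} VRAM: {vram_gb:.0f} GB (reported requirement: <{REQUIRED_VRAM_GB} GB)")
```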

Samsara011 commented 1 month ago

@JinnnK Oh, I see. Thank you for your reply!