Project-MONAI / model-zoo

MONAI Model Zoo hosts models in the MONAI Bundle format.

NCCL timeout error in tumor detection and vista3d #685

Closed: KumoLiu closed this issue 3 days ago

KumoLiu commented 1 week ago
INFO:__main__:Executing export PYTHONPATH=$PYTHONPATH:/workspace/bundles/monai_pathology_tumor_detection_v0.6.0 && echo $PYTHONPATH && torchrun --standalone --nnodes=1 --nproc_per_node=2 -m monai.bundle run --meta_file /workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/metadata.json --config_file "['/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/train.json', '/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/multi_gpu_train.json']" --logging_file /workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/logging.conf 2>&1 | tee running.log
:/workspace/bundles/monai_pathology_tumor_detection_v0.6.0
W0930 05:40:18.609000 132162098529408 torch/distributed/run.py:793]
W0930 05:40:18.609000 132162098529408 torch/distributed/run.py:793] *****************************************
W0930 05:40:18.609000 132162098529408 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0930 05:40:18.609000 132162098529408 torch/distributed/run.py:793] *****************************************
2024-09-30 05:40:24,882 - INFO - --- input summary of monai.bundle.scripts.run ---
2024-09-30 05:40:24,883 - INFO - > config_file: ['/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/train.json',
 '/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/multi_gpu_train.json']
2024-09-30 05:40:24,883 - INFO - > meta_file: '/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/metadata.json'
2024-09-30 05:40:24,883 - INFO - > logging_file: '/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/logging.conf'
2024-09-30 05:40:24,883 - INFO - ---

2024-09-30 05:40:24,883 - INFO - Setting logging properties based on config: /workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/logging.conf.
2024-09-30 05:40:24,885 - py.warnings - WARNING - Detected deprecated name 'optional_packages_version' in configuration file, replacing with 'required_packages_version'.

2024-09-30 05:40:24,895 - INFO - --- input summary of monai.bundle.scripts.run ---
2024-09-30 05:40:24,895 - INFO - > config_file: ['/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/train.json',
 '/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/multi_gpu_train.json']
2024-09-30 05:40:24,895 - INFO - > meta_file: '/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/metadata.json'
2024-09-30 05:40:24,895 - INFO - > logging_file: '/workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/logging.conf'
2024-09-30 05:40:24,895 - INFO - ---

2024-09-30 05:40:24,895 - INFO - Setting logging properties based on config: /workspace/bundles/monai_pathology_tumor_detection_v0.6.0/configs/logging.conf.
2024-09-30 05:40:24,897 - py.warnings - WARNING - Detected deprecated name 'optional_packages_version' in configuration file, replacing with 'required_packages_version'.

2024-09-30 05:40:27,091 - py.warnings - WARNING - `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.

2024-09-30 05:40:27,091 - ignite.engine.engine.SupervisedTrainer - INFO - Engine run resuming from iteration 0, epoch 0 until 2 epochs
2024-09-30 05:40:27,091 - py.warnings - WARNING - `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.

2024-09-30 05:40:43,485 - py.warnings - WARNING - `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.

[rank0]:[E930 05:50:46.378881785 ProcessGroupNCCL.cpp:603] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, NumelIn=11177025, NumelOut=11177025, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
[rank0]:[E930 05:50:46.379064643 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 7, last enqueued NCCL work: 7, last completed NCCL work: 6.
[rank0]:[E930 05:50:47.276529829 ProcessGroupNCCL.cpp:1756] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 7, last enqueued NCCL work: 7, last completed NCCL work: 6.
[rank0]:[E930 05:50:47.276538249 ProcessGroupNCCL.cpp:617] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E930 05:50:47.276541209 ProcessGroupNCCL.cpp:623] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E930 05:50:47.277619072 ProcessGroupNCCL.cpp:1560] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, NumelIn=11177025, NumelOut=11177025, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:605 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75505bd97648 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10a665e (0x75500265b65e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b9 (0x755002666a59 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75500266ff33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x755002671d0d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdc253 (0x75505dcb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x94ac3 (0x75505f921ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #7: clone + 0x44 (0x75505f9b2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7, OpType=ALLREDUCE, NumelIn=11177025, NumelOut=11177025, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:605 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75505bd97648 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10a665e (0x75500265b65e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b9 (0x755002666a59 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75500266ff33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x755002671d0d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdc253 (0x75505dcb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x94ac3 (0x75505f921ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #7: clone + 0x44 (0x75505f9b2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1566 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x75505bd97648 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10a665e (0x75500265b65e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd48eb4 (0x7550022fdeb4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x75505dcb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x75505f921ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x75505f9b2a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W0930 05:50:47.507000 132162098529408 torch/distributed/elastic/multiprocessing/api.py:891] Sending process 1697 closing signal SIGTERM
E0930 05:50:47.509000 132162098529408 torch/distributed/elastic/multiprocessing/api.py:863] failed (exitcode: -6) local_rank: 0 (pid: 1696) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.5.0a0+872d972e41.nv24.8.1', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
monai.bundle FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-30_05:50:47
  host      : 4u4g-0040.ipp2u1.colossus.nvidia.com
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 1696)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1696
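
For context, the tumor-detection run dies on the NCCL watchdog: the ALLREDUCE at SeqNum=7 waited the default 600000 ms, which means one of the two ranks never reached that collective (rank 1 only received SIGTERM after rank 0 aborted). A minimal diagnostic sketch, not part of the bundle configs, assuming the process group is created manually under torchrun rather than by multi_gpu_train.json: raise the process-group timeout and turn on verbose distributed logging so the stalled rank can be identified (e.g. a very slow first iteration from WSI loading versus a genuine hang).

```python
# Diagnostic sketch only (not the bundle's actual setup); run under torchrun
# so RANK/WORLD_SIZE/MASTER_ADDR are already in the environment.
import datetime
import os

import torch.distributed as dist

# Verbose NCCL and c10d logging; both variables are read at process start.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")

# The default watchdog timeout is 600000 ms (10 min), matching the log above.
# Extending it distinguishes a slow first collective from a real deadlock.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=60))
```
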
KumoLiu commented 1 week ago

Vista3d:

2024-09-30 08:36:19,079 - INFO - --- input summary of monai.bundle.scripts.run ---
2024-09-30 08:36:19,079 - INFO - > config_file: ['/workspace/bundles/monai_vista3d_v0.5.2/configs/train.json',
 '/workspace/bundles/monai_vista3d_v0.5.2/configs/multi_gpu_train.json']
2024-09-30 08:36:19,079 - INFO - > meta_file: '/workspace/bundles/monai_vista3d_v0.5.2/configs/metadata.json'
2024-09-30 08:36:19,079 - INFO - > logging_file: '/workspace/bundles/monai_vista3d_v0.5.2/configs/logging.conf'
2024-09-30 08:36:19,079 - INFO - ---

2024-09-30 08:36:19,079 - INFO - Setting logging properties based on config: /workspace/bundles/monai_vista3d_v0.5.2/configs/logging.conf.
2024-09-30 08:36:19,132 - INFO - --- input summary of monai.bundle.scripts.run ---
2024-09-30 08:36:19,132 - INFO - > config_file: ['/workspace/bundles/monai_vista3d_v0.5.2/configs/train.json',
 '/workspace/bundles/monai_vista3d_v0.5.2/configs/multi_gpu_train.json']
2024-09-30 08:36:19,132 - INFO - > meta_file: '/workspace/bundles/monai_vista3d_v0.5.2/configs/metadata.json'
2024-09-30 08:36:19,132 - INFO - > logging_file: '/workspace/bundles/monai_vista3d_v0.5.2/configs/logging.conf'
2024-09-30 08:36:19,132 - INFO - ---

2024-09-30 08:36:19,133 - INFO - Setting logging properties based on config: /workspace/bundles/monai_vista3d_v0.5.2/configs/logging.conf.
[rank0]:[E930 08:46:20.607101585 ProcessGroupNCCL.cpp:603] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600046 milliseconds before timing out.
[rank0]:[E930 08:46:20.607216035 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E930 08:46:20.649393447 ProcessGroupNCCL.cpp:603] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank1]:[E930 08:46:20.649548966 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/utils/module.py", line 243, in instantiate
[rank0]:     return component(**kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank0]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 287, in _verify_param_shape_across_processes
[rank0]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank0]: RuntimeError: DDP expects same model across all ranks, but Rank 0 has 258 params, while rank 1 has inconsistent 0 params.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/__main__.py", line 31, in <module>
[rank0]:     fire.Fire()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/scripts.py", line 1010, in run
[rank0]:     workflow.run()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/workflows.py", line 363, in run
[rank0]:     return self._run_expr(id=self.run_id)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/workflows.py", line 397, in _run_expr
[rank0]:     return self.parser.get_parsed_content(id, **kwargs) if id in self.parser else None
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/config_parser.py", line 290, in get_parsed_content
[rank0]:     return self.ref_resolver.get_resolved_content(id=id, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 193, in get_resolved_content
[rank0]:     return self._resolve_one_item(id=id, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]:     self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]:     self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]:     self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 171, in _resolve_one_item
[rank0]:     self.resolved_content[id] = item.instantiate() if kwargs.get("instantiate", True) else item
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/bundle/config_item.py", line 292, in instantiate
[rank0]:     return instantiate(modname, mode, **args)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/monai/utils/module.py", line 253, in instantiate
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: Failed to instantiate component 'torch.nn.parallel.DistributedDataParallel' with keywords: module,find_unused_parameters,device_ids
[rank0]:  set '_mode_=debug' to enter the debugging mode.
[rank0]:[E930 08:46:20.772669427 ProcessGroupNCCL.cpp:1756] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E930 08:46:20.772681247 ProcessGroupNCCL.cpp:617] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E930 08:46:20.772684547 ProcessGroupNCCL.cpp:623] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E930 08:46:20.773777549 ProcessGroupNCCL.cpp:1560] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600046 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:605 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7803ffef6648 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10a665e (0x7803a685b65e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b9 (0x7803a6866a59 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7803a686ff33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7803a6871d0d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdc253 (0x780401eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x94ac3 (0x780403aa5ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #7: clone + 0x44 (0x780403b36a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
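
The VISTA3D trace points at a different root cause than the tumor-detection one: the very first collective (ALLGATHER, SeqNum=1) is DDP's own parameter verification, and it fails with "Rank 0 has 258 params, while rank 1 has inconsistent 0 params". That suggests rank 1 constructed an empty module before the DistributedDataParallel wrap (for example, a config branch or weight-loading step that only ran on rank 0), and the NCCL timeout is just the downstream symptom. A minimal sketch of the kind of cross-rank check involved, assuming a CUDA/NCCL process group is already initialized (this helper is illustrative, not part of the bundle):

```python
import torch
import torch.distributed as dist


def check_param_counts(model: torch.nn.Module) -> None:
    # Gather every rank's parameter count before wrapping the model in DDP;
    # mismatched counts reproduce the _verify_param_shape_across_processes
    # failure seen in the log above.
    count = torch.tensor([sum(1 for _ in model.parameters())], device="cuda")
    counts = [torch.zeros_like(count) for _ in range(dist.get_world_size())]
    dist.all_gather(counts, count)
    if dist.get_rank() == 0:
        print("per-rank parameter counts:", [int(c.item()) for c in counts])
```
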