Closed by KumoLiu 3 days ago
Vista3d:
2024-09-30 08:36:19,079 - INFO - --- input summary of monai.bundle.scripts.run ---
2024-09-30 08:36:19,079 - INFO - > config_file: ['/workspace/bundles/monai_vista3d_v0.5.2/configs/train.json',
'/workspace/bundles/monai_vista3d_v0.5.2/configs/multi_gpu_train.json']
2024-09-30 08:36:19,079 - INFO - > meta_file: '/workspace/bundles/monai_vista3d_v0.5.2/configs/metadata.json'
2024-09-30 08:36:19,079 - INFO - > logging_file: '/workspace/bundles/monai_vista3d_v0.5.2/configs/logging.conf'
2024-09-30 08:36:19,079 - INFO - ---
2024-09-30 08:36:19,079 - INFO - Setting logging properties based on config: /workspace/bundles/monai_vista3d_v0.5.2/configs/logging.conf.
2024-09-30 08:36:19,132 - INFO - --- input summary of monai.bundle.scripts.run ---
2024-09-30 08:36:19,132 - INFO - > config_file: ['/workspace/bundles/monai_vista3d_v0.5.2/configs/train.json',
'/workspace/bundles/monai_vista3d_v0.5.2/configs/multi_gpu_train.json']
2024-09-30 08:36:19,132 - INFO - > meta_file: '/workspace/bundles/monai_vista3d_v0.5.2/configs/metadata.json'
2024-09-30 08:36:19,132 - INFO - > logging_file: '/workspace/bundles/monai_vista3d_v0.5.2/configs/logging.conf'
2024-09-30 08:36:19,132 - INFO - ---
2024-09-30 08:36:19,133 - INFO - Setting logging properties based on config: /workspace/bundles/monai_vista3d_v0.5.2/configs/logging.conf.
[rank0]:[E930 08:46:20.607101585 ProcessGroupNCCL.cpp:603] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600046 milliseconds before timing out.
[rank0]:[E930 08:46:20.607216035 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E930 08:46:20.649393447 ProcessGroupNCCL.cpp:603] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600089 milliseconds before timing out.
[rank1]:[E930 08:46:20.649548966 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/utils/module.py", line 243, in instantiate
[rank0]: return component(**kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 822, in __init__
[rank0]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 287, in _verify_param_shape_across_processes
[rank0]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank0]: RuntimeError: DDP expects same model across all ranks, but Rank 0 has 258 params, while rank 1 has inconsistent 0 params.
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/__main__.py", line 31, in <module>
[rank0]: fire.Fire()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 143, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 477, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/scripts.py", line 1010, in run
[rank0]: workflow.run()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/workflows.py", line 363, in run
[rank0]: return self._run_expr(id=self.run_id)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/workflows.py", line 397, in _run_expr
[rank0]: return self.parser.get_parsed_content(id, **kwargs) if id in self.parser else None
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/config_parser.py", line 290, in get_parsed_content
[rank0]: return self.ref_resolver.get_resolved_content(id=id, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 193, in get_resolved_content
[rank0]: return self._resolve_one_item(id=id, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]: self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]: self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]: self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/reference_resolver.py", line 171, in _resolve_one_item
[rank0]: self.resolved_content[id] = item.instantiate() if kwargs.get("instantiate", True) else item
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/bundle/config_item.py", line 292, in instantiate
[rank0]: return instantiate(modname, mode, **args)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/monai/utils/module.py", line 253, in instantiate
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: Failed to instantiate component 'torch.nn.parallel.DistributedDataParallel' with keywords: module,find_unused_parameters,device_ids
[rank0]: set '_mode_=debug' to enter the debugging mode.
[rank0]:[E930 08:46:20.772669427 ProcessGroupNCCL.cpp:1756] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E930 08:46:20.772681247 ProcessGroupNCCL.cpp:617] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E930 08:46:20.772684547 ProcessGroupNCCL.cpp:623] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E930 08:46:20.773777549 ProcessGroupNCCL.cpp:1560] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600046 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:605 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7803ffef6648 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10a665e (0x7803a685b65e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b9 (0x7803a6866a59 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7803a686ff33 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7803a6871d0d in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0xdc253 (0x780401eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x94ac3 (0x780403aa5ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #7: clone + 0x44 (0x780403b36a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)