Open mingxin-zheng opened 1 year ago
The root cause seems to be the GitHub CI runner:
test_even (tests.test_sampler_dist.DistributedSamplerTest) ... ok
Process SpawnProcess-80:
Traceback (most recent call last):
File "/Users/runner/work/MONAI/MONAI/tests/utils.py", line 505, in run_process
raise e
File "/Users/runner/work/MONAI/MONAI/tests/utils.py", line 489, in run_process
dist.init_process_group(
File "/Users/runner/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 907, in init_process_group
default_pg = _new_process_group_helper(
File "/Users/runner/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at /Users/runner/work/pytorch/pytorch/pytorch/third_party/gloo/gloo/transport/uv/device.cc:153] rp != nullptr. Unable to find address for: Mac-1688480011779.local
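The error says that Gloo could not find an address for the runner's own hostname. One way to check whether that is a name-resolution problem on the runner is a small standalone probe like the sketch below. It uses only the Python standard library. The helper name `hostname_resolves` is hypothetical and not part of MONAI or PyTorch, and Gloo's own lookup may differ in detail from `socket.getaddrinfo`.

```python
import socket


def hostname_resolves(hostname=None):
    """Return True if `hostname` resolves to at least one address.

    Hypothetical diagnostic helper (an assumption, not MONAI or PyTorch code).
    With no argument it checks the machine's own hostname, which is the name
    reported in the Gloo error message.
    """
    name = hostname if hostname is not None else socket.gethostname()
    try:
        socket.getaddrinfo(name, None)
    except OSError:
        # socket.gaierror is a subclass of OSError; treat any lookup failure
        # as "does not resolve".
        return False
    return True


if __name__ == "__main__":
    print(socket.gethostname(), "resolves:", hostname_resolves())
```

If this prints `False` on the macOS runner, the failure is probably in the runner's network configuration rather than in the test itself.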
Should we have any next steps?
Let's keep this open. Currently, in most cases, manually rerunning the pipelines clears the error. If it becomes frequent, we can remove the multiprocess tests on macOS.
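If the multiprocess tests do need to be disabled on macOS, a minimal sketch using the standard `unittest.skipIf` decorator might look like the following. The names `IS_MACOS`, `skip_if_macos`, and `ExampleDistributedTest` are hypothetical placeholders, not existing MONAI helpers. The real tests would keep their own classes and distributed-call decorators.

```python
import sys
import unittest

# Assumption: macOS is detected via sys.platform, which is "darwin" there.
IS_MACOS = sys.platform == "darwin"


def skip_if_macos(obj):
    """Skip a test function or class when running on macOS (hypothetical helper)."""
    return unittest.skipIf(IS_MACOS, "multiprocess tests are flaky on macOS CI runners")(obj)


class ExampleDistributedTest(unittest.TestCase):
    """Placeholder standing in for a multiprocess test such as DistributedSamplerTest."""

    @skip_if_macos
    def test_even(self):
        # The real test would spawn processes and call dist.init_process_group.
        self.assertEqual(2 % 2, 0)


if __name__ == "__main__":
    unittest.main()
```

Skipping this way keeps the tests running on Linux and Windows, and the skip reason shows up in the macOS test report instead of the tests silently disappearing.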
Describe the bug
To Reproduce
https://github.com/Project-MONAI/MONAI/actions/runs/5455742504/jobs/9927617836?pr=6623
Expected behavior
The test should pass.