Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

Test error: Distributed call failed in min-dep-os #6696

Open mingxin-zheng opened 1 year ago

mingxin-zheng commented 1 year ago

Describe the bug

/Users/runner/work/MONAI/MONAI/monai/transforms/io/array.py:213: UserWarning: required package for reader PILReader is not installed, or the version doesn't match requirement.
  warnings.warn(
/Users/runner/work/MONAI/MONAI/monai/transforms/io/array.py:213: UserWarning: required package for reader ITKReader is not installed, or the version doesn't match requirement.
  warnings.warn(
/Users/runner/work/MONAI/MONAI/monai/transforms/io/array.py:213: UserWarning: required package for reader NrrdReader is not installed, or the version doesn't match requirement.
  warnings.warn(
/Users/runner/work/MONAI/MONAI/monai/transforms/io/array.py:213: UserWarning: required package for reader PydicomReader is not installed, or the version doesn't match requirement.
  warnings.warn(
/Users/runner/work/MONAI/MONAI/monai/transforms/utils.py:561: UserWarning: Num foregrounds 27, Num backgrounds 0, unable to generate class balanced samples, setting `pos_ratio` to 1.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/runner/work/MONAI/MONAI/tests/utils.py", line 541, in _wrapper
    assert results.get(), "Distributed call failed."
AssertionError: Distributed call failed.

To Reproduce

https://github.com/Project-MONAI/MONAI/actions/runs/5455742504/jobs/9927617836?pr=6623

Expected behavior

The test should pass.


wyli commented 1 year ago

Root cause seems to be the GitHub CI runner: gloo cannot resolve the runner's own hostname.

test_even (tests.test_sampler_dist.DistributedSamplerTest) ... ok
Process SpawnProcess-80:
Traceback (most recent call last):
  File "/Users/runner/work/MONAI/MONAI/tests/utils.py", line 505, in run_process
    raise e
  File "/Users/runner/work/MONAI/MONAI/tests/utils.py", line 489, in run_process
    dist.init_process_group(
  File "/Users/runner/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "/Users/runner/hostedtoolcache/Python/3.8.17/x64/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1009, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: [enforce fail at /Users/runner/work/pytorch/pytorch/pytorch/third_party/gloo/gloo/transport/uv/device.cc:153] rp != nullptr. Unable to find address for: Mac-1688480011779.local
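The gloo transport resolves the machine's hostname to pick a listen address, so when the macOS runner's generated hostname (`Mac-1688480011779.local`) does not resolve, `ProcessGroupGloo` fails as above. One possible mitigation, sketched below under the assumption that the rendezvous can be pinned to loopback, is to pass an explicit `tcp://127.0.0.1` `init_method` and set `GLOO_SOCKET_IFNAME` (a real gloo environment variable; `lo0` is the loopback interface on macOS). The helper name `loopback_gloo_kwargs` is hypothetical, not part of the MONAI test utilities:

```python
import os


def loopback_gloo_kwargs(port: int, rank: int, world_size: int) -> dict:
    """Build kwargs for torch.distributed.init_process_group that avoid
    hostname resolution by pinning the rendezvous to 127.0.0.1.

    Hypothetical helper: GLOO_SOCKET_IFNAME tells gloo which network
    interface to bind ("lo0" is the loopback interface on macOS).
    """
    os.environ.setdefault("GLOO_SOCKET_IFNAME", "lo0")
    return {
        "backend": "gloo",
        "init_method": f"tcp://127.0.0.1:{port}",
        "rank": rank,
        "world_size": world_size,
    }


# Usage (requires torch):
#   import torch.distributed as dist
#   dist.init_process_group(**loopback_gloo_kwargs(29500, rank=0, world_size=1))
```

Whether this avoids the `rp != nullptr` failure on the hosted runners would need to be verified in CI; it only sidesteps the hostname lookup, not any other networking restriction on the runner.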

mingxin-zheng commented 1 year ago

What should the next steps be?

wyli commented 1 year ago

Let's keep this open; currently, manually rerunning the pipelines clears the error in most cases. If it becomes frequent, we can remove the multiprocess tests on macOS.
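If skipping does become necessary, the standard `unittest` skip decorator can gate the distributed tests by platform. A minimal sketch (the class below is a hypothetical stand-in for the real `DistributedSamplerTest` in `tests/test_sampler_dist.py`):

```python
import sys
import unittest


@unittest.skipIf(
    sys.platform == "darwin",
    "gloo hostname resolution is flaky on macOS CI runners",
)
class DistributedSamplerTest(unittest.TestCase):
    # Hypothetical placeholder body; the real test spawns worker
    # processes and calls dist.init_process_group.
    def test_even(self):
        self.assertTrue(True)
```

On macOS the whole class is reported as skipped rather than failed, so reruns are no longer needed, at the cost of losing distributed-test coverage on that platform.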