pwrightkcl opened this issue 1 year ago
Has anyone seen this? I can add the configs I used if needed.
Hi @pwrightkcl, did you try MONAI with the latest version? The Auto3DSeg modules have supported multiple GPUs since MONAI 1.2.
I tried making sure my Docker image has MONAI 1.3 and reran, but got the same error. I can post the debug info and error log again if you like. The first time, I didn't realise I needed "multi_gpu": True in the config, but I added that and still no joy.
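For context, my input is essentially the Hello World config plus that flag. A minimal sketch of what I'm passing (the paths and modality are placeholders, and the "multi_gpu" key is exactly as I wrote it above):

```python
from monai.apps.auto3dseg import AutoRunner

# Placeholder paths; the real datalist/dataroot live on our NFS share.
input_cfg = {
    "modality": "CT",
    "datalist": "./sim_datalist.json",
    "dataroot": "./sim_dataroot",
    "multi_gpu": True,  # the flag I added after the first failed run
}

runner = AutoRunner(work_dir="./work_dir", input=input_cfg)
runner.run()
```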
Hi @pwrightkcl, I couldn't reproduce the issue. Could you please try this notebook first (without modification, it will automatically use all available devices) to see whether you can train with multiple GPUs? https://github.com/Project-MONAI/MONAI/blob/8e134b8cb92e3c624b23d4d10c5d4596bb5b9d9b/monai/apps/auto3dseg/auto_runner.py#L544C8-L544C8
Hi @KumoLiu
I have tried the code (minus the matplotlib parts, because I'm running inside a Docker container) and get the same error. The Docker image I'm using is based on projectmonai/monai:latest. I have to use torch==1.13 to match our cluster's CUDA version, so this may be a CUDA version issue; I added a line to print it (11.7). Our cluster is being upgraded shortly, so if you think it's CUDA I'll try again after the upgrade (up to a week from now).
Here's the output. I notice that the first part of the output appears twice, possibly something to do with the parallelisation. I have not set the OMP_NUM_THREADS environment variable for this run, but setting it in the past didn't make any difference.
missing cuda symbols while dynamic loading
cuFile initialization failed
11.7
MONAI version: 1.3.0+30.gdfe0b409
Numpy version: 1.22.2
Pytorch version: 1.13.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: dfe0b4093546d876ae99421df3130b81f67824e0
MONAI __file__: /opt/monai/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.22.0
scipy version: 1.11.1
Pillow version: 9.2.0
Tensorboard version: 2.9.0
gdown version: 4.7.1
TorchVision version: 0.14.0+cu117
tqdm version: 4.65.0
lmdb version: 1.4.1
psutil version: 5.9.4
pandas version: 1.5.2
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.8.1
pynrrd version: 1.0.0
clearml version: 1.13.2
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
2023-11-23 16:47:32,727 - INFO - AutoRunner using work directory /nfs/project/WellcomeHDN/kch-ct-ischaemic/derivatives/autoseg3d/test
2023-11-23 16:47:32,734 - INFO - Setting num_fold 3 based on the input datalist /nfs/project/WellcomeHDN/kch-ct-ischaemic/derivatives/autoseg3d/test/sim_datalist.json.
2023-11-23 16:47:32,757 - INFO - Using user defined command running prefix , will override other settings
2023-11-23 16:47:32,758 - INFO - Running data analysis...
2023-11-23 16:47:32,758 - INFO - Found 3 GPUs for data analyzing!
missing cuda symbols while dynamic loading
cuFile initialization failed
11.7
MONAI version: 1.3.0+30.gdfe0b409
Numpy version: 1.22.2
Pytorch version: 1.13.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: dfe0b4093546d876ae99421df3130b81f67824e0
MONAI __file__: /opt/monai/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.22.0
scipy version: 1.11.1
Pillow version: 9.2.0
Tensorboard version: 2.9.0
gdown version: 4.7.1
TorchVision version: 0.14.0+cu117
tqdm version: 4.65.0
lmdb version: 1.4.1
psutil version: 5.9.4
pandas version: 1.5.2
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.8.1
pynrrd version: 1.0.0
clearml version: 1.13.2
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
2023-11-23 16:47:38,652 - INFO - AutoRunner using work directory /nfs/project/WellcomeHDN/kch-ct-ischaemic/derivatives/autoseg3d/test
2023-11-23 16:47:38,656 - INFO - Setting num_fold 3 based on the input datalist /nfs/project/WellcomeHDN/kch-ct-ischaemic/derivatives/autoseg3d/test/sim_datalist.json.
2023-11-23 16:47:38,679 - INFO - Using user defined command running prefix , will override other settings
2023-11-23 16:47:38,679 - INFO - Running data analysis...
2023-11-23 16:47:38,679 - INFO - Found 3 GPUs for data analyzing!
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
code = _serve_one(child_r, fds,
File "/usr/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
code = spawn._main(child_r, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/nfs/project/WellcomeHDN/kch-ct-ischaemic/code/derivatives/autoseg3d/test.py", line 93, in <module>
runner.run()
File "/opt/monai/monai/apps/auto3dseg/auto_runner.py", line 792, in run
da.get_all_case_stats()
File "/opt/monai/monai/apps/auto3dseg/data_analyzer.py", line 214, in get_all_case_stats
with tmp_ctx.Manager() as manager:
File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in Manager
m.start()
File "/usr/lib/python3.10/multiprocessing/managers.py", line 562, in start
self._process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 300, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_forkserver.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "/nfs/project/WellcomeHDN/kch-ct-ischaemic/code/derivatives/autoseg3d/test.py", line 93, in <module>
runner.run()
File "/opt/monai/monai/apps/auto3dseg/auto_runner.py", line 792, in run
da.get_all_case_stats()
File "/opt/monai/monai/apps/auto3dseg/data_analyzer.py", line 214, in get_all_case_stats
with tmp_ctx.Manager() as manager:
File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in Manager
m.start()
File "/usr/lib/python3.10/multiprocessing/managers.py", line 566, in start
self._address = reader.recv()
File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
Our cluster has been updated to the latest Nvidia drivers, so I tried training with multiple GPUs again, using the latest MONAI Docker image, but got the same errors. Attaching the logs to save space: autoseg3d-train-affine-f40fc32f4aea.log debug_20231204.log
@KumoLiu I had another look at the test code you gave and saw it sets the CUDA_VISIBLE_DEVICES environment variable. When I set that, it works, and says "Found 1 GPUs for data analyzing!" as expected and runs to completion. When I omit the environment variable and add "multi_gpu": True to the input dict, it fails with the same error as before. Both runs used the same submission parameters, requesting three GPUs to match the three folds in the dummy data.
I'm attaching the logs and you can see that the multi GPU version gets to "Found 3 GPUs for data analyzing!" then repeats the config info before crashing the second time it reaches "Found 3 GPUs". So it looks like the script itself is running twice. This is similar to the training script above, which repeats the first log lines, but doesn't have the config line like the test script. Does this help diagnose what is going wrong?
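To be explicit, the working single-GPU variant just pins the device before constructing the runner; a rough sketch of that version (everything else is the same script as above):

```python
import os

# Restrict the run to a single visible GPU before any CUDA work happens.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from monai.apps.auto3dseg import AutoRunner

# Same placeholder config as before, just without "multi_gpu".
input_cfg = {"modality": "CT", "datalist": "./sim_datalist.json", "dataroot": "./sim_dataroot"}

runner = AutoRunner(work_dir="./work_dir", input=input_cfg)
runner.run()
```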
Hi @pwrightkcl, perhaps the issue is that DataAnalyzer does not work in a multi-node setting. Could you please try AutoRunner without the data analysis step?
Thank you for the suggestion. I'm new to Auto3DSeg, so can you clarify what you want me to try? I have seen the tutorial breaking down the components of AutoRunner, so I could run DataAnalyzer first on one GPU and then run the other steps. Is that what you mean, or is there an input to AutoRunner that tells it to skip the DataAnalyzer step?
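For example, is this the sort of thing you have in mind? (Just a guess on my part; I'm assuming the constructor exposes an analyze flag, which I haven't checked.)

```python
from monai.apps.auto3dseg import AutoRunner

# Guess at the intended usage: skip the DataAnalyzer stage and let the rest
# of the pipeline (algo generation, training, ensembling) run as normal.
runner = AutoRunner(
    work_dir="./work_dir",
    input=input_cfg,  # same input dict as in my earlier posts
    analyze=False,    # assumed flag name -- is this what you mean?
)
runner.run()
```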
I'd be interested if anyone can replicate this problem, since I was able to elicit it just using the Hello World demo.
@ericspod advised me to put my script inside a main() function and then add freeze_support() at the top of that, as suggested in the error message. That appears to have fixed the problem when I try it on the Hello World example.
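For anyone hitting the same thing, the shape of the fix is roughly this (a sketch of my script, not the exact file; the config values are the usual placeholders):

```python
from multiprocessing import freeze_support

from monai.apps.auto3dseg import AutoRunner


def main():
    # freeze_support() is a no-op unless the script is frozen into an executable,
    # but keeping all of the work inside main() means the module can be re-imported
    # safely by the forkserver/spawn child processes, which is what the error asks for.
    freeze_support()
    input_cfg = {
        "modality": "CT",
        "datalist": "./sim_datalist.json",
        "dataroot": "./sim_dataroot",
        "multi_gpu": True,
    }
    runner = AutoRunner(work_dir="./work_dir", input=input_cfg)
    runner.run()


if __name__ == "__main__":
    main()
```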
Here's the log:
autoseg3d-test-multi-75dd8bd6d2db.log
This log brings up two related questions about multi-GPU runs (see the sketch after this list for the first one):
1. The log advises setting OMP_NUM_THREADS for each process. It sounds like 'process' here means the worker for each GPU, so should I set it to the number of threads I want to use divided by the number of GPUs I am using?
2. The log also says "There appear to be 12 leaked semaphore objects to clean up at shutdown". What does this mean, and should I be doing something about it?
I realise that these questions, although related, are outside the specific issue, so I can move them to Discussions if you prefer.
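Concretely, the rule of thumb I'm asking about in the first question would be something like this (my guess, not something I've found in the docs):

```python
import os

import torch

# Split the machine's CPU threads evenly across the per-GPU worker processes.
total_threads = os.cpu_count() or 1
num_workers = max(torch.cuda.device_count(), 1)
os.environ["OMP_NUM_THREADS"] = str(max(total_threads // num_workers, 1))
```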
@KumoLiu I tried Auto3DSeg with multiple GPUs and indeed it fails with the same stack trace in DataAnalyzer. So I ran the data analysis with CUDA_VISIBLE_DEVICES=0; once the data analysis was done, I killed the process and restarted it with multiple GPUs. From then on it ran fine.
@kretes Sorry for the slow response (just got back from leave). Just to confirm, the solution for me was to put my code in a main() function and to add freeze_support() as its first line. This prevented DataAnalyzer from generating that error, and should be an easier fix than breaking and resuming your pipeline.
Describe the bug: When I run Auto3DSeg with multiple GPUs, it gives an error about one process being started before another has finished its bootstrapping phase. It runs fine with a single GPU.
To Reproduce: Steps to reproduce the behavior: I am using a Docker image built on the latest MONAI image and submitting it to RunAI with --gpu 4.
Expected behavior: The runner does the data analysis and begins to train.
Here is a log where training starts as expected when requesting only a single GPU.
Screenshots Here is the log:
Environment
Ensuring you use the relevant python executable, please paste the output of:
Note that the command above didn't work, so I had to make a little .py script with each command on one line. The output is for whatever node the debug job was assigned to on our cluster. I requested the same resources as for the training job that failed, but it may not be the same machine as the one that ran my training script, as there are three different kinds of DGXs on our cluster.
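For reference, the little script was roughly this (an approximation, not the exact file I ran):

```python
# Rough equivalent of the issue template's shell snippet, one command per line.
import torch
from monai.config import print_config

print(torch.version.cuda)  # prints "11.7" in the logs above
print_config()             # prints the MONAI version and optional dependency info
```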
Additional context CC: @marksgraham