pwrightkcl opened this issue 1 year ago
Has anyone seen this? I can add the configs I used if needed.
Hi @pwrightkcl, did you try MONAI with the latest version? The Auto3DSeg modules have supported multiple GPUs since MONAI 1.2.
I tried making sure my Docker image has MONAI 1.3 and reran, but got the same error. I can post the debug info and error log again if you like. The first time, I didn't realise I needed "multi_gpu": True in the config, but I added that and still no joy.
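For context, my input is essentially the Hello World config plus that flag. A minimal sketch of what I'm passing (the paths and modality are placeholders, and the "multi_gpu" key is exactly as I wrote it above):

```python
from monai.apps.auto3dseg import AutoRunner

# Placeholder paths; the real datalist/dataroot live on our NFS share.
input_cfg = {
    "modality": "CT",
    "datalist": "./sim_datalist.json",
    "dataroot": "./sim_dataroot",
    "multi_gpu": True,  # the flag I added after the first failed run
}

runner = AutoRunner(work_dir="./work_dir", input=input_cfg)
runner.run()
```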
Hi @pwrightkcl, I couldn't reproduce the issue. Could you please try this notebook first (without modification, it will automatically use all available devices) to see whether you can train with multiple GPUs? https://github.com/Project-MONAI/MONAI/blob/8e134b8cb92e3c624b23d4d10c5d4596bb5b9d9b/monai/apps/auto3dseg/auto_runner.py#L544C8-L544C8
Hi @KumoLiu
I have tried the code (minus the matplotlib parts, because I'm running inside a Docker container) and get the same error. The Docker image I'm using is based on projectmonai/monai:latest. I have to use torch==1.13 to match our cluster's CUDA version, so this may be a CUDA version issue; I added a line to print it (11.7). Our cluster is being upgraded shortly, so if you think it's CUDA I'll try again after the upgrade (up to a week from now).
Here's the output. I notice that the first part of the output appears twice, possibly something to do with the parallelisation. I have not set the OMP_NUM_THREADS environment variable for this run, but setting it in the past didn't make any difference.
missing cuda symbols while dynamic loading
cuFile initialization failed
11.7
MONAI version: 1.3.0+30.gdfe0b409
Numpy version: 1.22.2
Pytorch version: 1.13.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: dfe0b4093546d876ae99421df3130b81f67824e0
MONAI __file__: /opt/monai/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.22.0
scipy version: 1.11.1
Pillow version: 9.2.0
Tensorboard version: 2.9.0
gdown version: 4.7.1
TorchVision version: 0.14.0+cu117
tqdm version: 4.65.0
lmdb version: 1.4.1
psutil version: 5.9.4
pandas version: 1.5.2
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.8.1
pynrrd version: 1.0.0
clearml version: 1.13.2
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
2023-11-23 16:47:32,727 - INFO - AutoRunner using work directory /nfs/project/WellcomeHDN/kch-ct-ischaemic/derivatives/autoseg3d/test
2023-11-23 16:47:32,734 - INFO - Setting num_fold 3 based on the input datalist /nfs/project/WellcomeHDN/kch-ct-ischaemic/derivatives/autoseg3d/test/sim_datalist.json.
2023-11-23 16:47:32,757 - INFO - Using user defined command running prefix , will override other settings
2023-11-23 16:47:32,758 - INFO - Running data analysis...
2023-11-23 16:47:32,758 - INFO - Found 3 GPUs for data analyzing!
missing cuda symbols while dynamic loading
cuFile initialization failed
11.7
MONAI version: 1.3.0+30.gdfe0b409
Numpy version: 1.22.2
Pytorch version: 1.13.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: dfe0b4093546d876ae99421df3130b81f67824e0
MONAI __file__: /opt/monai/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.22.0
scipy version: 1.11.1
Pillow version: 9.2.0
Tensorboard version: 2.9.0
gdown version: 4.7.1
TorchVision version: 0.14.0+cu117
tqdm version: 4.65.0
lmdb version: 1.4.1
psutil version: 5.9.4
pandas version: 1.5.2
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.8.1
pynrrd version: 1.0.0
clearml version: 1.13.2
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
2023-11-23 16:47:38,652 - INFO - AutoRunner using work directory /nfs/project/WellcomeHDN/kch-ct-ischaemic/derivatives/autoseg3d/test
2023-11-23 16:47:38,656 - INFO - Setting num_fold 3 based on the input datalist /nfs/project/WellcomeHDN/kch-ct-ischaemic/derivatives/autoseg3d/test/sim_datalist.json.
2023-11-23 16:47:38,679 - INFO - Using user defined command running prefix , will override other settings
2023-11-23 16:47:38,679 - INFO - Running data analysis...
2023-11-23 16:47:38,679 - INFO - Found 3 GPUs for data analyzing!
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/forkserver.py", line 274, in main
code = _serve_one(child_r, fds,
File "/usr/lib/python3.10/multiprocessing/forkserver.py", line 313, in _serve_one
code = spawn._main(child_r, parent_sentinel)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.10/runpy.py", line 289, in run_path
return _run_module_code(code, init_globals, run_name,
File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/nfs/project/WellcomeHDN/kch-ct-ischaemic/code/derivatives/autoseg3d/test.py", line 93, in <module>
runner.run()
File "/opt/monai/monai/apps/auto3dseg/auto_runner.py", line 792, in run
da.get_all_case_stats()
File "/opt/monai/monai/apps/auto3dseg/data_analyzer.py", line 214, in get_all_case_stats
with tmp_ctx.Manager() as manager:
File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in Manager
m.start()
File "/usr/lib/python3.10/multiprocessing/managers.py", line 562, in start
self._process.start()
File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/lib/python3.10/multiprocessing/context.py", line 300, in _Popen
return Popen(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/lib/python3.10/multiprocessing/popen_forkserver.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.
This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
Traceback (most recent call last):
File "/nfs/project/WellcomeHDN/kch-ct-ischaemic/code/derivatives/autoseg3d/test.py", line 93, in <module>
runner.run()
File "/opt/monai/monai/apps/auto3dseg/auto_runner.py", line 792, in run
da.get_all_case_stats()
File "/opt/monai/monai/apps/auto3dseg/data_analyzer.py", line 214, in get_all_case_stats
with tmp_ctx.Manager() as manager:
File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in Manager
m.start()
File "/usr/lib/python3.10/multiprocessing/managers.py", line 566, in start
self._address = reader.recv()
File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
buf = self._recv(4)
File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
Our cluster has been updated to the latest Nvidia drivers, so I tried training with multiple GPUs again, using the latest MONAI Docker image, but got the same errors. Attaching the logs to save space: autoseg3d-train-affine-f40fc32f4aea.log debug_20231204.log
@KumoLiu I had another look at the test code you gave and saw it sets the CUDA_VISIBLE_DEVICES environment variable. When I set that, it works, and says "Found 1 GPUs for data analyzing!" as expected and runs to completion. When I omit the environment variable and add "multi_gpu": True to the input dict, it fails with the same error as before. Both runs used the same submission parameters, requesting three GPUs to match the three folds in the dummy data.
I'm attaching the logs and you can see that the multi GPU version gets to "Found 3 GPUs for data analyzing!" then repeats the config info before crashing the second time it reaches "Found 3 GPUs". So it looks like the script itself is running twice. This is similar to the training script above, which repeats the first log lines, but doesn't have the config line like the test script. Does this help diagnose what is going wrong?
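To be explicit, the working single-GPU variant just pins the device before constructing the runner; a rough sketch of that version (everything else is the same script as above):

```python
import os

# Restrict the run to a single visible GPU before any CUDA work happens.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from monai.apps.auto3dseg import AutoRunner

# Same placeholder config as before, just without "multi_gpu".
input_cfg = {"modality": "CT", "datalist": "./sim_datalist.json", "dataroot": "./sim_dataroot"}

runner = AutoRunner(work_dir="./work_dir", input=input_cfg)
runner.run()
```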
Hi @pwrightkcl, perhaps the issue is that DataAnalyzer does not work in a multi-node setting. Could you please try AutoRunner without the data analysis step?
Thank you for the suggestion. I'm new to Auto3DSeg, so can you clarify what you want me to try? I have seen the tutorial breaking down the components of AutoRunner, so I could run DataAnalyzer first on one GPU and then run the other steps. Is that what you mean, or is there an input to AutoRunner that tells it to skip the DataAnalyzer step?
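For example, is this the sort of thing you have in mind? (Just a guess on my part; I'm assuming the constructor exposes an analyze flag, which I haven't checked.)

```python
from monai.apps.auto3dseg import AutoRunner

# Guess at the intended usage: skip the DataAnalyzer stage and let the rest
# of the pipeline (algo generation, training, ensembling) run as normal.
runner = AutoRunner(
    work_dir="./work_dir",
    input=input_cfg,  # same input dict as in my earlier posts
    analyze=False,    # assumed flag name -- is this what you mean?
)
runner.run()
```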
I'd be interested if anyone can replicate this problem, since I was able to elicit it just using the Hello World demo.
@ericspod advised me to put my script inside a main() function and then add freeze_support() at the top of that, as suggested in the error message. That appears to have fixed the problem when I try it on the Hello World example.
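For anyone hitting the same thing, the shape of the fix is roughly this (a sketch of my script, not the exact file; the config values are the usual placeholders):

```python
from multiprocessing import freeze_support

from monai.apps.auto3dseg import AutoRunner


def main():
    # freeze_support() is a no-op unless the script is frozen into an executable,
    # but keeping all of the work inside main() means the module can be re-imported
    # safely by the forkserver/spawn child processes, which is what the error asks for.
    freeze_support()
    input_cfg = {
        "modality": "CT",
        "datalist": "./sim_datalist.json",
        "dataroot": "./sim_dataroot",
        "multi_gpu": True,
    }
    runner = AutoRunner(work_dir="./work_dir", input=input_cfg)
    runner.run()


if __name__ == "__main__":
    main()
```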
Here's the log:
autoseg3d-test-multi-75dd8bd6d2db.log
This log brings up two related questions about multi-GPU runs (see the sketch after this list for the first one):
1. The log advises setting OMP_NUM_THREADS for each process. It sounds like 'process' here means the worker for each GPU, so should I set it to the number of threads I want to use divided by the number of GPUs I am using?
2. The log also says "There appear to be 12 leaked semaphore objects to clean up at shutdown". What does this mean, and should I be doing something about it?
I realise that these questions, although related, are outside the specific issue, so I can move them to Discussions if you prefer.
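Concretely, the rule of thumb I'm asking about in the first question would be something like this (my guess, not something I've found in the docs):

```python
import os

import torch

# Split the machine's CPU threads evenly across the per-GPU worker processes.
total_threads = os.cpu_count() or 1
num_workers = max(torch.cuda.device_count(), 1)
os.environ["OMP_NUM_THREADS"] = str(max(total_threads // num_workers, 1))
```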
@KumoLiu I tried Auto3DSeg with multiple GPUs and indeed it fails with the same stack trace in DataAnalyzer. So I ran the data analysis with CUDA_VISIBLE_DEVICES=0; once the data analysis was done, I killed the process and restarted it with multiple GPUs. From then on it ran fine.
@kretes Sorry for the slow response (just got back from leave). Just to confirm, the solution for me was to put my code in a main() function and to add freeze_support() as its first line. This prevented DataAnalyzer from generating that error, and should be an easier fix than breaking and resuming your pipeline.
Describe the bug: When I run Auto3DSeg with multiple GPUs, it gives an error about one process being started before another has finished its bootstrapping phase. It runs fine with a single GPU.
To Reproduce: Steps to reproduce the behavior: I am using a Docker image built on the latest MONAI image and submitting it to RunAI with --gpu 4.
Expected behavior: The runner does the data analysis and begins to train.
Here is a log where training starts as expected when requesting only a single GPU.
Screenshots Here is the log:
Environment
Ensuring you use the relevant python executable, please paste the output of:
Note that the command above didn't work, so I had to make a little .py script with each command on one line. The output is for whatever node the debug job was assigned to on our cluster. I requested the same resources as for the training job that failed, but it may not be the same machine as the one that ran my training script, as there are three different kinds of DGXs on our cluster.
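For reference, the little script was roughly this (an approximation, not the exact file I ran):

```python
# Rough equivalent of the issue template's shell snippet, one command per line.
import torch
from monai.config import print_config

print(torch.version.cuda)  # prints "11.7" in the logs above
print_config()             # prints the MONAI version and optional dependency info
```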
Additional context CC: @marksgraham