Project-MONAI / tutorials

MONAI Tutorials
https://monai.io/started.html
Apache License 2.0

Auto3DSeg in Azure fails - new to Auto3DSeg #1554

Open rfrs opened 10 months ago

rfrs commented 10 months ago

Dear all, I am starting to use Auto3DSeg to develop a segmentation model for vertebrae from CT images. I am using an A100 GPU in an Azure environment.

I installed PyTorch via conda and installed MONAI from the git repository (git clone https://github.com/Project-MONAI/MONAI.git). I created the YAML file and the file paths have been properly assigned. Yet I get the following (long) error when I try to run the pipeline.

(monai) azureuser@rs-a100b:~/cloudfiles/private-info$ python -m monai.apps.auto3dseg AutoRunner run --input="/home/azureuser/cloudfiles/private-info/AutoSeg3D/Spine1/Spine1.yaml"
 missing cuda symbols while dynamic loading
 cuFile initialization failed
2023-10-26 06:50:05,063 - INFO - AutoRunner using work directory ./work_dir
2023-10-26 06:50:05,117 - INFO - Loading input config /home/azureuser/cloudfiles/private-info/AutoSeg3D/Spine1/Spine1.yaml
2023-10-26 06:50:05,361 - INFO - Datalist was copied to work_dir: /mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/work_dir/spine1_folds.json
2023-10-26 06:50:05,379 - INFO - Setting num_fold 1 based on the input datalist /mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/work_dir/spine1_folds.json.
2023-10-26 06:50:05,793 - INFO - Using user defined command running prefix , will override other settings
2023-10-26 06:50:05,803 - INFO - Running data analysis...
2023-10-26 06:50:05,806 - INFO - Found 1 GPUs for data analyzing!
  0%|                                                                                                                                                                                       | 0/307 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/anaconda/envs/monai/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/anaconda/envs/monai/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/apps/auto3dseg/__main__.py", line 24, in <module>
    fire.Fire(
  File "/anaconda/envs/monai/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/anaconda/envs/monai/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/anaconda/envs/monai/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/apps/auto3dseg/auto_runner.py", line 743, in run
    da.get_all_case_stats()
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/apps/auto3dseg/data_analyzer.py", line 230, in get_all_case_stats
    result_bycase = self._get_all_case_stats(0, 1, None, key, transform_list)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/apps/auto3dseg/data_analyzer.py", line 333, in _get_all_case_stats
    for batch_data in tqdm(dataloader) if (has_tqdm and rank == 0) else dataloader:
  File "/anaconda/envs/monai/lib/python3.9/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/anaconda/envs/monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/anaconda/envs/monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/anaconda/envs/monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/anaconda/envs/monai/lib/python3.9/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/transform.py", line 141, in apply_transform
    return _apply_transform(transform, data, unpack_items, lazy, overrides, log_stats)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/transform.py", line 98, in _apply_transform
    return transform(data, lazy=lazy) if isinstance(transform, LazyTrait) else transform(data)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/io/dictionary.py", line 161, in __call__
    for key, meta_key, meta_key_postfix in self.key_iterator(d, self.meta_keys, self.meta_key_postfix):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/transform.py", line 475, in key_iterator
    raise KeyError(
KeyError: 'Key `image` of transform `LoadImaged` was missing in the data and allow_missing_keys==False.'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/transform.py", line 141, in apply_transform
    return _apply_transform(transform, data, unpack_items, lazy, overrides, log_stats)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/transform.py", line 98, in _apply_transform
    return transform(data, lazy=lazy) if isinstance(transform, LazyTrait) else transform(data)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/compose.py", line 335, in __call__
    result = execute_compose(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/compose.py", line 111, in execute_compose
    data = apply_transform(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/transform.py", line 171, in apply_transform
    raise RuntimeError(f"applying transform {transform}") from e
RuntimeError: applying transform <monai.transforms.io.dictionary.LoadImaged object at 0x7f36a37dd6d0>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/anaconda/envs/monai/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/anaconda/envs/monai/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/anaconda/envs/monai/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/data/dataset.py", line 112, in __getitem__
    return self._transform(index)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/data/dataset.py", line 98, in _transform
    return apply_transform(self.transform, data_i) if self.transform is not None else data_i
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/rs-a100b/private-info/MONAI-git/monai/transforms/transform.py", line 171, in apply_transform
    raise RuntimeError(f"applying transform {transform}") from e
RuntimeError: applying transform <monai.transforms.compose.Compose object at 0x7f36a37dda30>

Your help in troubleshooting this is very much appreciated, especially since I am new to MONAI/Auto3DSeg.

Thanks for all.

Best Rui

KumoLiu commented 10 months ago

Hi @rfrs, from the error message, it looks like your data wasn't prepared quite correctly. You could refer to the "Simulate a special dataset" section in this tutorial: https://github.com/Project-MONAI/tutorials/blob/main/auto3dseg/notebooks/auto3dseg_hello_world.ipynb

Hope it helps, thanks!

rfrs commented 10 months ago

Dear @KumoLiu, thanks for the prompt answer.

Could you, for example, show how the file (image and label) paths should look in the JSON file?

Best Rui

KumoLiu commented 10 months ago

Could you, for example, show how the file (image and label) paths should look in the JSON file?

https://github.com/Project-MONAI/tutorials/blob/main/auto3dseg/tasks/msd/Task09_Spleen/msd_task09_spleen_folds.json

rfrs commented 10 months ago

Dear @KumoLiu, thanks. I am still lost on how the pipeline will identify the files. In the datalist JSON file I use full paths; an example follows below:


"testing": [
        {
            "imageTesting": "/home/azureuser/cloudfiles/code/Users/.../AutoSeg3D/Spine1/imagesTs/spine1_val_000.nii.gz"
        },
        {
            "imageTesting": "/home/azureuser/cloudfiles/code/Users/.../AutoSeg3D/Spine1/imagesTs/spine1_val_001.nii.gz"
        },

The ... represents my username, which I omit here.

What is wrong there? Thank you so much.

Best Rui

KumoLiu commented 10 months ago

Hi @rfrs, the key in the JSON should be "image" instead of "imageTesting". Also note that the cause of the error is the KeyError:

KeyError: 'Key `image` of transform `LoadImaged` was missing in the data and allow_missing_keys==False.'
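
For reference, a minimal sketch (reusing your testing path, with hypothetical training entries) of a datalist that matches what the transforms expect:

import json

# Sketch of a datalist using the expected keys: training entries carry
# "image", "label" and a "fold" index; testing entries only need "image".
# The testing path is taken from your example; the rest are placeholders.
datalist = {
    "training": [
        {"fold": 0, "image": "imagesTr/spine1_000.nii.gz", "label": "labelsTr/spine1_000.nii.gz"},
        {"fold": 1, "image": "imagesTr/spine1_001.nii.gz", "label": "labelsTr/spine1_001.nii.gz"},
    ],
    "testing": [
        {"image": "/home/azureuser/cloudfiles/code/Users/.../AutoSeg3D/Spine1/imagesTs/spine1_val_000.nii.gz"},
    ],
}

with open("spine1_folds.json", "w") as f:
    json.dump(datalist, f, indent=4)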

rfrs commented 10 months ago

Thanks @KumoLiu for your prompt answer. It was a rookie mistake :(

With that change the data was found, but another problem appeared...

The error log is as follows:

(monai) azureuser@...:~/cloudfiles/code/Users/...$ python -m monai.apps.auto3dseg AutoRunner run --input="/home/azureuser/cloudfiles/code/Users/.../Auto3DSeg/Spine1/Spine1.yaml"
 missing cuda symbols while dynamic loading
 cuFile initialization failed
2023-10-26 13:42:44,397 - INFO - AutoRunner using work directory ./work_dir
2023-10-26 13:42:44,639 - INFO - Loading input config /home/azureuser/cloudfiles/code/Users/.../Auto3DSeg/Spine1/Spine1.yaml
2023-10-26 13:42:44,853 - INFO - Datalist was copied to work_dir: /mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/spine1_folds.json
2023-10-26 13:42:44,862 - INFO - Setting num_fold 1 based on the input datalist /mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/spine1_folds.json.
2023-10-26 13:42:45,261 - INFO - Using user defined command running prefix , will override other settings
2023-10-26 13:42:45,262 - INFO - Running data analysis...
2023-10-26 13:42:45,266 - INFO - Found 1 GPUs for data analyzing!
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 307/307 [07:58<00:00,  1.56s/it]
2023-10-26 13:50:46,983 - INFO - Data spacing is not completely uniform. MONAI transforms may provide unexpected result
2023-10-26 13:50:46,983 - INFO - Writing data stats to /mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/datastats.yaml.
2023-10-26 13:50:47,215 - INFO - Writing by-case data stats to /mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/datastats_by_case.yaml, this may take a while.
2023-10-26 13:50:50,862 - INFO - BundleGen from https://github.com/Project-MONAI/research-contributions/releases/download/algo_templates/249bf4b.tar.gz
algo_templates.tar.gz: 104kB [00:00, 173kB/s]                                                                                                                                                                       
2023-10-26 13:50:51,480 - INFO - Downloaded: /tmp/tmpggt4hakn/algo_templates.tar.gz
2023-10-26 13:50:51,480 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmpggt4hakn/algo_templates.tar.gz.
2023-10-26 13:50:51,484 - INFO - Writing into directory: /mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir.
2023-10-26 13:51:19,093 - INFO - Generated:/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0
2023-10-26 13:51:26,452 - INFO - Generated:/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/segresnet_0
2023-10-26 13:51:26,598 - INFO - segresnet2d_0 is skipped! SegresNet2D is skipped due to median spacing of [1.5, 1.5, 1.5],  which means the dataset is not highly anisotropic, e.g. spacing[2] < 3*(spacing[0] + spacing[1])/2) .
2023-10-26 13:51:34,986 - INFO - Generated:/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/swinunetr_0
2023-10-26 13:51:36,567 - INFO - ['python', '/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/scripts/train.py', 'run', "--config_file='/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/hyper_parameters.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/hyper_parameters_search.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/network.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/network_search.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/transforms_infer.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/transforms_train.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/transforms_validate.yaml'"]
 missing cuda symbols while dynamic loading
 cuFile initialization failed
monai.transforms.croppad.dictionary CropForegroundd.__init__:allow_smaller: Current default value of argument `allow_smaller=True` has been deprecated since version 1.2. It will be changed to `allow_smaller=False` in version 1.5.
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/scripts/train.py", line 984, in <module>
    fire.Fire()
  File "/anaconda/envs/monai/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/anaconda/envs/monai/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/anaconda/envs/monai/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/scripts/train.py", line 389, in run
    train_loader = DataLoader(
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../MONAI-git/monai/data/dataloader.py", line 106, in __init__
    super().__init__(dataset=dataset, num_workers=num_workers, **kwargs)
  File "/anaconda/envs/monai/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 349, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
  File "/anaconda/envs/monai/lib/python3.9/site-packages/torch/utils/data/sampler.py", line 140, in __init__
    raise ValueError(f"num_samples should be a positive integer value, but got num_samples={self.num_samples}")
ValueError: num_samples should be a positive integer value, but got num_samples=0
Traceback (most recent call last):
  File "/anaconda/envs/monai/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/anaconda/envs/monai/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../MONAI-git/monai/apps/auto3dseg/__main__.py", line 24, in <module>
    fire.Fire(
  File "/anaconda/envs/monai/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/anaconda/envs/monai/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/anaconda/envs/monai/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../MONAI-git/monai/apps/auto3dseg/auto_runner.py", line 806, in run
    self._train_algo_in_sequence(history)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../MONAI-git/monai/apps/auto3dseg/auto_runner.py", line 658, in _train_algo_in_sequence
    algo.train(self.train_params, self.device_setting)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/algorithm_templates/dints/scripts/algo.py", line 490, in train
    return self._run_cmd(cmd, devices_info)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../MONAI-git/monai/apps/auto3dseg/bundle_gen.py", line 255, in _run_cmd
    return run_cmd(cmd.split(), run_cmd_verbose=True, env=ps_environ, check=True)
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../MONAI-git/monai/utils/misc.py", line 874, in run_cmd
    return subprocess.run(cmd_list, **kwargs)
  File "/anaconda/envs/monai/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', '/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/scripts/train.py', 'run', "--config_file='/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/hyper_parameters.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/hyper_parameters_search.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/network.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/network_search.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/transforms_infer.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/transforms_train.yaml,/mnt/batch/tasks/shared/LS_root/mounts/clusters/.../code/Users/.../work_dir/dints_0/configs/transforms_validate.yaml'"]' returned non-zero exit status 1.

Any further suggestions?

Thanks a lot.

rfrs commented 10 months ago

Also, is there a way to specify the number of epochs, for example in the YAML file?

Best

diazandr3s commented 10 months ago

Can you share the JSON and YAML files you're using to train Auto3DSeg? How many folds did you create?

rfrs commented 10 months ago

files.zip

Hi @diazandr3s, thanks for the message.

I added the JSON and the YAML files in the zipped folder.

Also, I was trying to do fold 0 only, so one fold, thus training on all the data in one go. I was doing the same with nnUNet. Could this be the problem as well?

Also, I ask again: can I set from the beginning the number of epochs the training will take?

Thanks for all the help.

Best

Rui

KumoLiu commented 10 months ago

Hi @rfrs,

I strongly recommend that you use the fake data and follow this tutorial to run through the process first, understand Auto3DSeg, and then carry out your own tasks.
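
For instance, a rough sketch of simulating a tiny fake dataset, close to what the notebook's "Simulate a special dataset" section does (image sizes, counts and paths are placeholders):

import os
import nibabel as nib
import numpy as np
from monai.data import create_test_image_3d

# Generate a handful of small synthetic image/label pairs so the whole
# Auto3DSeg pipeline can be exercised quickly before using real data.
dataroot = "./sim_dataroot"
os.makedirs(dataroot, exist_ok=True)
for i in range(12):
    image, seg = create_test_image_3d(64, 64, 64, rad_max=10, num_seg_classes=1)
    nib.save(nib.Nifti1Image(image.astype(np.float32), np.eye(4)), os.path.join(dataroot, f"img{i:02d}.nii.gz"))
    nib.save(nib.Nifti1Image(seg.astype(np.uint8), np.eye(4)), os.path.join(dataroot, f"seg{i:02d}.nii.gz"))

Hope it helps, thanks!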

diazandr3s commented 10 months ago

Hi @rfrs,

Thanks for sending the files. I see from the JSON file you sent that it only has fold 0, and Auto3DSeg needs at least 2 folds; this is a known issue. Please create a JSON file with at least 2 folds, trigger the training again and let us know.
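
As a sketch (the file name is hypothetical), one way to spread the existing training cases over two folds and rewrite the datalist:

import json

# Alternate the fold index over the training entries so the datalist
# contains folds 0 and 1 instead of fold 0 only.
with open("spine1_folds.json") as f:
    datalist = json.load(f)

for i, entry in enumerate(datalist["training"]):
    entry["fold"] = i % 2

with open("spine1_folds.json", "w") as f:
    json.dump(datalist, f, indent=4)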

rfrs commented 10 months ago

Dear @diazandr3s and @KumoLiu, thank you so much for your support so far.

Following the tutorial and also having a minimum of 2 folds sorted the issues, although I am still struggling with errors at the ensembling step: not enough vRAM, despite having an A100 80 GB GPU in Azure.

Two questions: 1) For the number of epochs, something is not clear to me. Does each epoch have 40000 iterations as stated here?

2) How can I assign the work_dir to a specific folder? I already tried to follow what is done in the tutorial, but it does not seem to save there and always creates a work_dir outside the folder I assigned.

Thank you for all.

Best wishes.

diazandr3s commented 10 months ago

Hi @rfrs,

Thanks for the update. Regarding your questions/comments:

Following the tutorial and also having a minimum of 2 folds sorted the issues, although I am still struggling with errors at the ensembling step: not enough vRAM, despite having an A100 80 GB GPU in Azure.

Do you mean for inference or training?

For the number of epochs, something is not clear to me. Does each epoch have 40000 iterations as stated here?

The number of epochs and other hyperparameters are updated/changed after the data analysis, as explained here: https://github.com/Project-MONAI/tutorials/blob/0838cd65f445fa73d72d113eca97819e7f4098f3/auto3dseg/docs/algorithm_generation.md?plain=1#L100

How can I assign the work_dir to a specific folder? I already tried to follow what is done in the tutorial, but it does not seem to save there and always creates a work_dir outside the folder I assigned.

Yes, you can specify a different work_dir path.

Here is an example of how you can use Auto3DSeg with a single backbone network (segresnet) and a specific work dir:

python -m monai.apps.auto3dseg AutoRunner run --input /PATH_TO_YAML_FILE.yaml --work_dir /MY_WORK_DIR_PATH --algos segresnet
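
The same can be done from Python, which also shows one way to cap the number of epochs; a sketch with placeholder paths (the "num_epochs" key mirrors the --num_epochs override visible in your training log):

from monai.apps.auto3dseg import AutoRunner

# Restrict Auto3DSeg to one algorithm, point it at a specific work_dir
# and override the number of training epochs.
runner = AutoRunner(
    work_dir="/MY_WORK_DIR_PATH",
    input="/PATH_TO_YAML_FILE.yaml",
    algos="segresnet",
)
runner.set_training_params({"num_epochs": 100})
runner.run()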

Hope this helps,

rfrs commented 10 months ago

Dear @diazandr3s, thank you so much.

It worked for me: I could set the work_dir and also the epoch number. Auto3DSeg is working for me in Azure and I could generate models.

One more question. I am again a bit lost: how can I load a trained .pt model (generated by Auto3DSeg) in MONAI Label via monailabel start_server? What is the needed folder structure to load my model directly?

Thanks, best, Rui

rfrs commented 10 months ago

I have been following the tutorial here https://github.com/Project-MONAI/MONAILabel/tree/main/sample-apps#radiology and created the files, yet I get an error message stating 'lib.configs.spine4.py'; 'lib.configs.spine4' is not a package. In each lib folder (infer, train and configs) I created a file (a clone of the spleen segmentation with modifications)... Is anything else needed?

diazandr3s commented 10 months ago

Hi @rfrs,

This is a very good question! In theory, Auto3DSeg creates bundle models (https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/auto3dseg/auto_runner.py#L809-L816), so you should be able to use the monaibundle app in MONAI Label to consume a model generated by Auto3DSeg. However, this should only work for a single model. Remember that Auto3DSeg creates an ensemble of models, and using all of them as an ensemble in MONAI Label is still not supported.

Another way of consuming one model in MONAI Label is to modify the Radiology app's Segmentation model. For this, you should update the label names and indexes, the network architecture and the pre-transforms.
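
As a rough illustration of what that adaptation must reproduce: the network class, its constructor arguments and the inference window have to match the generated bundle's configs. All values and paths below are hypothetical:

import torch
from monai.inferers import sliding_window_inference
from monai.networks.nets import SegResNet

# Hypothetical network settings; take the real ones from the bundle's
# hyper_parameters.yaml / network.yaml.
net = SegResNet(init_filters=32, in_channels=1, out_channels=2)

state = torch.load("work_dir/segresnet_0/model_fold0/best_metric_model.pt", map_location="cpu")
if "state_dict" in state:  # some checkpoints wrap the weights
    state = state["state_dict"]
net.load_state_dict(state)
net.eval()

with torch.no_grad():
    image = torch.rand(1, 1, 96, 96, 96)  # stand-in for a preprocessed CT volume
    pred = sliding_window_inference(image, roi_size=(96, 96, 96), sw_batch_size=1, predictor=net)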

Please give it a try. Otherwise, I'd suggest we move this conversation (consuming bundle models generated by Auto3DSeg in MONAI Label) to the MONAI Core repo.

I hope this makes sense.

rfrs commented 10 months ago

Dear @diazandr3s, thanks for the reply. From the different models (ensemble) that Auto3DSeg produced, I chose the one with the best metrics, in this case the unetr. Then I tried to follow the example here https://github.com/Project-MONAI/MONAILabel/tree/main/sample-apps#radiology but got the error 'lib.configs.spine4.py'; 'lib.configs.spine4' is not a package. I had cloned the radiology app folder and duplicated the segmentation scripts for train, infer and config as you just mentioned, yet I get that error saying the model is not a package.

How can we move this discussion into the MONAI Core repo? Thanks. Best

diazandr3s commented 10 months ago

Hi @rfrs,

I'd suggest we start a discussion in the MONAI Core repo with a title like: "Using one bundle model generated by Auto3DSeg in MONAI Label".

Once you have created the discussion, please link this conversation so others can also comment there.

rfrs commented 9 months ago

Dear @diazandr3s and @KumoLiu, once more I have issues when running Auto3DSeg in an Azure/cloud setting. At the moment we are running on a cluster composed of 2 compute instances, each with one V100 GPU. I am getting the following error:

Type of images being analysed:
ct
Path for the datalist json file:
Users/.../Auto3DSeg/spine3_TS/cluster_folds.json
Path for the working directory:
Users/.../Auto3DSeg/spine3_TS/work_dir
Path for the base directory:
Users/.../Auto3DSeg/spine3_TS
Choose the DNN architecture for training: DINTS (dinst), SegResNet (segresnet) or UNet (swinunetr):
net
2023-12-07 07:10:06,866 - INFO - AutoRunner using work directory /media/.../Users/.../Auto3DSeg/spine3_TS/work_dir
2023-12-07 07:10:07,077 - INFO - Datalist was copied to work_dir: /media/.../Users/.../Auto3DSeg/spine3_TS/work_dir/cluster_folds.json
2023-12-07 07:10:07,086 - INFO - Setting num_fold 2 based on the input datalist /media/.../Users/.../Auto3DSeg/spine3_TS/work_dir/cluster_folds.json.
2023-12-07 07:10:07,227 - INFO - Using user defined command running prefix , will override other settings

Maximum number of epochs:
100

Number of folds is set by the datalist file!

2023-12-07 07:10:11,914 - INFO - Running data analysis...
2023-12-07 07:10:11,918 - INFO - Found 2 GPUs for data analyzing!
Importing dependencies for Auto3DSeg.
 missing cuda symbols while dynamic loading
 cuFile initialization failed
MONAI version: 1.3.0
Numpy version: 1.26.0
Pytorch version: 2.1.1
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 865972f7a791bf7b42efbcd87c8402bd865b329e
MONAI __file__: /home/DMU/anaconda3/envs/monai9/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.22.0
scipy version: 1.11.4
Pillow version: 10.0.1
Tensorboard version: 2.15.1
gdown version: 4.7.1
TorchVision version: 0.16.1
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.6
pandas version: 2.1.3
einops version: 0.7.0
transformers version: 4.35.2
mlflow version: 2.9.0
pynrrd version: 1.0.0
clearml version: 1.13.2

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

Type of images being analysed:
Traceback (most recent call last):
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/forkserver.py", line 274, in main
    code = _serve_one(child_r, fds,
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/forkserver.py", line 313, in _serve_one
    code = spawn._main(child_r, parent_sentinel)
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/media/.../Users/.../Auto3DSeg/runnerCluster.py", line 23, in <module>
    modality = str(input("Type of images being analysed:\n"))
EOFError: EOF when reading a line
Traceback (most recent call last):
  File "/media/.../Users/.../Auto3DSeg/runnerCluster.py", line 78, in <module>
    runner.run()
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/site-packages/monai/apps/auto3dseg/auto_runner.py", line 743, in run
    da.get_all_case_stats()
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/site-packages/monai/apps/auto3dseg/data_analyzer.py", line 214, in get_all_case_stats
    with tmp_ctx.Manager() as manager:
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/context.py", line 57, in Manager    m.start()
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/managers.py", line 558, in start    self._address = reader.recv()
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/DMU/anaconda3/envs/monai9/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

So far, in all my MONAI/Auto3DSeg tests I never had multiprocessing errors, so I am really not sure what is wrong... Any help is greatly appreciated.

Thanks for all.

Best wishes Rui

diazandr3s commented 9 months ago

Hi @rfrs,

This is strange. Have you tried using the MONAI Docker container? You could start a bash session inside the MONAI container and map the working folder with this command:

docker run --gpus all --rm -ti --ipc=host --net=host -v /LOCAL_FOLDER:/DOCKER_WORKING_FOLDER projectmonai/monai bash

Please try this and let us know,

rfrs commented 9 months ago

Dear @diazandr3s, I apologise for the delayed answer. We have been installing MONAI via conda; maybe that influences it. We have not yet tested running it via Docker, but we will do so.

In the meantime I have another issue. We are testing different approaches to cloud training, using both compute instances and clusters in Azure. While for multiple GPUs the issue is as above, I can run Auto3DSeg on a compute instance without a glitch. When running it on a cluster I am having issues with exactly the same script and GPU configuration (one A100 unit).

The error is as follows:

(monai1)...:/media/tstoresearch02/...$ sudo /home/DMU/anaconda3/envs/monai1/bin/python Auto3DSeg/runnerCluster.py
sudo: unable to resolve host 6166697d9f19477f92ccb47484cc2744000: Name or service not known
Importing dependencies for Auto3DSeg.
MONAI version: 1.3.0
Numpy version: 1.26.0
Pytorch version: 2.1.1
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 865972f7a791bf7b42efbcd87c8402bd865b329e
MONAI __file__: /home/DMU/anaconda3/envs/monai1/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.2.0
scikit-image version: 0.22.0
scipy version: 1.11.4
Pillow version: 10.0.1
Tensorboard version: 2.15.1
gdown version: 4.7.1
TorchVision version: 0.16.1
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.6
pandas version: 2.1.4
einops version: 0.7.0
transformers version: 4.36.0
mlflow version: 2.9.1
pynrrd version: 1.0.0
clearml version: 1.13.2

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

Type of images being analysed:
ct
Path for the datalist json file:
Users/.../Auto3DSeg/Spine3_Auto3DSeg/ts_cluster_folds.json
Path for the working directory:
Users/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir
Path for the base directory:
Users/.../Auto3DSeg/Spine3_Auto3DSeg/
Choose the DNN architecture for training: DINTS (dinst), SegResNet (segresnet) or UNet (swinunetr):
net
2023-12-13 09:01:07,993 - INFO - AutoRunner using work directory /media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir
2023-12-13 09:01:08,191 - INFO - Datalist was copied to work_dir: /media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/ts_cluster_folds.json
2023-12-13 09:01:08,213 - INFO - Setting num_fold 2 based on the input datalist /media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/ts_cluster_folds.json.
2023-12-13 09:01:08,839 - INFO - Using user defined command running prefix , will override other settings

Maximum number of epochs:
100

Number of folds is set by the datalist file!

2023-12-13 09:01:12,795 - INFO - Skipping data analysis...
2023-12-13 09:01:12,795 - INFO - Skipping algorithm generation...
2023-12-13 09:01:12,903 - INFO - ['python', '/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/scripts/train.py', 'run', "--config_file='/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/hyper_parameters.yaml,/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/network.yaml,/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/transforms_infer.yaml,/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/transforms_train.yaml,/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/transforms_validate.yaml'", '--num_epochs_per_validation=2', '--num_images_per_batch=2', '--num_epochs=100', '--num_warmup_epochs=1']
  File "/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/scripts/train.py", line 97
    logger.debug(f"EarlyStopping counter: {self.counter} out of {self.patience}")
                                                                               ^
SyntaxError: invalid syntax
Traceback (most recent call last):
  File "/media/tstoresearch02/.../Auto3DSeg/runnerCluster.py", line 78, in <module>
    runner.run()
  File "/home/DMU/anaconda3/envs/monai1/lib/python3.9/site-packages/monai/apps/auto3dseg/auto_runner.py", line 806, in run
    self._train_algo_in_sequence(history)
  File "/home/DMU/anaconda3/envs/monai1/lib/python3.9/site-packages/monai/apps/auto3dseg/auto_runner.py", line 658, in _train_algo_in_sequence
    algo.train(self.train_params, self.device_setting)
  File "/home/DMU/anaconda3/envs/monai1/lib/python3.9/site-packages/monai/apps/auto3dseg/bundle_gen.py", line 278, in train
    return self._run_cmd(cmd)
  File "/home/DMU/anaconda3/envs/monai1/lib/python3.9/site-packages/monai/apps/auto3dseg/bundle_gen.py", line 255, in _run_cmd
    return run_cmd(cmd.split(), run_cmd_verbose=True, env=ps_environ, check=True)
  File "/home/DMU/anaconda3/envs/monai1/lib/python3.9/site-packages/monai/utils/misc.py", line 874, in run_cmd
    return subprocess.run(cmd_list, **kwargs)
  File "/home/DMU/anaconda3/envs/monai1/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', '/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/scripts/train.py', 'run', "--config_file='/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/hyper_parameters.yaml,/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/network.yaml,/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/transforms_infer.yaml,/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/transforms_train.yaml,/media/tstoresearch02/.../Auto3DSeg/Spine3_Auto3DSeg/work_dir/swinunetr_0/configs/transforms_validate.yaml'", '--num_epochs_per_validation=2', '--num_images_per_batch=2', '--num_epochs=100', '--num_warmup_epochs=1']' returned non-zero exit status 1.

So it seems there is a problem with the train.py script... but I have never encountered this when training on a compute instance. Would you have some suggestions? To run it on the cluster, I need sudo permissions to run the script, which I adapted from the Hello World tutorial. Your input is greatly appreciated.

Thank you for all.

Best wishes

diazandr3s commented 9 months ago

Hi @rfrs,

How many GPUs are you using here? Are they interconnected? What are the cluster specs?

Please try running the AutoRunner instead, as explained here: https://github.com/Project-MONAI/tutorials/tree/main/auto3dseg#1-run-with-minimal-input-using-autorunner

Let us know the output. I'd like to understand why you're getting this error.

barrettfletcher commented 1 week ago

Hi @diazandr3s, I am having a similar problem related to this thread. I am using Auto3DSeg in a shared cluster environment with access to either V100 or A100 GPUs with 80 GB (I've tried both). I was able to run the hello_world example without an issue, which is great! I then tried to run similar code using the Task04_Prostate data as a real-world example, but the run failed because the GPU ran out of memory. The image file sizes from the hello-world example are quite small compared to the images in the prostate data (I suspect this is because the hello-world images are primarily empty). From the MONAI tutorial it looks like it should be possible to run on a single GPU. Any ideas why this might be happening?

rfrs commented 1 week ago

Hi @barrettfletcher, the datasets usually come in the compressed .nii.gz format and are decompressed when loaded for the GPU, so always expect the vRAM needed to be at least 2-3x the dataset file size. Also, depending on the dataset's voxel scale, the smaller the voxel, the more vRAM is needed for computation. You can downscale the voxel size or load smaller patches.
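
For example, a rough sketch of resampling a volume to a coarser voxel size with MONAI transforms before training (paths and target spacing are placeholders):

from monai.transforms import Compose, LoadImaged, Spacingd, SaveImaged

# Resample one image to 2 mm isotropic voxels and save it; running this
# over the whole dataset shrinks the memory footprint per volume.
resample = Compose([
    LoadImaged(keys="image", ensure_channel_first=True),
    Spacingd(keys="image", pixdim=(2.0, 2.0, 2.0), mode="bilinear"),
    SaveImaged(keys="image", output_dir="./resampled", output_postfix="res"),
])
resample({"image": "imagesTr/case_000.nii.gz"})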

Cheers

barrettfletcher commented 1 week ago

Hi @rfrs, thanks so much for responding! What you said prompted me to go back to basics. I tried running the hello_world example but only changed the image dimensions from 64x64x64 to 100x512x512 (to resemble a real MRI, for example). The file sizes (in their .nii.gz form) changed from 15 KB to 461 KB, which is still quite small compared to real MRIs. I ran into the same issue :/ On the A100 80GB, I get this error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.53 GiB. GPU 0 has a total capacity of 79.15 GiB of which 715.75 MiB is free. Process 3147407 has 490.00 MiB memory in use. Including non-PyTorch memory, this process has 77.94 GiB memory in use. Of the allocated memory 76.32 GiB is allocated by PyTorch, and 1.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I would like to try loading smaller patches if you think that will fix this issue. I've looked around the Auto3DSeg documentation, but I can't find where the patch size is defined. Do you know where that might be?

Cheers, Fletcher