udiram commented 1 year ago

Describe the bug: running AutoRunner on the AMOS22 dataset completes data analysis but fails during training.

To Reproduce: run AutoRunner on the AMOS22 dataset.

Expected behavior: training runs with no internal fold errors.

Environment (please complete the following information):

Printing MONAI config...

MONAI version: 1.0.0+41.gd327088e
Numpy version: 1.23.4
Pytorch version: 1.13.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: d327088e690ec0899dda09db1f83d0bb8971a425
MONAI __file__: /home/exouser/.local/lib/python3.8/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.9
Nibabel version: 3.1.1
scikit-image version: 0.19.3
Pillow version: 9.2.0
Tensorboard version: 2.10.1
gdown version: 4.5.1
TorchVision version: 0.10.0+cu111
tqdm version: 4.63.0
lmdb version: 1.3.0
psutil version: 5.5.1
pandas version: 1.0.3
einops version: 0.4.1
transformers version: 4.23.1
mlflow version: 1.27.0
pynrrd version: 1.0.0

For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...

System: Linux
Linux version: Ubuntu 20.04.4 LTS
Platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.29
Processor: x86_64
Machine: x86_64
Python version: 3.8.10
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 32
Num logical CPUs: 32
Num usable CPUs: 32
CPU usage (%): [4.3, 4.6, 4.3, 4.9, 4.3, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.6, 4.3, 4.3, 4.3, 4.6, 4.3, 4.6, 5.3, 4.3, 4.6, 4.6, 4.6, 4.6, 4.9, 4.3, 4.6, 5.3, 100.0]
CPU freq. (MHz): UNKNOWN for given OS
Load avg. in last 1, 5, 15 mins (%): UNKNOWN for given OS
Disk usage (%): 25.1
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 122.8
Available memory (GB): 118.7
Used memory (GB): 3.0

================================
Printing GPU config...

Num GPUs: 1
Has CUDA: True
CUDA version: 11.7
cuDNN enabled: True
cuDNN version: 8500
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: GRID A100X-40C
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 108
GPU 0 Total memory (GB): 40.0
GPU 0 CUDA capability (maj.min): 8.0

Additional context: this continues after resolving the PATH errors from https://github.com/Project-MONAI/tutorials/issues/1126#issue-1512194481 by starting on an HPC Linux instance.

Error:

exouser@multi-modal:~/auto3dseg$ python -m monai.apps.auto3dseg AutoRunner run --input='./task.yaml'
2022-12-28 16:59:36,428 - INFO - Work directory ./work_dir is used to save all results
2022-12-28 16:59:36,428 - INFO - Loading ./task.yaml for AutoRunner and making a copy in /home/exouser/auto3dseg/work_dir/input.yaml
2022-12-28 16:59:36,435 - INFO - The output_dir is not specified. /home/exouser/auto3dseg/work_dir/ensemble_output will be used to save ensemble predictions
2022-12-28 16:59:36,435 - INFO - Found cached results and skipping data analysis...
2022-12-28 16:59:36,435 - INFO - Found cached results and skipping algorithm generation...
2022-12-28 16:59:36,447 - INFO - Launching: python /home/exouser/auto3dseg/work_dir/segresnet_0/scripts/train.py run --config_file='/home/exouser/auto3dseg/work_dir/segresnet_0/configs/transforms_train.yaml','/home/exouser/auto3dseg/work_dir/segresnet_0/configs/transforms_validate.yaml','/home/exouser/auto3dseg/work_dir/segresnet_0/configs/transforms_infer.yaml','/home/exouser/auto3dseg/work_dir/segresnet_0/configs/network.yaml','/home/exouser/auto3dseg/work_dir/segresnet_0/configs/hyper_parameters.yaml'
Traceback (most recent call last):
  File "/home/exouser/.local/lib/python3.8/site-packages/monai/apps/auto3dseg/bundle_gen.py", line 183, in _run_cmd
    normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True, capture_output=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', '/home/exouser/auto3dseg/work_dir/segresnet_0/scripts/train.py', 'run', "--config_file='/home/exouser/auto3dseg/work_dir/segresnet_0/configs/transforms_train.yaml','/home/exouser/auto3dseg/work_dir/segresnet_0/configs/transforms_validate.yaml','/home/exouser/auto3dseg/work_dir/segresnet_0/configs/transforms_infer.yaml','/home/exouser/auto3dseg/work_dir/segresnet_0/configs/network.yaml','/home/exouser/auto3dseg/work_dir/segresnet_0/configs/hyper_parameters.yaml'"]' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/exouser/.local/lib/python3.8/site-packages/monai/apps/auto3dseg/__main__.py", line 22, in <module>
    fire.Fire(
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/exouser/.local/lib/python3.8/site-packages/monai/apps/auto3dseg/auto_runner.py", line 586, in run
    self._train_algo_in_sequence(history)
  File "/home/exouser/.local/lib/python3.8/site-packages/monai/apps/auto3dseg/auto_runner.py", line 488, in _train_algo_in_sequence
    algo.train(self.train_params)
  File "/home/exouser/.local/lib/python3.8/site-packages/monai/apps/auto3dseg/bundle_gen.py", line 200, in train
    return self._run_cmd(cmd, devices_info)
  File "/home/exouser/.local/lib/python3.8/site-packages/monai/apps/auto3dseg/bundle_gen.py", line 188, in _run_cmd
    raise RuntimeError(f"subprocess call error {e.returncode}: {errors}, {output}") from e
RuntimeError: subprocess call error 1: b'Traceback (most recent call last):
  File "/home/exouser/auto3dseg/work_dir/segresnet_0/scripts/train.py", line 405, in <module>
    fire.Fire()
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/exouser/auto3dseg/work_dir/segresnet_0/scripts/train.py", line 86, in run
    if item["fold"] == fold:
KeyError: \'fold\'
', b'[info] number of GPUs: 1
[info] world_size: 1
'
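
The last frame of the traceback shows the generated train.py selecting datalist entries with `item["fold"] == fold`, so `KeyError: 'fold'` means at least one entry has no key spelled exactly `fold`. A minimal sketch for locating such entries before launching training, assuming the datalist is a dict with a "training" list of dicts; the helper name `find_entries_missing_key` is hypothetical and not part of MONAI:

```python
def find_entries_missing_key(datalist, key="fold", section="training"):
    """Return (index, sorted keys) for every entry in `section` lacking `key`."""
    missing = []
    for index, item in enumerate(datalist.get(section, [])):
        if key not in item:
            missing.append((index, sorted(item)))
    return missing


# Toy datalist (made-up file names) in which the second entry has a misspelled key.
example = {
    "training": [
        {"image": "a.nii.gz", "label": "a_seg.nii.gz", "fold": 0},
        {"image": "b.nii.gz", "label": "b_seg.nii.gz", "fold:": 0},
    ]
}
problems = find_entries_missing_key(example)
print(problems)  # [(1, ['fold:', 'image', 'label'])]
```

For a real datalist, the dict would come from `json.load` on the datalist file; printing the keys that are present makes near-miss spellings easy to spot.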

mingxin-zheng commented 1 year ago

Hi @udiram, can you also provide a snapshot of your datalist?

This is unrelated, but I noticed you are using an earlier version of MONAI (1.0.0). There are a couple of important Auto3DSeg fixes in later versions (1.0.1 and 1.1). If there is some reason to use an earlier MONAI specifically, I would recommend at least 1.0.1 to avoid some known bugs. Thanks!

udiram commented 1 year ago

Hi @mingxin-zheng, here are some entries from the datalist:

{
  "training": [
    { "image": "./imagesTr/amos_0001.nii.gz", "label": "./labelsTr/amos_0001.nii.gz", "fold:": 1 },
    { "image": "./imagesTr/amos_0004.nii.gz", "label": "./labelsTr/amos_0004.nii.gz", "fold:": 1 },
    { "image": "./imagesTr/amos_0005.nii.gz", "label": "./labelsTr/amos_0005.nii.gz", "fold:": 1 },
    { "image": "./imagesTr/amos_0006.nii.gz", "label": "./labelsTr/amos_0006.nii.gz", "fold:": 1 },
    ...

Additional entries cover the rest of the dataset, split into 5 equal folds. Test files are also specified, and look like this:

  "testing": [
    "./imagesTs/amos_0008.nii.gz",
    "./imagesTs/amos_0013.nii.gz",
    "./imagesTs/amos_0018.nii.gz",
    "./imagesTs/amos_0022.nii.gz",
    "./imagesTs/amos_0029.nii.gz",
    "./imagesTs/amos_0032.nii.gz",
    "./imagesTs/amos_0034.nii.gz",
    "./imagesTs/amos_0040.nii.gz",
    ...

My task yaml looks like this:

name: Task500_AMOS
task: segmentation
modality: CT
datalist: "task500_AMOS.json" # list of files
dataroot: "Task500_AMOS" # data location
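
Since the datalist paths are written relative to dataroot, a quick pre-flight check can rule out missing files as a cause. The sketch below is only an illustration under that assumption; `missing_files` is a hypothetical helper, not a MONAI function, and the demo uses a throwaway directory with made-up file names:

```python
import tempfile
from pathlib import Path


def missing_files(datalist, dataroot, section="training", keys=("image", "label")):
    """List paths from `section` that do not exist under `dataroot`."""
    root = Path(dataroot)
    missing = []
    for item in datalist.get(section, []):
        # Entries may be dicts (training) or bare path strings (testing).
        paths = [item] if isinstance(item, str) else [item[k] for k in keys if k in item]
        missing.extend(p for p in paths if not (root / p).exists())
    return missing


# Demo: only one of the two listed images exists on disk.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "imagesTr").mkdir()
    (Path(tmp) / "imagesTr" / "case_a.nii.gz").touch()
    demo = {
        "training": [
            {"image": "./imagesTr/case_a.nii.gz"},
            {"image": "./imagesTr/case_b.nii.gz"},
        ]
    }
    result = missing_files(demo, tmp)
print(result)  # ['./imagesTr/case_b.nii.gz']
```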

My directory is set up as follows:

exouser@multi-modal:~/auto3dseg$ ls
Task500_AMOS  task.yaml  task500_AMOS.json  work_dir

exouser@multi-modal:~/auto3dseg/Task500_AMOS$ ls
imagesTr  imagesTs  labelsTr

I just tried again after reinstalling the latest stable (not weekly) MONAI release and got the same output. As you said, the version is probably not what makes the difference, but I thought I would try and report back nonetheless.

mingxin-zheng commented 1 year ago

Can you also try to make the fold counting start from 0 in your datalist and see if it can fix the issue? Thanks @udiram
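
One way to try this without editing the file by hand is sketched below. It assumes each training entry already carries an integer "fold" value; `zero_index_folds` is a hypothetical helper name and the entries are toy data:

```python
def zero_index_folds(entries, key="fold"):
    """Return copies of `entries` with fold ids shifted so the smallest is 0.

    Raises KeyError if any entry lacks `key`.
    """
    offset = min(item[key] for item in entries)
    return [{**item, key: item[key] - offset} for item in entries]


entries = [
    {"image": "a.nii.gz", "fold": 1},
    {"image": "b.nii.gz", "fold": 2},
    {"image": "c.nii.gz", "fold": 5},
]
shifted = zero_index_folds(entries)
print([item["fold"] for item in shifted])  # [0, 1, 4]
```

The input list is left unchanged, so the original numbering can still be compared against the shifted one.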

udiram commented 1 year ago

Hi @mingxin-zheng, thanks for the input. While changing the folds to zero-based indexing I noticed the problem. Since there's no way to add folds to an existing MSD datalist, I had written a script to automate it. In doing so, I had written the key as "fold:" with an extra ':' inside the quotation marks, which is likely why it failed. I feel it would be helpful to include instructions for generating the datalist file in the AutoRunner tutorial. Could I make a PR for a Python/.ipynb script that generates that type of datalist?
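
For illustration, a minimal sketch of what such a generator could look like. It is not the script from this thread: it assumes an MSD-style dict with "training" and "testing" lists and assigns folds round-robin, which is only one possible splitting scheme. The name `add_folds` and the file names are hypothetical:

```python
import json


def add_folds(datalist, num_folds=5, section="training", key="fold"):
    """Return a copy of `datalist` whose `section` entries carry a 0-based fold id."""
    updated = dict(datalist)
    updated[section] = [
        {**item, key: index % num_folds}  # round-robin assignment (an assumption)
        for index, item in enumerate(datalist[section])
    ]
    return updated


# Toy MSD-style datalist with made-up case names.
msd_style = {
    "training": [
        {"image": f"./imagesTr/case_{i}.nii.gz", "label": f"./labelsTr/case_{i}.nii.gz"}
        for i in range(10)
    ],
    "testing": ["./imagesTs/case_10.nii.gz"],
}
with_folds = add_folds(msd_style)
print(json.dumps(with_folds["training"][:2], indent=2))
```

Building the key from a single `key` argument, rather than typing it into each entry, avoids the kind of stray-character mistake described above. The result can be written out with `json.dump`.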

Thanks again for all the help!

mingxin-zheng commented 1 year ago

Hi @udiram, happy to hear that. Please feel free to submit a PR when you have the bandwidth!

I would suggest a Jupyter notebook as it is straightforward to include in the tutorial repo.

udiram commented 1 year ago

Just submitted one! https://github.com/Project-MONAI/tutorials/pull/1129. Let me know if there's anything else that needs to be included.

Cheers

Project-MONAI / tutorials

AutoRunner Folds #1128
