Project-MONAI / tutorials

MONAI Tutorials
https://monai.io/started.html
Apache License 2.0

Error encountered while running Hecktor tutorial for Head and Neck tumor segmentation #1246

Closed BSonya closed 1 year ago

BSonya commented 1 year ago

Hi. I am trying to use the Hecktor tutorial for Head and Neck tumor segmentation on the Hecktor dataset. As a beginner, I decided to start with a small part of the dataset, just 5 pairs of CT and PET images (one in each training fold) for practice. I have also specified data for inference in the datalist file. However, after training all 5 folds, I encountered the following error:

=> loaded checkpoint /content/tutorials/auto3dseg/tasks/hecktor22/work_dir/segresnet_0/model/model.pt (epoch 0) (best_metric 0.0022504243534058332)
Total parameters count 87165548 distributed False
Using custom transforms {'after_resample_transforms': [<hecktor_crop_neck_region.HecktorCropNeckRegion object at 0x7f8e40183970>]}
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/transform.py", line 102, in apply_transform
    return _apply_transform(transform, data, unpack_items)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/transform.py", line 66, in _apply_transform
    return transform(parameters)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/spatial/dictionary.py", line 429, in inverse
    d[key] = self.spacing_transform.inverse(d[key])
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/spatial/array.py", line 620, in inverse
    return self.sp_resample.inverse(data)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/spatial/array.py", line 325, in inverse
    transform = self.pop_transform(data)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/inverse.py", line 206, in pop_transform
    return self.get_most_recent_transform(data, key, check, pop=True)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/inverse.py", line 188, in get_most_recent_transform
    self.check_transforms_match(all_transforms[-1])
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/inverse.py", line 155, in check_transforms_match
    raise RuntimeError(
RuntimeError: Error SpatialResample getting the most recently applied invertible transform HecktorCropNeckRegion 140248937412976 != 140248937381552.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/transform.py", line 102, in apply_transform
    return _apply_transform(transform, data, unpack_items)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/transform.py", line 66, in _apply_transform
    return transform(parameters)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/post/dictionary.py", line 701, in __call__
    inverted = self.transform.inverse(input_dict)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/compose.py", line 184, in inverse
    data = apply_transform(t.inverse, data, self.map_items, self.unpack_items, self.log_stats)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/transform.py", line 129, in apply_transform
    raise RuntimeError(f"applying transform {transform}") from e
RuntimeError: applying transform <bound method Spacingd.inverse of <monai.transforms.spatial.dictionary.Spacingd object at 0x7f8e4017bfa0>>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/auto3dseg/__main__.py", line 22, in <module>
    fire.Fire(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/lib/python3.8/dist-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/auto3dseg/auto_runner.py", line 703, in run
    preds = ensembler(pred_param=self.pred_params)
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/auto3dseg/ensemble_builder.py", line 160, in __call__
    pred = infer_instance.predict(predict_files=[file], predict_params=param)
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/auto3dseg/bundle_gen.py", line 269, in predict
    return [inferer.infer(f) for f in ensure_tuple(predict_files)]
  File "/usr/local/lib/python3.8/dist-packages/monai/apps/auto3dseg/bundle_gen.py", line 269, in <listcomp>
    return [inferer.infer(f) for f in ensure_tuple(predict_files)]
  File "/content/tutorials/auto3dseg/tasks/hecktor22/work_dir/segresnet_0/scripts/infer.py", line 33, in infer
    pred = self.segmenter.infer_image(image_file, save_mask=False)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/content/tutorials/auto3dseg/tasks/hecktor22/work_dir/segresnet_0/scripts/segmenter.py", line 1132, in infer_image
    pred = [post_transforms(x)["pred"] for x in decollate_batch(batch_data)]
  File "/content/tutorials/auto3dseg/tasks/hecktor22/work_dir/segresnet_0/scripts/segmenter.py", line 1132, in <listcomp>
    pred = [post_transforms(x)["pred"] for x in decollate_batch(batch_data)]
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/compose.py", line 174, in __call__
    input_ = apply_transform(_transform, input_, self.map_items, self.unpack_items, self.log_stats)
  File "/usr/local/lib/python3.8/dist-packages/monai/transforms/transform.py", line 129, in apply_transform
    raise RuntimeError(f"applying transform {transform}") from e
RuntimeError: applying transform <monai.transforms.post.dictionary.Invertd object at 0x7f8e4017bb50>

Can you tell me what other changes I need to make in order to run inference on the test set?

wyli commented 1 year ago

Looks like it's an issue with inverting the customised transform HecktorCropNeckRegion; perhaps you can remove the transform from the inference preprocessing.
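For context, the RuntimeError in the traceback comes from a bookkeeping check: each transform records itself on the data when applied, and each inverse expects its own record to be the most recent one. Below is a toy model of that check in plain Python. It is not MONAI's implementation; `ToyTransform`, `run_and_invert` and the `invertible` flag are hypothetical names used only to illustrate one way such a mismatch can arise (a custom transform leaves its record on the stack and nothing removes it), and why dropping that transform from the inference preprocessing avoids the error.

```python
class ToyTransform:
    """Hypothetical stand-in: records itself when applied, checks the record on inverse."""

    def __init__(self, name, invertible=True):
        self.name = name
        self.invertible = invertible

    def __call__(self, data):
        # Every transform leaves a trace, even if nothing will undo it later.
        data["applied"].append((self.name, id(self)))
        return data

    def inverse(self, data):
        # The most recent record must belong to this transform instance.
        name, ident = data["applied"][-1]
        if ident != id(self):
            raise RuntimeError(
                f"Error {self.name} getting the most recently applied "
                f"invertible transform {name} {ident} != {id(self)}."
            )
        data["applied"].pop()
        return data


def run_and_invert(transforms):
    """Apply all transforms, then invert the invertible ones in reverse order."""
    data = {"applied": []}
    for t in transforms:
        data = t(data)
    for t in reversed(transforms):
        if t.invertible:
            data = t.inverse(data)
    return data


spacing = ToyTransform("Spacing")
crop = ToyTransform("CustomCrop", invertible=False)  # assumption: never inverted

# Without the custom crop the stack unwinds cleanly.
clean = run_and_invert([spacing])

# With it, the crop's record stays on top and the Spacing check fails.
try:
    run_and_invert([spacing, crop])
    failed, message = False, ""
except RuntimeError as err:
    failed, message = True, str(err)

print("clean stack:", clean["applied"])
print("failed with crop:", failed)
```

In this toy setup, the second run fails with a message shaped like the one in the traceback above, while the run without the crop completes.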

BSonya commented 1 year ago

Thank you for your response. I am now trying to train on the complete Hecktor dataset (30GB) using a single 12GB GPU. However, the training process either terminates due to a VS Code crash or takes an excessively long time (8-10 hours per epoch). I have attempted to address this by adding a 12GB RAM module. This increased the training speed when num_workers = 4, but it still resulted in crashes of VS Code or the entire system. When num_workers = 0, the training progresses further, but it remains time-consuming.

Environment:

OS --> ubuntu 22.04.2

Python version --> 3.10.11

MONAI version --> 1.1.0

Pytorch --> 2.0.0+CU117

GPU models and configuration:

(hecktor22) dlrs@spml1:~$ nvidia-smi
Tue Jun 20 12:19:47 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0  On |                  N/A |
|  0%   41C    P5    23W / 275W |    199MiB / 11264MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1024      G   /usr/lib/xorg/Xorg                 90MiB |
|    0   N/A  N/A      1310      G   /usr/bin/gnome-shell               29MiB |
|    0   N/A  N/A      1429      G   ...mviewer/tv_bin/TeamViewer       21MiB |
|    0   N/A  N/A      6276      G   gnome-control-center                2MiB |
|    0   N/A  N/A      6417      G   ...754442497770707874,262144       52MiB |
+-----------------------------------------------------------------------------+

CPU Details:

(hecktor22) dlrs@spml1:~$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
    CPU family:          6
    Model:               26
    Thread(s) per core:  1
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            5
    BogoMIPS:            4788.22
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm pti ssbd ibrs ibpb stibp dtherm ida flush_l1d
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    8 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-3
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Unknown: No mitigations
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

wyli commented 1 year ago

Sure, are you using this example? https://github.com/Project-MONAI/tutorials/tree/main/auto3dseg/tasks/hecktor22

The latest MONAI v1.2.0 has various optimizations in terms of training speed and memory usage, so please consider upgrading to MONAI 1.2:

pip install -U monai
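After upgrading, it may help to confirm which MONAI version the active environment actually picks up before launching a long training run. The snippet below is a small sketch using only the standard library; the helper names `release_tuple`, `meets_minimum` and `installed_monai_version` are made up for illustration, and the comparison deliberately looks only at the numeric release part, so a pre-release or development build such as `1.2.0rc7+...` is treated the same as `1.2.0`.

```python
import re
from importlib import metadata


def release_tuple(version):
    """Return the leading numeric release part of a version string as a 3-tuple."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def meets_minimum(installed, minimum="1.2.0"):
    """Compare numeric release parts only; rc/dev/local suffixes are ignored."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_monai_version():
    """Return the installed 'monai' distribution version, or None if not installed."""
    try:
        return metadata.version("monai")
    except metadata.PackageNotFoundError:
        return None


found = installed_monai_version()
if found is None:
    print("monai is not installed in this environment")
elif meets_minimum(found):
    print(f"monai {found} looks recent enough")
else:
    print(f"monai {found} is older than 1.2.0; consider `pip install -U monai`")
```

`python -c "import monai; monai.config.print_config()"` gives a fuller report, including optional dependencies.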

BSonya commented 1 year ago

Yes, I'm using the same example. I do have a different conda environment where I have installed MONAI 1.2. See the details below:

(hecktor22a) dlrs@spml1:~$ python -c "import monai; monai.config.print_config()"
MONAI version: 1.2.0rc7+13.gf355d1fc
Numpy version: 1.24.3
Pytorch version: 2.0.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: f355d1fc38c50e83e4f20ffe870c19a5814dc501
MONAI file: /home/dlrs/MONAI/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.21.0
Pillow version: 9.5.0
Tensorboard version: 2.13.0
gdown version: 4.7.1
TorchVision version: 0.15.2+cu117
tqdm version: 4.65.0
lmdb version: 1.4.1
psutil version: 5.9.5
pandas version: 2.0.2
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.4.0
pynrrd version: 1.0.0

When I switch to this env, the following error appears:

(hecktor22a) dlrs@spml1:~/Desktop/tutorials/auto3dseg/tasks/hecktor22$ python hecktor22.py
2023-06-20 14:52:24,122 - INFO - AutoRunner using work directory ./work_dir
2023-06-20 14:52:24,126 - INFO - Loading input config input.yaml
2023-06-20 14:52:24,141 - INFO - Datalist was copied to work_dir: /home/dlrs/Desktop/tutorials/auto3dseg/tasks/hecktor22/work_dir/hecktor22_folds.json
2023-06-20 14:52:24,144 - INFO - Setting num_fold 5 based on the input datalist /home/dlrs/Desktop/tutorials/auto3dseg/tasks/hecktor22/work_dir/hecktor22_folds.json.
2023-06-20 14:52:24,269 - INFO - Skipping data analysis...
2023-06-20 14:52:24,269 - INFO - Skipping algorithm generation...
Traceback (most recent call last):
  File "/home/dlrs/Desktop/tutorials/auto3dseg/tasks/hecktor22/hecktor22.py", line 26, in <module>
    runner.run()
  File "/home/dlrs/MONAI/monai/apps/auto3dseg/auto_runner.py", line 804, in run
    self._train_algo_in_sequence(history)
  File "/home/dlrs/MONAI/monai/apps/auto3dseg/auto_runner.py", line 656, in _train_algo_in_sequence
    algo.train(self.train_params, self.device_setting)
  File "/home/dlrs/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 261, in train
    self.device_setting.update(device_setting)
AttributeError: 'SegresnetAlgo' object has no attribute 'device_setting'

wyli commented 1 year ago

@myron could you please help? it seems the tutorial here https://github.com/Project-MONAI/tutorials/tree/main/auto3dseg/tasks/hecktor22 should be updated for v1.2...

BSonya commented 1 year ago

@myron do I need to increase RAM or something? Kindly help

myron commented 1 year ago

@BSonya hi, hold on for 1-2 days, it's fixed, just needs to be updated on github

myron commented 1 year ago

@BSonya so if you want to run it sooner, you can

1) do a clean start (remove the previously generated folders in /content/tutorials/auto3dseg/tasks/hecktor22/)
2) use the updated tutorial configs folder https://github.com/Project-MONAI/tutorials/tree/main/auto3dseg/tasks/hecktor22
3) you'll also need the updated algo template files, so clone/download the "algorithm_templates" folder manually: https://github.com/Project-MONAI/research-contributions/tree/main/auto3dseg/algorithm_templates

and when training, specify a path to it, e.g.

python -m monai.apps.auto3dseg AutoRunner run --input='./input.yaml' --algos='segresnet' --templates_path_or_url=/path/to/downloaded/algorithm_templates
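The clean start and the training command above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper name `prepare_clean_run` is made up, `work_dir` is assumed to be the only generated folder (it is the work directory named in the logs earlier in this thread), and the templates path is the placeholder from the command above. The demo at the bottom runs against a temporary directory and only builds the command rather than launching training.

```python
import shutil
import sys
import tempfile
from pathlib import Path


def prepare_clean_run(task_dir, templates_dir, generated=("work_dir",)):
    """Remove previously generated folders and return the AutoRunner command.

    `generated` is an assumption: adjust it to whatever folders an earlier
    run left behind in the task directory.
    """
    task_dir = Path(task_dir)
    for name in generated:
        # ignore_errors=True makes this a no-op when the folder is absent.
        shutil.rmtree(task_dir / name, ignore_errors=True)
    return [
        sys.executable, "-m", "monai.apps.auto3dseg", "AutoRunner", "run",
        "--input=./input.yaml",
        "--algos=segresnet",
        f"--templates_path_or_url={templates_dir}",
    ]


# Demo on a throwaway directory that mimics a stale work_dir.
with tempfile.TemporaryDirectory() as tmp:
    stale = Path(tmp) / "work_dir" / "segresnet_0"
    stale.mkdir(parents=True)
    cmd = prepare_clean_run(tmp, "/path/to/downloaded/algorithm_templates")
    stale_removed = not (Path(tmp) / "work_dir").exists()

print("stale work_dir removed:", stale_removed)
print(" ".join(cmd))
```

To actually start training, the returned list could be passed to `subprocess.run(cmd, cwd=task_dir, check=True)` from the real tutorial folder.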