Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

Improve the auto3dseg implementation #5085

Closed Nic-Ma closed 2 years ago

Nic-Ma commented 2 years ago

Is your feature request related to a problem? Please describe. We have merged the first version of the implementation and will test and refine it soon. This ticket tracks the feedback and task items for the improvement.

Nic-Ma commented 2 years ago

Hi @mingxin-zheng ,

I think fill_template_config() of the base class should be an abstract method that raises NotImplementedError, since we expect every subclass to implement it: https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/auto3dseg/bundle_gen.py#L130 Refer to: https://github.com/Project-MONAI/MONAI/blob/dev/monai/engines/workflow.py#L293

Also, some docstrings are missing the final full stop.

Thanks.
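A minimal sketch of the pattern suggested above, assuming hypothetical class names (`BaseAlgo`, `UNetAlgo`) and a hypothetical method signature; it is not the actual MONAI code:

```python
class BaseAlgo:
    """Hypothetical base class standing in for the algo base class."""

    def fill_template_config(self, data_stats_file, output_path, **kwargs):
        # The base class provides no implementation; subclasses must override.
        raise NotImplementedError(
            f"Subclass {self.__class__.__name__} must implement this method."
        )


class UNetAlgo(BaseAlgo):
    """Hypothetical subclass that provides the required implementation."""

    def fill_template_config(self, data_stats_file, output_path, **kwargs):
        # Return the filled values; a real implementation would write config files.
        return {"data_stats_file": data_stats_file, "output_path": output_path, **kwargs}
```

With this layout, calling the method on the base class fails loudly instead of silently doing nothing.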

Nic-Ma commented 2 years ago

Please describe the expected structure and APIs of the InferClass in the docstring: https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/auto3dseg/bundle_gen.py#L238

Thanks.

Nic-Ma commented 2 years ago

Please rename algo_zip to default_algo_zip; better still, make it an init option of the class.

Thanks.
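One way the init option could look, as a sketch only. The class name `BundleGenSketch`, the `algo_path` parameter, and the placeholder URL are assumptions for illustration; the real download location is not given in this thread:

```python
# Placeholder value; the actual template archive location is not stated here.
DEFAULT_ALGO_ZIP = "https://example.com/algo_templates.zip"


class BundleGenSketch:
    """Hypothetical generator that accepts the template archive as an init option."""

    def __init__(self, algo_path=".", default_algo_zip=DEFAULT_ALGO_ZIP):
        self.algo_path = algo_path
        # Callers can point to their own archive without editing the class.
        self.default_algo_zip = default_algo_zip
```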

Nic-Ma commented 2 years ago

It seems that *args and **kwargs are not used in the function. Should we remove them? https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/auto3dseg/bundle_gen.py#L374 CC @wyli .

Thanks.

dongyang0122 commented 2 years ago

If config files in the templates share the same keys, their values overwrite each other in the generated config file.
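To illustrate the overwriting problem, here is a sketch of a merge that refuses duplicate top-level keys instead of silently keeping the last value. The function name `merge_configs` and the file names are hypothetical, and the configs are assumed to be already parsed into dicts:

```python
def merge_configs(configs):
    """Merge parsed config dicts, raising if two files define the same key.

    `configs` maps a file name to its parsed content (a dict).
    """
    merged = {}
    origin = {}  # remembers which file first defined each key
    for name, cfg in configs.items():
        for key, value in cfg.items():
            if key in merged:
                raise KeyError(
                    f"key '{key}' is defined in both '{origin[key]}' and '{name}'"
                )
            merged[key] = value
            origin[key] = name
    return merged
```

A plain `dict.update` over the same inputs would keep only the value from the last file, which is the behavior reported above.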

dongyang0122 commented 2 years ago

Please use absolute paths for the bundle root, data list, and data root when generating the configs. Otherwise the bundle cannot be run from inside the bundle folder.
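A sketch of resolving these paths at generation time. The key names in `PATH_KEYS` and the function name are assumptions; the actual config keys may differ:

```python
import os

# Assumed key names for the path entries mentioned above.
PATH_KEYS = ("bundle_root", "data_list", "data_root")


def with_absolute_paths(config, base_dir=None):
    """Return a copy of `config` whose path entries are absolute."""
    base_dir = os.path.abspath(base_dir or os.getcwd())
    resolved = dict(config)
    for key in PATH_KEYS:
        if key in resolved:
            # os.path.join ignores base_dir when the value is already absolute.
            resolved[key] = os.path.normpath(os.path.join(base_dir, resolved[key]))
    return resolved
```

Because the stored values no longer depend on the working directory, the generated bundle can be launched from any folder.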

dongyang0122 commented 2 years ago

It seems that after algo generation, the fold value is always 0 and is not overwritten?

mingxin-zheng commented 2 years ago

Reference from chatting with @tangy5 . In one experiment with CT images (512x512x100), SegResNet2D training reported a dimension error. The problem may relate to the slicing transformation, but more details need to be provided.

tangy5 commented 2 years ago

> It seems that after algo generation, the fold value is always 0 and is not overwritten?

Same question here. Our current design generates a single bundle for all 5 fold experiments, right? How will the fold be determined?

tangy5 commented 2 years ago

> Reference from chatting with @tangy5 . In one experiment with CT images (512x512x100), SegResNet2D training reported a dimension error. The problem may relate to the slicing transformation, but more details need to be provided.

Yup, I ran into some errors with the 2D model on the BTCV multi-organ segmentation task (14 classes). I will provide more details on the 2D model tomorrow.

KumoLiu commented 2 years ago

For parameters that change depending on whether multiple GPUs are used, the current practice is to generate them separately and store them in 10 folders (for 5-fold CV). Can we store both the multi-GPU and single-GPU parameters and then use the hyperparameters to choose which set to use when training?
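A sketch of the proposal above: keep both parameter sets in one config and pick one at training time. The section names `single_gpu` and `multi_gpu`, the function name, and the example values are all hypothetical:

```python
GPU_SECTIONS = ("single_gpu", "multi_gpu")  # assumed section names


def select_gpu_params(hyper_params, num_gpus):
    """Flatten the config by applying the section that matches the GPU count."""
    mode = "multi_gpu" if num_gpus > 1 else "single_gpu"
    # Start from the shared parameters, leaving out both GPU-specific sections.
    selected = {k: v for k, v in hyper_params.items() if k not in GPU_SECTIONS}
    selected.update(hyper_params.get(mode, {}))
    return selected


hyper_params = {
    "num_epochs": 100,
    "single_gpu": {"learning_rate": 1e-4, "batch_size": 2},
    "multi_gpu": {"learning_rate": 2e-4, "batch_size": 4},
}
```

Under this layout a 5-fold run would need 5 folders rather than 10, since each bundle carries both variants.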

KumoLiu commented 2 years ago

Currently, for multi-channel data, DataAnalyzer takes the average over the channels (e.g. for intensity). Perhaps the value of each channel should be put in a list instead. In that case, algo.py in each algorithm may also need to be updated.
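The difference between the two reporting styles can be sketched with plain Python lists. This is an illustration only and does not use the DataAnalyzer API; the function names and the toy image are hypothetical:

```python
from statistics import mean


def intensity_mean_per_channel(image):
    """Return one mean per channel; `image` is a list of flat voxel lists."""
    return [mean(channel) for channel in image]


def intensity_mean_averaged(image):
    """Collapse the per-channel means into one number (the averaging behavior)."""
    return mean(intensity_mean_per_channel(image))


# Two channels with very different intensity ranges.
image = [[0.0, 2.0], [10.0, 30.0]]
```

For this toy input the per-channel list is `[1.0, 20.0]`, while the single averaged value is `10.5`, which describes neither channel well.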

mingxin-zheng commented 2 years ago

@dongyang0122 @tangy5 The current folder design is to generate one bundle for each fold. So there will be 5 folders for each network (unet_0, unet_1, ...).

KumoLiu commented 2 years ago

Hi @mingxin-zheng, I think what @dongyang0122 and @tangy5 mean is that the fold value in "algo_config.yaml" is not overwritten, which causes mistakes when selecting data by fold and when deriving other configs from the fold.

https://github.com/KumoLiu/research-contributions/blob/683de594fd76a23f4d8e37c8d56aa4638d523d3c/auto3dseg/algo_templates/segresnet/configs/hyper_parameters.yaml#L10

https://github.com/KumoLiu/research-contributions/blob/683de594fd76a23f4d8e37c8d56aa4638d523d3c/auto3dseg/algo_templates/segresnet/scripts/train.py#L88-L93

https://github.com/KumoLiu/research-contributions/blob/683de594fd76a23f4d8e37c8d56aa4638d523d3c/auto3dseg/algo_templates/segresnet/configs/hyper_parameters.yaml#L4

mingxin-zheng commented 2 years ago

Agree with @KumoLiu . The root causes are (1) the generate function of BundleGen is not passing the fold_idx and (2) fill_template_config in algo.py is not accepting override params.

I created PR #5087 to fix this issue along with 2 other issues in BundleGen.
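The two root causes can be pictured with a small sketch in which the generator passes the fold index and the algo accepts overrides. `AlgoSketch` and this `generate` function are hypothetical stand-ins, not the code from the PR:

```python
import os


class AlgoSketch:
    """Hypothetical algo whose template config accepts override params."""

    def __init__(self, name, template_config):
        self.name = name
        self.template_config = dict(template_config)

    def fill_template_config(self, **override):
        # Overrides (such as the fold) replace the template defaults.
        config = dict(self.template_config)
        config.update(override)
        return config


def generate(algo, output_folder, num_fold=5):
    """Create one config per fold, keyed by a folder such as unet_0 ... unet_4."""
    bundles = {}
    for fold_idx in range(num_fold):
        bundle_dir = os.path.join(output_folder, f"{algo.name}_{fold_idx}")
        # Passing fold_idx here is what keeps every bundle from staying at fold 0.
        bundles[bundle_dir] = algo.fill_template_config(fold=fold_idx)
    return bundles
```

Each generated folder then carries its own fold value instead of the template default.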

Nic-Ma commented 2 years ago

I suggest adding a metadata.json to every algo template for future extension and management.

Thanks.

KumoLiu commented 2 years ago

When training on multiple GPUs with the command torchrun --nnodes=1 --nproc_per_node=2 scripts/train.py run --config_file configs/algo_config.yaml, training gets stuck after a few epochs with the error below:

RuntimeError: DataLoader worker (pid(s) 1052) exited unexpectedly
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 412 closing signal SIGTERM
Traceback (most recent call last):
 File "/opt/conda/bin/torchrun", line 33, in <module>
  sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
 File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
  return f(*args, **kwargs)
 File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
  run(args)
 File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
  elastic_launch(
 File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
  return launch_agent(self._config, self._entrypoint, list(args))
 File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
  raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

KumoLiu commented 2 years ago

There is no input.yaml in any folder, although it is required as the data_src_cfg_name argument of BundleGen. It may need to be added to each template.

mingxin-zheng commented 2 years ago

@KumoLiu do you mean the data_src_cfg ?

KumoLiu commented 2 years ago

> @KumoLiu do you mean the data_src_cfg ?

Yes, I mean this one. https://github.com/Project-MONAI/MONAI/blob/b1aa40dc70ab6e2f75ddf3327da97102bdd7546a/monai/apps/auto3dseg/bundle_gen.py#L315

tangy5 commented 2 years ago

BTCV Benchmarking Progress:

SwinUNETR and SegResNet 5 fold running:

[Screenshot, 2022-09-05 11:04:15 AM]

Fig: fold 1 Demo. Red: SwinUNETR, Blue: SegResNet

Full run: https://tensorboard.dev/experiment/91yiQ3XqRNm53dSGgRcwCA/#scalars&_smoothingWeight=0&runSelectionState=eyJTd2luVU5FVFIvbW9kZWxfZm9sZDAvRXZlbnRzIjpmYWxzZX0%3D&regexInput=1

mingxin-zheng commented 2 years ago

MSD Task 04 was completed earlier today. All models except SwinUNETR were trained successfully on the hippocampus dataset. Because of the fold index bug, only fold 0 was trained, repeated 5 times. The best validation metrics are in the range of 0.88-0.90.

mingxin-zheng commented 2 years ago

> > Reference from chatting with @tangy5 . In one experiment with CT images (512x512x100), SegResNet2D training reported a dimension error. The problem may relate to the slicing transformation, but more details need to be provided.

> Yup, I ran into some errors with the 2D model on the BTCV multi-organ segmentation task (14 classes). I will provide more details on the 2D model tomorrow.

@tangy5 I'm investigating this issue as well. Did the error appear at the beginning of training, or after a few epochs? Please also share the data source config (aka "input.yaml") here. Thanks

dongyang0122 commented 2 years ago

I have finished training 15 models for MSD Task05 Prostate. Performance looks reasonable (shown below), and the 2D model works best, as expected.

[Image: model performance on MSD Task05 Prostate]

tangy5 commented 2 years ago

[Screenshot, 2022-09-07 19:38:20]

All algos are done. SegResNet works best; the 2D model performs worse, as expected, due to small organs such as the adrenal gland. Thanks

wyli commented 2 years ago

The config now supports a _desc_ keyword for textual descriptions to aid readability, so it might be useful to include some comments in the templates:

image_key: image
label_key: label
network:
  _target_: UNet
  _desc_: "my testing network with batchnorm"
  spatial_dims: 3
  in_channels: 1
  out_channels: 2
  channels: [16, 32, 64, 128, 256]
  strides: [2, 2, 2, 2]
  num_res_units: 2
  norm: batch

Nic-Ma commented 2 years ago

Hi @mingxin-zheng @dongyang0122 ,

For the next release, the main new feature is applying experiment management to auto3dseg. It is tracked in the ticket https://github.com/Project-MONAI/MONAI/issues/4903, so let's close this ticket.

Thanks.