Closed Nic-Ma closed 2 years ago
Hi @mingxin-zheng ,
I think fill_template_config() of the base class should be an abstract method that raises NotImplementedError, since we expect every subclass to implement it:
https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/auto3dseg/bundle_gen.py#L130
Refer to:
https://github.com/Project-MONAI/MONAI/blob/dev/monai/engines/workflow.py#L293
Also, some docstrings are missing the final full stop.
Thanks.
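For illustration, a minimal sketch of the pattern suggested above. The class names AlgoBase and MyAlgo are hypothetical stand-ins and are not the actual classes in bundle_gen.py:

```python
class AlgoBase:
    """Hypothetical base class standing in for the bundle-generation base class."""

    def fill_template_config(self, data_stats):
        """Fill the template config from data statistics; subclasses must override."""
        raise NotImplementedError(
            f"Subclass {self.__class__.__name__} must implement fill_template_config()."
        )


class MyAlgo(AlgoBase):
    """Hypothetical subclass that provides the required implementation."""

    def fill_template_config(self, data_stats):
        # Return a config dict derived from the (illustrative) statistics.
        return {"num_classes": data_stats["num_classes"]}
```

Calling the method on the base class fails loudly, while the subclass returns its own config.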
Please describe the expected structure and APIs of the InferClass in the docstring:
https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/auto3dseg/bundle_gen.py#L238
Thanks.
Please rename algo_zip to default_algo_zip; better still, make it an init option of the class.
Thanks.
It seems the *args, **kwargs are not used in the function; should we remove them?
https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/auto3dseg/bundle_gen.py#L374
CC @wyli .
Thanks.
If config files in the templates share the same keys, their values overwrite one another in the generated config file.
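One possible way to surface this problem is to fail on duplicate keys instead of silently overwriting. This is only a sketch: merge_configs is a hypothetical helper, not part of MONAI, and raising an error is just one of several reasonable policies:

```python
def merge_configs(configs):
    """Merge named config dicts, raising if two files define the same top-level key."""
    merged = {}
    origin = {}  # remembers which file each key came from
    for name, cfg in configs.items():
        for key, value in cfg.items():
            if key in merged:
                raise KeyError(f"key '{key}' defined in both {origin[key]} and {name}")
            merged[key] = value
            origin[key] = name
    return merged
```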
Please use absolute paths for the bundle root, data list, and data root when generating the configs; otherwise the bundle cannot be run from inside the bundle folder.
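A rough sketch of what resolving these paths at generation time could look like. The helper absolutize_paths and the key names bundle_root, datalist, and dataroot are assumptions for illustration, not the confirmed keys of the generated configs:

```python
import os


def absolutize_paths(config, keys=("bundle_root", "datalist", "dataroot")):
    """Return a copy of config with the given path entries made absolute."""
    resolved = dict(config)
    for key in keys:
        if key in resolved:
            # abspath resolves relative to the current working directory,
            # so this should run where the relative paths are valid.
            resolved[key] = os.path.abspath(resolved[key])
    return resolved
```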
It seems that after algo generation, the fold value is always 0 and is never overwritten?
From a chat with @tangy5: in one experiment with CT images (512x512x100), SegResNet2D training reported a dimension error. The problem may be related to the slicing transformation, but more details are needed.
It seems that after algo generation, the fold value is always 0 and is never overwritten?
Same question here. Our current design generates a single bundle for all 5-fold experiments, right? How will the fold be determined?
From a chat with @tangy5: in one experiment with CT images (512x512x100), SegResNet2D training reported a dimension error. The problem may be related to the slicing transformation, but more details are needed.
Yup, I ran into some errors with the 2D model on the BTCV multi-organ segmentation task (14 classes). I will provide more details on the 2D model tomorrow.
For parameters that depend on whether multiple GPUs are used, the current practice is to generate them separately and store them in 10 folders (for 5-fold CV). Can we store both the multi-GPU and single-GPU parameters and then use the hyperparameters to choose which set to use during training?
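To make the proposal concrete, here is a minimal sketch of choosing a parameter set at training time. The function select_gpu_params and the section names common, single_gpu, and multi_gpu are hypothetical and only illustrate the idea:

```python
def select_gpu_params(hyper_params, num_gpus):
    """Pick the parameter set matching the number of GPUs used for training."""
    key = "multi_gpu" if num_gpus > 1 else "single_gpu"
    # Shared values first, then the GPU-specific values override them.
    return {**hyper_params.get("common", {}), **hyper_params[key]}


# Illustrative values only; not taken from any real template.
example_params = {
    "common": {"num_epochs": 100},
    "single_gpu": {"batch_size": 2},
    "multi_gpu": {"batch_size": 8},
}
```

With this layout a single folder per fold would be enough, since the same file serves both launch modes.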
Currently, for multi-channel data, DataAnalyzer takes the average over the channels (e.g. for intensity); perhaps the value for each channel should be stored in a list instead. In that case, algo.py in each algorithm may also need to be updated.
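As a small illustration of the difference, the sketch below computes one mean per channel instead of a single pooled value. It uses plain Python lists and a hypothetical per_channel_mean helper; it is not how DataAnalyzer is implemented:

```python
from statistics import mean


def per_channel_mean(image):
    """Return one mean intensity per channel for a channel-first nested list."""
    return [mean(channel) for channel in image]


# Two channels with very different intensity ranges (made-up values).
example_image = [[0.0, 2.0], [10.0, 30.0]]
channel_means = per_channel_mean(example_image)
pooled_mean = mean(v for channel in example_image for v in channel)
```

Here channel_means keeps the two ranges apart, while pooled_mean blends them into one number.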
@dongyang0122 @tangy5 The current folder design is to generate one bundle for each fold. So there will be 5 folders for each network (unet_0, unet_1, ...).
Hi @mingxin-zheng, I think what @dongyang0122 and @tangy5 mean is that the fold value in "algo_config.yaml" is not overwritten, which causes mistakes when the data and some configs are selected based on fold.
https://github.com/KumoLiu/research-contributions/blob/683de594fd76a23f4d8e37c8d56aa4638d523d3c/auto3dseg/algo_templates/segresnet/scripts/train.py#L88-L93
https://github.com/KumoLiu/research-contributions/blob/683de594fd76a23f4d8e37c8d56aa4638d523d3c/auto3dseg/algo_templates/segresnet/configs/hyper_parameters.yaml#L4
Agree with @KumoLiu. The root causes are (1) the BundleGen function generate is not passing the fold_idx and (2) fill_template_config in algo.py is not accepting override params.
I created PR #5087 to fix this issue along with 2 other issues in BundleGen.
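The two root causes can be illustrated with a simplified sketch. The SegAlgo class and the generate function below are hypothetical and are not the code from the PR; they only show a fold index being passed through as an override:

```python
class SegAlgo:
    """Hypothetical algorithm template wrapper; not the actual algo.py class."""

    def __init__(self):
        # Template defaults; fold starts at 0 as in the reported behaviour.
        self.config = {"fold": 0, "num_epochs": 100}

    def fill_template_config(self, data_stats, **override):
        """Return the template config with caller-supplied overrides applied."""
        config = dict(self.config)
        config.update(override)
        return config


def generate(algo, num_fold=5):
    """Produce one config per fold, passing the fold index as an override."""
    return [algo.fill_template_config({}, fold=fold_idx) for fold_idx in range(num_fold)]


fold_configs = generate(SegAlgo())
```

Each generated config then carries its own fold value instead of the template default.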
I suggest adding a metadata.json to every algo template for future extension and management.
Thanks.
When training on multiple GPUs with the command
torchrun --nnodes=1 --nproc_per_node=2 scripts/train.py run --config_file configs/algo_config.yaml
training gets stuck after a few epochs with the error below:
RuntimeError: DataLoader worker (pid(s) 1052) exited unexpectedly
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 412 closing signal SIGTERM
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
There is no input.yaml in any folder, yet one must be passed as the data_src_cfg_name argument to BundleGen; it may need to be added to each template.
@KumoLiu do you mean the data_src_cfg ?
@KumoLiu do you mean the data_src_cfg ?
Yes, I mean this one. https://github.com/Project-MONAI/MONAI/blob/b1aa40dc70ab6e2f75ddf3327da97102bdd7546a/monai/apps/auto3dseg/bundle_gen.py#L315
BTCV Benchmarking Progress:
SwinUNETR and SegResNet 5 fold running:
Fig: fold 1 Demo. Red: SwinUNETR, Blue: SegResNet
MSD Task 04 was completed earlier today. All models except SwinUNETR were trained successfully on the hippocampus datasets. Because of the fold index bug, only fold 0 was trained and repeated 5 times. The validation best metrics are in the range of 0.88-0.90.
From a chat with @tangy5: in one experiment with CT images (512x512x100), SegResNet2D training reported a dimension error. The problem may be related to the slicing transformation, but more details are needed.
Yup, I ran into some errors with the 2D model on the BTCV multi-organ segmentation task (14 classes). I will provide more details on the 2D model tomorrow.
@tangy5 I'm investigating this issue as well. Did the error appear at the beginning of training, or after a few epochs? Please also share the data source config (aka "input.yaml") here. Thanks
I have finished training 15 models for MSD Task05 Prostate. Performance looks reasonable (shown below), and the 2D model works best, as expected.
All algos are done. SegResNet works best; the 2D model performs worse, as expected, due to small organs such as the adrenal gland. Thanks
The config now supports the _desc_ keyword for textual descriptions that improve readability, so it might be useful to include some comments in the templates:
image_key: image
label_key: label
network:
  _target_: UNet
  _desc_: "my testing network with batchnorm"
  spatial_dims: 3
  in_channels: 1
  out_channels: 2
  channels: [16, 32, 64, 128, 256]
  strides: [2, 2, 2, 2]
  num_res_units: 2
  norm: batch
Hi @mingxin-zheng @dongyang0122 ,
For the next release, the main new feature is experiment management, which is tracked in https://github.com/Project-MONAI/MONAI/issues/4903, so let's close this ticket.
Thanks.
Is your feature request related to a problem? Please describe. As we have already merged the first version of the implementation and will test and refine it soon, this ticket is used to track feedback and task items for improvement.