
Auto3Dseg cuda OOM during Ensembling #1505

**Open** · udiram opened this issue 1 year ago

udiram commented 1 year ago

**Describe the bug**
All models have finished training; during the ensembling process, CUDA runs out of memory.

**To reproduce**
Steps to reproduce the behavior: run AutoRunner on the AMOS22 dataset (a minimal sketch follows).
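Roughly what was run, as a minimal sketch; the datalist and data-root paths below are placeholders, not the actual ones used:

```python
# hypothetical AMOS22 task config; only the paths differ from a real run
from monai.apps.auto3dseg import AutoRunner

runner = AutoRunner(
    work_dir="./auto3dseg_work_dir",
    input={
        "modality": "CT",
        "datalist": "./amos22_datalist.json",  # placeholder path
        "dataroot": "./AMOS22",                # placeholder path
    },
)
runner.run()  # all models train to completion; the OOM hits in the ensemble step
```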

Manually resetting the CUDA cache, restarting the kernel, and restarting the instance all come back to this error.
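For reference, "resetting the CUDA cache" here means roughly the following; it is a common mitigation but did not help in this case:

```python
import gc

import torch

gc.collect()              # drop Python references to dead tensors first
torch.cuda.empty_cache()  # release cached allocator blocks back to the driver
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```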

**Expected behavior**
Training proceeds without error.

- MONAI version: 1.2.0
- Numpy version: 1.25.2
- Pytorch version: 2.0.1+cu117
- MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
- MONAI rev id: c33f1ba588ee00229a309000e888f9817b4f1934
- MONAI file: /home/exouser/.local/lib/python3.10/site-packages/monai/__init__.py

**Optional dependencies:**
- Pytorch Ignite version: 0.4.11
- ITK version: 5.3.0
- Nibabel version: 5.1.0
- scikit-image version: 0.21.0
- Pillow version: 9.0.1
- Tensorboard version: 2.14.0
- gdown version: 4.7.1
- TorchVision version: 0.15.2+cu117
- tqdm version: 4.66.1
- lmdb version: 1.4.1
- psutil version: 5.9.0
- pandas version: 2.0.3
- einops version: 0.6.1
- transformers version: 4.21.3
- mlflow version: 2.6.0
- pynrrd version: 1.0.0

**Environment (please complete the following information):**
- OS: Ubuntu 22.04
- Python: 3.10.12
- Driver Version: 525.85.05
- CUDA Version: 12.0
- GPU: GRID A100X-40C
- RAM: 125 GB

[two screenshots of the error output were attached here]

I'm happy to provide any other logs to help. This is the second time I've run into this issue; it persists after a full kernel restart and RAM clearing.

KumoLiu commented 1 year ago

Hi @dongyang0122, could you please share some comments here? Thanks in advance!

udiram commented 1 year ago

Hi @KumoLiu, just following up on this: are there any other similar issues I could reference to troubleshoot? Thanks!

KumoLiu commented 1 year ago

Hi @udiram, here are some similar issues you could refer to:

- https://github.com/Project-MONAI/tutorials/discussions/1089
- https://github.com/Project-MONAI/tutorials/discussions/975

Thanks!

udiram commented 1 year ago

Hi @KumoLiu, #1089 worked for me to get training going; as I mentioned in that issue, the same fix applied here (i.e. setting the spacing in SwinUNETR to 1.5, 1.5, 1.5), so thanks for this! A sketch of that fix is below.
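For anyone landing here, a hedged sketch of applying that fix by editing the generated SwinUNETR bundle config; the work-dir layout and the "spacing" key name are assumptions about the algorithm template, not verified names:

```python
import yaml  # PyYAML

# assumed location of the generated swinunetr hyper-parameters
cfg_path = "./auto3dseg_work_dir/swinunetr_0/configs/hyper_parameters.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)
cfg["spacing"] = [1.5, 1.5, 1.5]  # assumed key name for the resampling spacing
with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)
```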

I am, however, still running into the ensembling issue, which doesn't seem to be addressed in #975 specifically. The good thing about the crash happening so late is that the inferences on the test images are indeed saved; what I'm missing is a model.pth so I can run the model on some ground-truth images as I had hoped to do. Do you know if there is a way to extract this, similar to a model trained with the https://github.com/Project-MONAI/tutorials/blob/main/3d_segmentation/swin_unetr_btcv_segmentation_3d.ipynb routine? Once I have that model file, I shouldn't necessarily need to go through the rest of the Auto3dseg pipeline.

All of this with the caveat that Auto3dseg doing this automatically without GPU issues would be great!

KumoLiu commented 1 year ago

Hi @udiram, I looked at the source code and found that the model is saved under "bundle_root/models":

- https://github.com/Project-MONAI/research-contributions/blob/0cd69f2a64b727ab8103d30512b39a7eb6a09ed3/auto3dseg/algorithm_templates/segresnet/scripts/segmenter.py#L1136
- https://github.com/Project-MONAI/research-contributions/blob/0cd69f2a64b727ab8103d30512b39a7eb6a09ed3/auto3dseg/algorithm_templates/segresnet/configs/hyper_parameters.yaml#L2
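A minimal sketch of loading one of those checkpoints for standalone inference; the exact folder names and the checkpoint layout are assumptions based on the segresnet template:

```python
import torch

# assumed path: <work_dir>/<algo>/model_fold<k>/model.pt
ckpt = torch.load(
    "./auto3dseg_work_dir/segresnet_0/model_fold0/model.pt",
    map_location="cpu",
)
# some templates save a bare state_dict, others wrap it in a dict
if isinstance(ckpt, dict) and "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt
```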

Thanks!

udiram commented 1 year ago

Thanks @KumoLiu! I'll give it a go!

udiram commented 1 year ago

Hi @KumoLiu, is there anywhere for me to see which model performed best during training, so I can run inference using that model? I notice that every fold of every model has an associated .pt file, but I'm not seeing a global best model/fold.

thanks

KumoLiu commented 1 year ago

Hi @udiram, I think "model.pt" is the best model for each fold. A final model is also saved. You may need to ensemble them to get the final result:

- https://github.com/Project-MONAI/research-contributions/blob/0cd69f2a64b727ab8103d30512b39a7eb6a09ed3/auto3dseg/algorithm_templates/segresnet/scripts/segmenter.py#L1288-L1295
- https://github.com/Project-MONAI/research-contributions/blob/0cd69f2a64b727ab8103d30512b39a7eb6a09ed3/auto3dseg/algorithm_templates/segresnet/scripts/segmenter.py#L1136-L1137
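In case it helps with picking a checkpoint manually, a hedged sketch for comparing the per-fold files; whether a validation metric is stored in each checkpoint, and under which key ("best_metric" below), is an assumption about the templates:

```python
import glob

import torch

for path in sorted(glob.glob("./auto3dseg_work_dir/*/model_fold*/model.pt")):
    ckpt = torch.load(path, map_location="cpu")
    metric = ckpt.get("best_metric") if isinstance(ckpt, dict) else None
    print(path, metric)  # metric prints as None if the key is absent
```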

udiram commented 1 year ago

Hi @KumoLiu, thanks for the info. I guess I'm a bit stuck until this ensembling issue is figured out; is there anything else, debugging- or log-wise, that you or @dongyang0122 need in order to figure it out?

thanks!

KumoLiu commented 1 year ago

Hi @udiram, for how to ensemble, you can refer to:

- https://github.com/Project-MONAI/tutorials/blob/main/modules/cross_validation_models_ensemble.ipynb
- https://github.com/Project-MONAI/MONAI/blob/281cb0119c01eaa8e6c841880b91f92f45e8d7f7/monai/apps/auto3dseg/ensemble_builder.py#L404
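A hedged sketch of driving the ensemble step manually through that ensemble_builder API; the work-dir layout and the "input.yaml" file name are assumptions based on AutoRunner's defaults:

```python
from monai.apps.auto3dseg import (
    AlgoEnsembleBestByFold,
    AlgoEnsembleBuilder,
    import_bundle_algo_history,
)

# gather the trained algos that AutoRunner left in the working directory
history = import_bundle_algo_history("./auto3dseg_work_dir", only_trained=True)
builder = AlgoEnsembleBuilder(history, "./auto3dseg_work_dir/input.yaml")
builder.set_ensemble_method(AlgoEnsembleBestByFold(n_fold=5))
ensemble = builder.get_ensemble()
preds = ensemble()  # optionally pass a pred_param dict to shape/limit inference
```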

Thanks!

udiram commented 1 year ago

Hi @KumoLiu

Thanks for the resources. Does this integrate into the Auto3dseg pipeline in any way? Is there a way to point the ensembler at the files generated by Auto3dseg?

Thanks

KumoLiu commented 1 year ago

Hi @udiram, yes, it has already been integrated into the AutoRunner. https://github.com/Project-MONAI/MONAI/blob/281cb0119c01eaa8e6c841880b91f92f45e8d7f7/monai/apps/auto3dseg/auto_runner.py#L815

You can also override it by:

```python
from monai.apps.auto3dseg import AutoRunner

runner = AutoRunner(input=input)
runner.set_ensemble_method(ensemble_method_name="AlgoEnsembleBestByFold")
```
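As far as I can tell from the source, `AlgoEnsembleBestN` is the other built-in method and can be selected the same way (it accepts an `n_best` keyword through the same call).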

Just FYI: https://github.com/Project-MONAI/tutorials/tree/main/auto3dseg/notebooks

Thanks!

udiram commented 1 year ago

Sure, I'll give the override a try. Do you have any ideas on how to run ensembling with less GPU usage, similar to the fix during validation for #1089?

thanks!

udiram commented 1 year ago

Hi @KumoLiu, just following up on this issue!