Open udiram opened 1 year ago
Hi @udiram, I hope this discussion can help you. Thanks!
Thanks @KumoLiu, setting it to [2,2,2] in the config has it running now. I see that the discussion mentions a possible fix for this in a release. Should I be on the weekly branch, in case any other such issues crop up?
thanks,
I'm not sure whether the fix has been released. @dongyang0122, could you please help point it out if there is a fix? Thanks in advance!
Describe the bug
AutoRunner completes ['dints_0', 'dints_1', 'dints_2', 'dints_3', 'dints_4', 'segresnet2d_0', 'segresnet2d_1', 'segresnet2d_2', 'segresnet2d_3', 'segresnet2d_4'] training and errors out on segresnet_0:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 386.00 MiB (GPU 0; 40.00 GiB total capacity; 10.45 GiB already allocated; 341.25 MiB free; 10.94 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
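The numbers in that message can be broken down with a little arithmetic. The sketch below is only an illustration: the helper name `summarize_oom` is hypothetical, and the regular expressions assume the PyTorch 2.0-style wording shown above. It computes how much memory PyTorch has cached but not allocated (the fragmentation case that `max_split_size_mb` targets) and how much of the card is neither free nor reserved by this process.

```python
import re

# The error text from this report, copied verbatim (trailing hint omitted).
OOM_MESSAGE = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 386.00 MiB "
    "(GPU 0; 40.00 GiB total capacity; 10.45 GiB already allocated; 341.25 MiB free; "
    "10.94 GiB reserved in total by PyTorch)"
)

_UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}
_SIZE = r"([\d.]+) (MiB|GiB)"


def _find_gib(message, pattern):
    """Return the size matched by `pattern`, converted to GiB."""
    match = re.search(pattern, message)
    if match is None:
        raise ValueError(f"pattern not found: {pattern!r}")
    return float(match.group(1)) * _UNIT_TO_GIB[match.group(2)]


def summarize_oom(message):
    """Hypothetical helper: derive a few quantities from an OOM message."""
    requested = _find_gib(message, "Tried to allocate " + _SIZE)
    total = _find_gib(message, _SIZE + " total capacity")
    allocated = _find_gib(message, _SIZE + " already allocated")
    free = _find_gib(message, _SIZE + " free")
    reserved = _find_gib(message, _SIZE + " reserved")
    return {
        "requested_gib": requested,
        # Cached by PyTorch but not in use; large values suggest fragmentation.
        "cached_but_unused_gib": reserved - allocated,
        # Neither free nor reserved by this process's PyTorch allocator.
        "outside_this_process_gib": total - free - reserved,
    }


if __name__ == "__main__":
    for name, value in summarize_oom(OOM_MESSAGE).items():
        print(f"{name}: {value:.2f}")
```

For this message the cached-but-unused figure is about 0.49 GiB, so reserved memory is not much larger than allocated memory and the `max_split_size_mb` hint probably does not apply here. Roughly 28.7 GiB is accounted for neither as free nor as reserved by this process. What holds that memory (another process, for instance) is not established anywhere in this thread.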
To Reproduce
Steps to reproduce the behavior:
Run AutoRunner on the AMOS22 dataset
Manually clearing the CUDA cache, restarting the kernel, and restarting the instance all lead back to this error.
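One way to check whether the GPU actually has free memory before a run starts is to ask the driver directly. The following is a minimal sketch, assuming PyTorch is installed. The function name `gpu_memory_gib` is made up for illustration, and it returns `None` when PyTorch or CUDA is unavailable.

```python
def gpu_memory_gib(device_index=0):
    """Return (free_gib, total_gib) for a CUDA device, or None if unavailable.

    Hypothetical helper for illustration; not part of MONAI or PyTorch.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Release cached blocks held by this process so the reading is not
    # inflated by PyTorch's own caching allocator.
    torch.cuda.empty_cache()
    free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
    gib = 1024 ** 3
    return free_bytes / gib, total_bytes / gib


if __name__ == "__main__":
    reading = gpu_memory_gib()
    if reading is None:
        print("No CUDA device visible from this Python process.")
    else:
        free_gib, total_gib = reading
        print(f"free: {free_gib:.2f} GiB of {total_gib:.2f} GiB")
```

If the free figure is already far below the total in a fresh kernel, the memory is being held outside the current process, and clearing the cache from Python will not release it.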
Expected behavior
Training proceeds without error.

MONAI version: 1.2.0
Numpy version: 1.25.2
Pytorch version: 2.0.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: c33f1ba588ee00229a309000e888f9817b4f1934
MONAI __file__: /home/exouser/.local/lib/python3.10/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.21.0
Pillow version: 9.0.1
Tensorboard version: 2.14.0
gdown version: 4.7.1
TorchVision version: 0.15.2+cu117
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.0
pandas version: 2.0.3
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.6.0
pynrrd version: 1.0.0

Environment (please complete the following information):
OS: Ubuntu 22.04
Python 3.10.12
Driver Version: 525.85.05
CUDA Version: 12.0
GRID A100X-40C - 125GB RAM