Hi @lukasvanderstricht,
Thanks for opening the issue here. From the logs I see you're having a memory issue:
```
File "/opt/conda/lib/python3.7/site-packages/monai/apps/deepedit/transforms.py", line 99, in __call__
    label = np.zeros(d[key].shape)
numpy.core._exceptions.MemoryError: Unable to allocate 502. MiB for an array with shape (512, 512, 251) and data type float64
```
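(That figure checks out: 512 × 512 × 251 voxels × 8 bytes per float64 value ≈ 526 MB, i.e. 502 MiB, for a single label array, so caching several volumes of that size quickly exhausts system memory.)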
I also see that you're using CacheDataset to load the dataset for training:

```
[2023-07-13 12:20:32,301] [5998] [MainThread] [INFO] (monailabel.tasks.train.basic_train:353) - Train Request (input): {'model': 'deepedit', 'name': 'train_01', 'pretrained': 1, 'device': 'cuda', 'max_epochs': 50, 'early_stop_patience': -1, 'val_split': 0.2, 'train_batch_size': 1, 'val_batch_size': 1, 'multi_gpu': True, 'gpus': 'all', 'dataset': 'CacheDataset', 'dataloader': 'ThreadDataLoader', 'client_id': 'user-xyz', 'local_rank': 0}
```
Caching the dataset is great because it is faster. However, it seems your machine can't cache the number of volumes you are using to train the model.

I'd recommend either:

- using Dataset instead. The downside of this is that training is slower; or
- keeping CacheDataset, but training the model in chunks of a size that your machine can cache (see the sketch after this list).
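For reference, here's a minimal sketch of the two options; this is not MONAILabel's actual training code, and the file names and transforms are placeholders:

```python
from monai.data import CacheDataset, Dataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

# Placeholder pipeline and data list, just to illustrate the two dataset types.
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])
data = [{"image": "vol_01.nii.gz", "label": "seg_01.nii.gz"}]

# Option 1: Dataset applies the transforms on the fly every epoch.
# Low memory use, but slower per epoch.
train_ds = Dataset(data=data, transform=transforms)

# Option 2: CacheDataset pre-computes transformed samples and holds them in RAM.
# Fast, but each cached volume costs memory; cache_rate < 1.0 caches only a
# fraction of the dataset if memory is tight.
train_ds_cached = CacheDataset(data=data, transform=transforms, cache_rate=0.5)
```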
Just to understand a bit more about the use case here:
1/ How many labels are you trying to segment (https://github.com/Project-MONAI/MONAILabel/blob/main/sample-apps/radiology/lib/configs/deepedit.py#L42-L51)?
2/ Did you change the default volume size (https://github.com/Project-MONAI/MONAILabel/blob/main/sample-apps/radiology/lib/configs/deepedit.py#L77)?
Let us know
Hi @diazandr3s
Thank you for your answer! I am indeed able to run the training process when I switch from CacheDataset to Dataset. The downside is that this switch significantly increases training time, so I was wondering whether there is a way to fix the memory issue while still using CacheDataset. I will try the option of using larger batches, thank you for the suggestion!
Here's some more background:
1/ I have 9 labeled images with 3 labels each (one of which is the background label).
2/ I have not changed anything about the volume size yet.
Hope this helps!
Thank you again for your answer!
Thanks for the reply, @lukasvanderstricht
With regards to this:
I will try the option of using larger batches, thank you for the suggestion!
I meant training the model on the number of volumes your machine can cache, then retraining on the remaining volumes, keeping the default batch size of 1. A rough sketch follows.
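Something along these lines (a sketch, not MONAILabel's actual retraining flow; it assumes a `data` list and `transforms` pipeline like the earlier snippet, and `train_one_round` is a hypothetical stand-in for your training loop):

```python
from monai.data import CacheDataset, DataLoader

chunk_size = 4  # assumption: however many volumes your machine can cache at once
for start in range(0, len(data), chunk_size):
    subset = data[start:start + chunk_size]
    cached = CacheDataset(data=subset, transform=transforms)
    loader = DataLoader(cached, batch_size=1)  # keep the default batch size of 1
    # train_one_round(loader)  # hypothetical: resume from the previous checkpoint
```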
Hi,
out of curiosity, have you tried using PersistentDataset (still with a batch size of 1)? In my experience that also results in a nice speedup, especially if the source volumes are compressed (e.g. .nii.gz) or if compute-heavy pre-processing happens (e.g. resampling, which is the case by default in the radiology app, especially at 512x512xN). Could be worth a try.
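For anyone else reading, a minimal sketch of that approach; paths and transforms are placeholders, and cache_dir should point at a fast local disk:

```python
from monai.data import DataLoader, PersistentDataset
from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged, Spacingd

# Placeholder pipeline: the deterministic transform results are serialized to
# cache_dir on first use and reloaded on later epochs.
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    Spacingd(keys=["image", "label"], pixdim=(1.0, 1.0, 1.0)),  # example resampling
])
data = [{"image": "vol_01.nii.gz", "label": "seg_01.nii.gz"}]

train_ds = PersistentDataset(data=data, transform=transforms,
                             cache_dir="/path/to/fast_disk/monai_cache")
loader = DataLoader(train_ds, batch_size=1)
```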
Hi @nvahmadi
Thanks for the suggestion! It does indeed also seem to work, but I don't see a major difference in speed compared to Dataset. Thanks for your reply, though!
Kind regards
Thanks for reporting back, and interesting to note. Not sure why; for me the speedup was drastic, comparable to CacheDataset. Perhaps two reasons: 1) I cache to NVMe drives, and 2) the first epoch will still be perceptibly as slow as Dataset, since every sample encountered for the first time needs to be pre-processed and written to disk. The speed-up should become noticeable from epoch 2 onward, though. Did you let it run beyond epoch 1, and are you caching to NVMe drives as well?
I did let it run beyond epoch 1, but it remains as slow as Dataset. I don't cache to NVMe drives, though.
Ok, good to know, thanks. One note: I just remembered that I had this experience in the context of MONAI Core and with larger batch sizes. I'd need to try myself whether I get similar speed-ups in MONAI Label with, e.g., a batch size of 1. Sorry for the confusion!
Closing this issue
Dear all
I am currently using 3D Slicer and its MONAILabel extension to train a segmentation model using the DeepEdit model from the predefined radiology app. Both manual segmentation and training have been going smoothly up until now, and the automatic segmentation functionality seems to be doing its job. However, when I want to train the model further at this point, without having added any new labels (so just starting the training process again), I always get one of the two following errors.
It seems odd to me that, without changing anything (such as adding new labels), the training suddenly starts to fail systematically while it was working fine before. Does anyone have any clue as to why these errors occur?
Thanks in advance!