Project-MONAI / MONAILabel

MONAI Label is an intelligent open source image labeling and learning tool.
https://docs.monai.io/projects/label
Apache License 2.0

Scalability of monailabel (OOM errors) #1175

Closed chrisrapson closed 1 year ago

chrisrapson commented 1 year ago

Describe the bug
I have encountered two different situations where MONAI Label uses far more memory than I would expect. Are these user errors, or are they related to my dataset? Has MONAI Label been designed with scalability in mind?

  1. When I click Train, my entire dataset is loaded into CPU RAM. Our dataset is larger than some of the competition datasets (BTCV or MSD) but not extremely large: roughly 100 CT scans of 512x512xH, where H is usually around 500. Uncompressed, that adds up to nearly 100GB, which crashes the program. Is there an option to avoid loading all data into RAM and instead load it on demand, perhaps with pre-fetching to avoid creating a bottleneck? Since I am using the segmentation model, which trains on patches, would it be sufficient to load just the patches into RAM rather than the full images?

In case it is relevant, my dataset has 12 foreground labels.

My workaround is to use swap, but obviously that's not ideal.

  2. After training, clicking RUN gives me another OOM error. I tried decreasing the roi_size for my model, but even at 64x64x64 I'm still exceeding the 8GB of GPU VRAM available:

For 128x128x128

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.25 GiB (GPU 0; 7.92 GiB total capacity; 440.46 MiB already allocated; 6.63 GiB free; 610.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

For 96x96x96

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.62 GiB (GPU 0; 7.92 GiB total capacity; 1.20 GiB already allocated; 5.46 GiB free; 1.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

For 64x64x64

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 7.62 GiB (GPU 0; 7.92 GiB total capacity; 1.20 GiB already allocated; 5.46 GiB free; 1.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It might be expected behaviour for deepedit models to cause an OOM, since they run on the full image. However, I expected that a segmentation model would scale to arbitrarily sized images, because it analyses the image in patches. Have I misunderstood something? Or is the stitching of the patches also carried out on the GPU?

To Reproduce
Steps to reproduce the behavior:

  1. Get hold of a medium-sized dataset with ground truth labels. Put them in a folder structure as expected by MONAI. Hold back the ground truth labels for at least one image, for use in the inference step.
  2. Make a copy of the radiology/lib/config/segmentation.py file (e.g. segmentation_custom.py) and modify the foreground classes and roi_size.
  3. Run the monailabel app:
    monailabel start_server --app radiology --studies relative/path/to/images --conf models segmentation_custom --conf use_pretrained_model false
  4. In Slicer, connect to the server and click Train.
  5. If you have enough CPU RAM and training completes, click Next Sample to get an unlabelled image and then Run to automatically generate labels.

Expected behavior
I expected to be able to train a network and run inference on a dataset with an arbitrary number of arbitrarily sized images.

I've used 128x128x128 patches with nnU-Net and been able to run inference on GPUs with only 4GB of VRAM. I'm surprised that an 8GB GPU gets an OOM when trying to run the segmentation network with 64x64x64 patches.

8GB of GPU memory was enough to train the network, so I assumed it would also be enough to run inference.

Screenshots N/A

Environment

Ensuring you use the relevant python executable, please paste the output of:

python -c 'import monai; monai.config.print_debug_info()'
================================
Printing MONAI config...
================================
MONAI version: 1.0.1
Numpy version: 1.23.4
Pytorch version: 1.13.0+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 8271a193229fe4437026185e218d5b06f7c8ce69
MONAI __file__: /home/chris/Software/monai/venv/lib/python3.8/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.10
Nibabel version: 4.0.2
scikit-image version: 0.19.3
Pillow version: 9.3.0
Tensorboard version: 2.11.0
gdown version: 4.5.3
TorchVision version: 0.14.0+cu117
tqdm version: 4.64.1
lmdb version: 1.3.0
psutil version: 5.9.4
pandas version: NOT INSTALLED or UNKNOWN VERSION.
einops version: 0.6.0
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: 0.4.3

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 20.04.5 LTS
Platform: Linux-5.14.0-1054-oem-x86_64-with-glibc2.29
Processor: x86_64
Machine: x86_64
Python version: 3.8.10
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 6
Num logical CPUs: 12
Num usable CPUs: 12
CPU usage (%): [16.5, 22.2, 15.4, 25.0, 20.9, 82.1, 13.4, 10.5, 12.3, 12.3, 13.9, 15.2]
CPU freq. (MHz): 1579
Load avg. in last 1, 5, 15 mins (%): [11.6, 10.4, 26.0]
Disk usage (%): 81.0
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 31.0
Available memory (GB): 28.3
Used memory (GB): 2.2

================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 11.7
cuDNN enabled: True
cuDNN version: 8500
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: NVIDIA GeForce GTX 1080
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 20
GPU 0 Total memory (GB): 7.9
GPU 0 CUDA capability (maj.min): 6.1

Additional context N/A

SachidanandAlle commented 1 year ago

Please use a persistent dataset instead of the cache or smart cache dataset while training.
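
For example, something like this (a rough sketch using MONAI's public API; the file names and transforms are illustrative, not taken from the radiology app):

    from monai.data import CacheDataset, PersistentDataset
    from monai.transforms import Compose, EnsureChannelFirstd, LoadImaged

    data = [{"image": "img_0001.nii.gz", "label": "lab_0001.nii.gz"}]  # illustrative file names
    pre = Compose([
        LoadImaged(keys=("image", "label")),
        EnsureChannelFirstd(keys=("image", "label")),
    ])

    # CacheDataset keeps every pre-processed item in CPU RAM (fast, but ~100GB for your data).
    # cache_ds = CacheDataset(data=data, transform=pre)

    # PersistentDataset writes pre-processed items to disk and re-reads them on demand,
    # so RAM only needs to hold the samples currently in use.
    ds = PersistentDataset(data=data, transform=pre, cache_dir="./persistent_cache")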

chrisrapson commented 1 year ago

Thanks for the suggestion @SachidanandAlle, but it didn't work for me. Is radiology/main.py the right place to make this change? I tried it like this, but training still used more than 100GB of RAM/swap.

    # Train
    app.train(
        request={
            "model": args.model,
            "max_epochs": 10,
            "dataset": "PersistentDataset",  # Dataset, PersistentDataset, CacheDataset
            "train_batch_size": 1,
            "val_batch_size": 1,
            "multi_gpu": False,
            "val_split": 0.1,
        },
    )

I ask because making a change in main.py means it will apply to all datasets, right? I would have thought it would be better to make this a dataset-specific setting, either in my segmentation_custom.py or as a command line parameter.

And... while I was testing that, I also tried changing roi_size to 32x32x32, but I still get the same GPU OOM error.

SachidanandAlle commented 1 year ago

OK, first thing: you have only 8GB of memory on your GPU. A few questions:

  1. What is the input image size? OK, I see it in the first comment: 512x512xH, where H is usually in the range of about 500.
  2. Some transforms are faster when their data is loaded onto the GPU, for example sampling. If you have a larger image, you should not run these on the GPU.
  3. Check which transform/operation fails on the GPU.

It's definitely not about the ROI, since you have already tried smaller patches. Something is failing on the GPU before that point, likely one of the pre-transforms.
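
For point 3, something like this can help (a rough debugging sketch; the helper name and the hard-coded keys are illustrative, not part of MONAI Label):

    import torch

    def trace_transform_devices(transforms, data):
        # Apply each pre-transform in turn and report where its outputs live, so the
        # first transform that puts a full-size volume onto the GPU stands out.
        for t in transforms:
            data = t(data)
            for key in ("image", "label", "pred"):
                if key in data and isinstance(data[key], torch.Tensor):
                    print(type(t).__name__, key, tuple(data[key].shape), data[key].device)
        return data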

chrisrapson commented 1 year ago

Ah right, I probably should have looked more closely at the error messages. One line above the OOM message was this line:

.../venv/lib/python3.8/site-packages/monai/inferers/utils.py", line 219, in sliding_window_inference
    output_image_list.append(torch.zeros(output_shape, dtype=compute_dtype, device=device))

I managed to follow that back to the inferer() method in radiology/lib/infers/segmentation.py, and added an argument to the call to SlidingWindowInferer(roi_size=self.roi_size, device=torch.device('cpu')).

Unfortunately, that only postponed the OOM error until the post transforms. Again, following your advice I was able to read through the stack trace to discover that it was the EnsureType type conversion which was trying to load the full image back into GPU memory. I was able to modify that line, and now it runs :-D

            EnsureTyped(keys="pred", device=torch.device('cpu') if data else None),

Thanks for your help!! This lets me run inference, and with my swap workaround I can run training too. Is it worth considering turning this into a feature request for more defensive programming? Perhaps using torch.device('cpu') for these operations unless the user explicitly enables GPU, or unless the image size is guaranteed to fit in GPU memory? For nnunet, there is a command line argument --all-in-gpu that serves this purpose. Having it disabled by default removes one potential source of problems.
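
(For context, if I've understood the sliding-window inferer correctly, the allocation that fails is the stitched full-volume output rather than the patch, which is why shrinking roi_size made no difference: for an image with H around 600 and 13 output channels (12 foreground + background), the float32 output is roughly 512 x 512 x 600 x 13 x 4 bytes, about 7.6 GiB, which matches the 7.62 GiB allocations in the errors above.)

For reference, a rough sketch of the two edits, assuming the layout of my copy of radiology/lib/infers/segmentation.py:

    import torch
    from monai.inferers import Inferer, SlidingWindowInferer
    from monai.transforms import EnsureTyped

    def inferer(self, data=None) -> Inferer:
        # Keep the stitched full-volume output on the CPU; the individual patches
        # still run through the network on the GPU.
        return SlidingWindowInferer(roi_size=self.roi_size, device=torch.device("cpu"))

    def post_transforms(self, data=None):
        return [
            # ...the existing post transforms stay the same, except that EnsureTyped now
            # targets the CPU so the full prediction is not copied back onto the GPU:
            EnsureTyped(keys="pred", device=torch.device("cpu") if data else None),
            # ...
        ]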

diazandr3s commented 1 year ago

> Is it worth considering turning this into a feature request for more defensive programming? Perhaps using torch.device('cpu') for these operations unless the user explicitly enables GPU [...]

This is a good idea, @chrisrapson. Would you mind creating a PR or opening an issue proposing this?

SachidanandAlle commented 1 year ago

The objective of making everything run on the GPU by default is to demonstrate the power of GPUs. If you have fairly good GPUs, you can see the training cost is cut down a lot. In many situations people want to see that happen by default, especially when they have paid for powerful GPUs on AWS/cloud and want to see how fast they can reach good performance. That's the main reason these configs default to running everything on the GPU.

More generally, we want to promote GPU computation as part of the AI paradigm as much as possible; that is one of the main objectives too. I still run some basic things on my Windows laptop with a 6GB GPU, but only for small images/models. Otherwise, life is more complicated if I try to fit everything into a small GPU.

chrisrapson commented 1 year ago

Ok, I understand the trade-off between showcasing the speed of the GPU vs making the software more robust to larger images and larger datasets. That's a judgement call, so I'll leave it up to you. At a minimum, I'd suggest updating the documentation to highlight this potential issue and how to work around it.

Still, I'll list some arguments in favour of making all-in-gpu an opt-in setting:

  1. If I had a big enough GPU, I would have chosen to run the deepedit model. I switched to the segmentation model specifically because I read that it works on patches and doesn't load the full image into GPU RAM. Assuming I understood that correctly, would it make sense to keep the deepedit model all-in-gpu by default, while the segmentation model only runs the inference for each patch on the GPU?
  2. I don't think our images or our dataset are particularly large, so I assume I'm not the only one who will hit this problem. If I were in your position as a dev, I would try to avoid being asked these questions repeatedly, so robustness on all kinds of hardware would be a priority for me, as well as providing instructions for how to get more performance on an opt-in basis.
  3. There are images out there (e.g. high resolution multi-modal MRIs) that will exceed the capacity of even SOTA GPUs, and/or I might achieve better accuracy with some other hyperparameter setting that uses lots of GPU RAM, e.g. increasing the patch size or using a transformer model. Increased accuracy will save me more time overall because I won't have to make as many manual changes.

Finally, I wonder if it would be worth considering a try/except approach, where you try doing everything on the GPU and fall back to the CPU if necessary?
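
Something like this, just to illustrate the idea (not existing MONAI Label code):

    import torch

    def run_with_gpu_fallback(fn, **kwargs):
        # Try the GPU first; on an OOM error, free the cached blocks and retry on the CPU.
        try:
            return fn(device=torch.device("cuda"), **kwargs)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return fn(device=torch.device("cpu"), **kwargs)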

P.S. I still have the open question as to why changing to a PersistentDataset didn't reduce my RAM usage during training.

SachidanandAlle commented 1 year ago

You can check the dump size of the PersistentDataset; you will only notice a difference if the dump is of a reasonable size. The persistent cache is saved into your model/xyz/train_xy folder (cache or .cache).
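
For example (the path below is illustrative; point it at wherever your app writes its persistent cache):

    from pathlib import Path

    cache_dir = Path("radiology/model/segmentation_custom/train_01/cache")  # illustrative path
    size_gb = sum(f.stat().st_size for f in cache_dir.rglob("*") if f.is_file()) / 1e9
    print(f"PersistentDataset cache on disk: {size_gb:.1f} GB")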

You can also dig into the details to see how much is loaded onto the GPU vs the CPU. If a pre-transform is cached after loading data onto the GPU, the corresponding tensor gets saved to disk. Some memory profilers can help you learn a bit more.

I understand the point about supporting GPU vs non-GPU enforcement in some of the examples; it can be a good config option. For your segmentation_xxx model, you can do something like this:

    def train_pre_transforms(self, context: Context):
        t = [
            LoadImaged(keys=("image", "label")),
            EnsureChannelFirstd(keys=("image", "label")),
            Spacingd(
                keys=("image", "label"),
                pixdim=(1.0, 1.0, 1.0),
                mode=("bilinear", "nearest"),
            ),
            ScaleIntensityRanged(keys="image", a_min=-57, a_max=164, b_min=0.0, b_max=1.0, clip=True),
            CropForegroundd(keys=("image", "label"), source_key="image"),
        ]

        if context.request.get("all-in-gpu", False):
            t.append(EnsureTyped(keys=("image", "label"), device=context.device))

        t.extend([
            RandCropByPosNegLabeld(
                keys=("image", "label"),
                label_key="label",
                spatial_size=(96, 96, 96),
                pos=1,
                neg=1,
                num_samples=4,
                image_key="image",
                image_threshold=0,
            ),
            RandShiftIntensityd(keys="image", offsets=0.1, prob=0.5),
            SelectItemsd(keys=("image", "label")),
        ])
        return t

SachidanandAlle commented 1 year ago

> Still, I'll list some arguments in favour of making all-in-gpu an opt-in setting [...]

If you have suggestions to support this config for an example app, please feel free to create a PR so that users of both types can enjoy the benefit. Maybe by default we keep it enabled, and if someone wants to run on smaller GPUs, they can use an additional conf option (which will not push some pre-transforms etc. onto the GPU).

SachidanandAlle commented 1 year ago

I am closing the issue as there has been no further update on this thread. Feel free to reopen.