Project-MONAI / tutorials

MONAI Tutorials
https://monai.io/started.html
Apache License 2.0

swinUNETR not working due to out-of-memory error on Google Colab #1258

Closed kengoto1112 closed 1 year ago

kengoto1112 commented 1 year ago

I am trying to run SwinUNETR with the swin_unetr_btcv_segmentation_3d.ipynb tutorial. However, I am hitting an out-of-memory error on Google Colaboratory's free GPU, so the notebook does not run to completion.

The error occurs at the following lines:

max_iterations = 5
eval_num = 5
post_label = AsDiscrete(to_onehot=14)
post_pred = AsDiscrete(argmax=True, to_onehot=14)
dice_metric = DiceMetric(include_background=True, reduction="mean", get_not_nans=False)
global_step = 0
dice_val_best = 0.0
global_step_best = 0
epoch_loss_values = []
metric_values = []
while global_step < max_iterations:
    global_step, dice_val_best, global_step_best = train(global_step, train_loader, dice_val_best, global_step_best)
model.load_state_dict(torch.load(os.path.join(root_dir, "best_metric_model.pth")))

Training (5 / 5 Steps) (loss=2.98188):  21%|██        | 5/24 [00:04<00:14,  1.30it/s]
Validate (X / X Steps) (dice=X.X):   0%|          | 0/6 [00:00<?, ?it/s]
Training (5 / 5 Steps) (loss=2.98188):  21%|██        | 5/24 [00:04<00:18,  1.04it/s]
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
[<ipython-input-16-9717af412448>](https://localhost:8080/#) in <module>
     10 metric_values = []
     11 while global_step < max_iterations:
---> 12     global_step, dice_val_best, global_step_best = train(global_step, train_loader, dice_val_best, global_step_best)
     13 model.load_state_dict(torch.load(os.path.join(root_dir, "best_metric_model.pth")))

15 frames
[/usr/local/lib/python3.9/dist-packages/monai/networks/nets/swin_unetr.py](https://localhost:8080/#) in forward(self, x, mask)
    496         ].reshape(n, n, -1)
    497         relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous()
--> 498         attn = attn + relative_position_bias.unsqueeze(0)
    499         if mask is not None:
    500             nw = mask.shape[0]

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.80 GiB (GPU 0; 14.75 GiB total capacity; 11.31 GiB already allocated; 638.81 MiB free; 13.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have tried adjusting PYTORCH_CUDA_ALLOC_CONF by setting it to max_split_size_mb:128, but I still face the same error.
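I set it near the top of the notebook, before importing torch, roughly like this (as I understand it, the variable has to be set before any CUDA memory is allocated, otherwise the allocator ignores it):

import os

# must be set before the first CUDA allocation,
# otherwise the caching allocator silently ignores it
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch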

To try to resolve this, I have reduced max_iterations and eval_num to 5. Additionally, I have reduced num_workers for CacheDataset to 1 and num_samples in the "Setup transforms for training and validation" section to 1.

I would appreciate any suggestions or solutions to resolve this issue. Thank you.

tangy5 commented 1 year ago

You could try this to reduce memory usage if the machine resources are limited:

Reduce the number of samples (the tutorial uses num_samples = 4) in:

        RandCropByPosNegLabeld(
            keys=["image", "label"],
            label_key="label",
            spatial_size=(96, 96, 96),
            pos=1,
            neg=1,
            num_samples=num_samples,
            image_key="image",
            image_threshold=0,
        ),

This will decrease the effective batch size during training, since each crop becomes one sample in the batch fed to the model.
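As a rough sketch of the arithmetic (values as used in the tutorial, if I recall correctly): RandCropByPosNegLabeld returns num_samples crops per loaded volume, and the DataLoader collates them, so the tensor reaching the model per iteration looks like this:

batch_size = 1    # DataLoader batch size in the tutorial
num_samples = 4   # crops per volume; lowering this lowers GPU memory

# image tensor fed to the network each step has shape
# (batch_size * num_samples, 1, 96, 96, 96)
print(batch_size * num_samples, "patches per iteration")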

kengoto1112 commented 1 year ago

Thank you for your response and suggestion. However, we have already reduced num_samples to 1 in the code as follows.

num_samples = 1

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID" device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_transforms = Compose( [ LoadImaged(keys=["image", "label"], ensure_channel_first=True), ScaleIntensityRanged( keys=["image"], a_min=-175, a_max=250, b_min=0.0, b_max=1.0, clip=True, ), CropForegroundd(keys=["image", "label"], source_key="image"), Orientationd(keys=["image", "label"], axcodes="RAS"), Spacingd( keys=["image", "label"], pixdim=(1.5, 1.5, 2.0), mode=("bilinear", "nearest"), ),

Do you have any other suggestions or recommendations that we could try to further reduce memory usage or address this issue? Thank you for your help.

tangy5 commented 1 year ago

What is the GPU memory on your side? SwinUNETR is a relatively large model; it probably needs at least 16 GB for smooth training/validation/testing, but it also depends on the data dimensions.
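If you are not sure, you can check it from the notebook with something like this (or just run !nvidia-smi in a cell):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024**3:.1f} GiB total")
    print(f"{torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB currently allocated")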

kengoto1112 commented 1 year ago

Thank you for your response and for letting us know about the memory requirements of SwinUNETR. Our GPU is a Tesla T4 (16 GB) from Google Colaboratory (free tier). However, I am following the tutorial on Google Colab, which is supposed to provide free GPU resources, so I find it frustrating that the tutorial is not working as expected on a platform that is advertised to support this type of workload. Even allowing that the BTCV dataset is quite large, I cannot get the tutorial to work on free Google Colab.

tangy5 commented 1 year ago

No worries, and thanks again for trying the SwinUNETR model and the tutorial. Could you have a final try of these:

  1. Double check that the batch size is 1 and num_samples is 1.
  2. Double check whether the out-of-memory (OOM) error occurs during validation; if it does, you can move part of sliding_window_inference to the CPU via its sw_device / device arguments, following the MONAI API docs (see the sketch below).
  3. If the above steps still do not help, you could reduce the training patch size.

Hope these help.
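A rough sketch of what point 2 could look like in the validation step, assuming val_inputs and model are as defined in the tutorial; keeping the per-window forward pass on the GPU while assembling the full-volume output on the CPU trades speed for GPU memory:

import torch
from monai.inferers import sliding_window_inference

with torch.no_grad():
    val_outputs = sliding_window_inference(
        val_inputs,            # validation volume from the tutorial's val_loader
        roi_size=(96, 96, 96),
        sw_batch_size=1,       # fewer windows per forward pass -> less peak memory
        predictor=model,
        sw_device="cuda",      # device the cropped windows are sent to (same device as the model)
        device="cpu",          # device where the full-volume output is stitched together
    )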

kengoto1112 commented 1 year ago

I changed 96x96x96 to 64x64x64, and it worked. Thank you so much for your kind support.
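For anyone else hitting this, the 96x96x96 size appears in a few places in the notebook, and they need to be changed together (the value should stay divisible by 32 so it survives the Swin downsampling stages, which 64 does). Roughly, assuming the rest of the tutorial code is unchanged:

import torch
from monai.networks.nets import SwinUNETR

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
roi = (64, 64, 64)  # was (96, 96, 96)

# 1) training crop in the transforms:
#        RandCropByPosNegLabeld(..., spatial_size=roi, ...)

# 2) model construction (other arguments as in the tutorial):
model = SwinUNETR(
    img_size=roi,
    in_channels=1,
    out_channels=14,
    feature_size=48,
    use_checkpoint=True,
).to(device)

# 3) validation:
#        sliding_window_inference(val_inputs, roi, 4, model)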