IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

CUDA runs out of memory #298

Closed hg6185 closed 1 year ago

hg6185 commented 1 year ago

I am currently training on only 1 GPU (an Nvidia V100 with, unfortunately, only 16 GB VRAM) with batch size 1. Unfortunately, my images are relatively large, at roughly 2500x2000 pixels. After about 1200 iterations I encounter the following error:

RuntimeError: CUDA out of memory. Tried to allocate 74.00 MiB (GPU 0; 15.77 GiB total capacity; 13.90 GiB already allocated; 7.88 MiB free; 14.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
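As a side note, the max_split_size_mb hint from the error message can be tried by configuring PyTorch's caching allocator before training starts; a minimal sketch (the value 128 is only an illustration, not a detrex recommendation):

import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes,
# so set it at the very top of the entry script (or export it in the shell
# that launches training) before torch is imported.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value

import torch  # imported only after the allocator config is in place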

My code to load the dataset:

from omegaconf import OmegaConf

import detectron2.data.transforms as T
from detectron2.config import LazyCall as L
from detectron2.data import build_detection_train_loader, get_detection_dataset_dicts
from detrex.data import DetrDatasetMapper

dataloader = OmegaConf.create()

dataloader.train = L(build_detection_train_loader)(
    dataset=L(get_detection_dataset_dicts)(names="ffb_c_train"),
    mapper=L(DetrDatasetMapper)(
        # Augmentation branch without cropping: random flip plus multi-scale resize.
        augmentation=[
            L(T.RandomFlip)(),
            L(T.ResizeShortestEdge)(
                short_edge_length=(480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800),
                max_size=1333,
                sample_style="choice",
            ),
        ],
        # Augmentation branch with cropping: resize, absolute-range crop, resize again.
        augmentation_with_crop=[
            L(T.RandomFlip)(),
            L(T.ResizeShortestEdge)(
                short_edge_length=(400, 500, 600),
                sample_style="choice",
            ),
            L(T.RandomCrop)(
                crop_type="absolute_range",
                crop_size=(384, 600),
            ),
            L(T.ResizeShortestEdge)(
                short_edge_length=(480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800),
                max_size=1333,
                sample_style="choice",
            ),
        ],
        is_train=True,
        mask_on=False,
        img_format="RGB",
    ),
    total_batch_size=1,
    num_workers=2,
)

I use this code (from the example) to load my dataset. Since I am not an expert on parallel computing, I have two questions:

1. Does this still create a memory overhead because maybe the images are still saved on the GPU?

2. The images that were provided are also grayscale and in JPG format. From the source code I saw that when they are loaded by Pillow they are converted to the img_format, which defaults to "RGB". Do I have to change anything here?

Thanks in advance!

rentainhe commented 1 year ago

Hello!

> Does this still create a memory overhead because maybe the images are still saved on the GPU?

I think it's better to print the input data size during training to debug this; the dataloader resizes the input image so that the longest edge is no longer than 1333.
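For example, a rough sketch of how the post-augmentation sizes could be inspected (the config path is a placeholder for whatever LazyConfig you launch training with):

from detectron2.config import LazyConfig, instantiate

# Placeholder: point this at the config file you pass to the training script.
cfg = LazyConfig.load("path/to/your_train_config.py")
loader = instantiate(cfg.dataloader.train)

for i, batch in enumerate(loader):
    # build_detection_train_loader yields a list of dicts per batch;
    # DetrDatasetMapper stores the augmented image as a (C, H, W) tensor under "image".
    for sample in batch:
        print(sample["file_name"], tuple(sample["image"].shape))
    if i >= 20:  # a few batches are enough to see the post-resize extents
        break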

> The images that were provided are also grayscale and in JPG format.

I have not run experiments on grayscale images yet, so I will double-check that. The JPG format is fine.
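If you want to verify the conversion on your side, a quick check along these lines should work (assuming the detectron2 read_image utility that the mapper uses internally; the file path is a placeholder):

from detectron2.data.detection_utils import read_image

# Placeholder path to one of the grayscale JPGs.
img = read_image("path/to/grayscale_sample.jpg", format="RGB")
print(img.shape, img.dtype)  # an (H, W, 3) uint8 array is expected after conversion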

hg6185 commented 1 year ago

Hi! Thanks for the quick response! I started logging the image sizes, and the resizing works. I will see if this helps me resolve the memory issues.

rentainhe commented 1 year ago

> Hi! Thanks for the quick response! I started logging the image sizes, and the resizing works. I will see if this helps me resolve the memory issues.

Yes, you can also try gradient checkpointing to solve the memory issue; we have already supported this feature in DINO.
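If you want to experiment with it manually in the meantime, here is a minimal sketch of gradient checkpointing in plain PyTorch (not the detrex API, just the underlying idea of recomputing activations instead of caching them); for the option we already support, please refer to the DINO project configs:

import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    # Wraps a sub-module so its activations are recomputed during the backward
    # pass instead of being cached, lowering peak memory at the cost of extra compute.
    def __init__(self, block: torch.nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x):
        # Note: with the default (reentrant) implementation, at least one input
        # tensor must require grad for gradients to reach the wrapped parameters.
        return checkpoint(self.block, x)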

hg6185 commented 1 year ago

I will keep this in mind. Thank you very much!

So far, I have helped myself by freezing the backbone (https://arxiv.org/abs/2204.00484). I observed effects similar to those reported by Vasconcelos et al., although it seems that small objects in particular suffer from the backbone not being fine-tuned.
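For reference, freezing the backbone can be done along these lines (a minimal sketch, assuming the model exposes a .backbone module as the DETR-style models in detrex do):

import torch

def freeze_backbone(model: torch.nn.Module) -> None:
    # Disable gradients for every backbone parameter so the optimizer only
    # updates the transformer and the detection head.
    for param in model.backbone.parameters():
        param.requires_grad_(False)

If your config uses a ResNet backbone, detectron2 also exposes a freeze_at argument that freezes the stem and the first stages, which may be enough depending on the setup.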