NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Possible Memory leak with multi gpu training? #5087

Open twmht opened 1 year ago

twmht commented 1 year ago

Version

1.21

Describe the bug.

I used the code from the tutorial to train ImageNet (https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py), and I have six 1080 Ti GPUs.

However, the memory consumption of the 6th GPU was always larger and kept increasing during training, eventually throwing an OOM exception in the middle of training.

For example, here is my nvidia-smi output; the memory consumption of the 6th GPU was larger compared to the others.

[screenshot: nvidia-smi output]

Any idea?

Minimum reproducible example

Ref https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py

Relevant log output

No response

Other/Misc.

No response

Check for duplicates

twmht commented 1 year ago

Here are the related logs after training past half of the epochs:

[screenshot: training logs]

twmht commented 1 year ago

Now I've upgraded to the latest version of DALI, v1.30.0, to see if the problem is still there.

It seems like the memory consumption is much more normal than with v1.21.

[screenshot: nvidia-smi output]

JanuszL commented 1 year ago

Hi @twmht,

You can find all changes introduced in the recent DALI releases here; we fixed at least one detected memory leak. Regarding the uneven memory consumption: DALI uses memory pools, and when the memory usage on a given GPU crosses a threshold, another chunk is allocated and that is why one GPU can use more than the others (the memory consumed by randomly formed batches of samples can simply be higher on that GPU). You can also consider reducing the batch size to reduce the memory consumed.

twmht commented 1 year ago

> another chunk is allocated and that is why one GPU can use more than the others

@JanuszL

Thanks a lot for the explanation. So does this situation still exist in the latest version?

JanuszL commented 1 year ago

@twmht - if you are asking whether DALI can consume memory unevenly across GPUs, then the answer is yes. However, as the training progresses, due to entropy the memory consumption peak should be reached on each GPU and the size of the allocated pools should equalize. When it comes to the overall consumption, the best you can do is reduce the batch size and/or the number of CPU threads used (the mixed image decoder uses the CPU and creates one decoder instance per thread, and each of them needs separate scratch space memory).
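
For illustration only, here is a rough sketch (the file reader, paths, and sizes are placeholders, not the exact sample code) of where those two knobs live - both batch_size and num_threads are fixed when the pipeline object is created:

    from nvidia.dali import pipeline_def, fn, types

    @pipeline_def
    def train_pipe(data_dir, shard_id, num_shards):
        # Placeholder reader; the resnet50 sample shards the dataset the same way.
        jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True,
                                        shard_id=shard_id, num_shards=num_shards,
                                        name="Reader")
        # "mixed" decoding keeps one decoder instance (with its own GPU scratch
        # space) per CPU thread, so fewer threads also means less GPU memory.
        images = fn.decoders.image_random_crop(jpegs, device="mixed",
                                               output_type=types.RGB,
                                               random_area=[0.1, 1.0])
        images = fn.resize(images, resize_x=224, resize_y=224)
        return images, labels

    # Smaller batch_size / num_threads -> smaller per-GPU memory footprint.
    pipe = train_pipe(data_dir="/data/imagenet/train",  # placeholder path
                      shard_id=0, num_shards=6,
                      batch_size=128, num_threads=2, device_id=0)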

twmht commented 1 year ago

@JanuszL

It's not easy to determine the batch size, because memory was used almost equally across GPUs in the beginning. Everything is fine until we have trained for tens of epochs.

Like this:

[screenshot: nvidia-smi output]

However, as the training proceeds for tens of epochs, the memory usage of the 6th GPU increases a lot.

Like this:

[screenshot: nvidia-smi output]

The training is still in progress, but I guess it will eventually raise the exception anyway.

Is this still normal? I have upgraded to v1.30.0.

JanuszL commented 1 year ago

@twmht that is not expected from the DALI side. Could you run the DALI pipeline alone without the training and see if that memory growth still occurs to rule out the DL FW itself?
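
For reference, a minimal sketch of such a standalone run (assuming `pipe` is the training pipeline built as in the sample, with the reader named "Reader") could look like this - no model, no optimizer, just fetching and dropping batches while watching nvidia-smi:

    pipe.build()
    batches_per_epoch = pipe.epoch_size("Reader") // pipe.max_batch_size
    for epoch in range(90):
        for _ in range(batches_per_epoch):
            images, labels = pipe.run()  # fetch one batch and discard it
        print(f"epoch {epoch} finished")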

twmht commented 1 year ago

That sounds like a good thing to try.

twmht commented 1 year ago

> The training is still in progress, but I guess it will eventually raise the exception anyway.

It did.

[screenshot: OOM exception]

twmht commented 1 year ago

@JanuszL

Even after detaching the training part, I still observed abnormal memory allocation after several epochs:

[screenshot: nvidia-smi output]

JanuszL commented 1 year ago

@twmht, that is a good lead. Can you tell us which data set you are using for the test so we can reproduce it on our side? Is it ImageNet?

twmht commented 1 year ago

@JanuszL

I use ImageNet for training and convert it to MXNet Record format (using https://github.com/apache/mxnet/blob/master/tools/im2rec.py). I'm utilizing the DALIGenericIterator (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/pytorch_plugin_api.html#nvidia.dali.plugin.pytorch.DALIGenericIterator), and for the pipeline, I reference everything except the Reader from https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py.

And I'm using six 1080 Ti GPUs. Abnormal memory allocation issues always occur on the sixth GPU.

Here is the RecordIO file reader:

    device_id, num_gpus = get_dist_info()  # rank and world size of this process
    # One shard per GPU: each process reads only its own part of the RecordIO file.
    images, labels = fn.readers.mxnet(
        path=[db_path + '/train.rec'],
        index_path=[db_path + '/train.idx'],
        random_shuffle=True,
        initial_fill=32768,
        pad_last_batch=True,
        shard_id=device_id,
        num_shards=num_gpus,
        name='Reader')

PyTorch version is 1.12.1 and CUDA is 11.3.

szkarpinski commented 1 year ago

@twmht , thank you for your answer. We'll try to reproduce the issue.

szkarpinski commented 1 year ago

@twmht We've reproduced the issue and observed the sudden increase in memory consumption after multiple epochs, just as you described. However, it turns out not to be a bug.

There are two DALI pipelines (train and val) in the code and they both need to allocate some memory: our measurements show that in this case it's ~4GB per pipeline. Most of the time they run one after another, in which case each pipeline can reuse part of the memory of the other one. However, due to the asynchronous nature of the GPU it's possible that the pipelines overlap slightly, and in such a case the DALI memory pool has to grow significantly to have enough memory for both pipelines. The DALI memory pool hogs the memory, meaning that once it grows it doesn't return memory back to the GPU, and this is why we observe a sudden and permanent increase in memory usage that looks like a leak.

To sum up: a single pipeline uses ~4GB of memory, so it's expected for two pipelines to use ~8GB. DALI is trying to reuse the memory so the initial memory usage might be lower for tens of epochs, but eventually it will increase and reach ~8GB.

The best way to reduce the memory consumption is to reduce the batch size, but I can also suggest the following alternative ways to reduce memory usage of your code:

* You can limit the size of the GPU prefetching queue (see `prefetch_queue_depth`: https://docs.nvidia.com/deeplearning/dali/archives/dali_0210_beta/dali-developer-guide/docs/pipeline.html?highlight=prefetch_queue_depth#nvidia.dali.pipeline.Pipeline)

* You can try to run the validation pipeline on the CPU

* You can reduce the number of CPU threads used. Each thread has its own instance of the image decoder, and each instance of the image decoder has its own memory allocated on the GPU.
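
As a rough illustration of the first two points (the reader, paths, and sizes below are placeholders, not your exact setup), a CPU-decoding validation pipeline built with a smaller prefetch queue could look like this:

    from nvidia.dali import pipeline_def, fn, types

    @pipeline_def
    def val_pipe(data_dir):
        jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=False,
                                        name="Reader")
        # CPU decoding: no GPU scratch space is allocated for the decoder.
        images = fn.decoders.image(jpegs, device="cpu", output_type=types.RGB)
        images = fn.resize(images, resize_shorter=256)
        images = fn.crop_mirror_normalize(images, crop=(224, 224),
                                          dtype=types.FLOAT, output_layout="CHW",
                                          mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                          std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
        return images, labels

    # prefetch_queue_depth=1 keeps a single batch in flight instead of the default 2.
    val = val_pipe(data_dir="/data/imagenet/val",  # placeholder path
                   batch_size=64, num_threads=2, device_id=0,
                   prefetch_queue_depth=1)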

If you'll have any further questions, we'll be happy to help!

twmht commented 1 year ago

@JanuszL

Thank you!

In fact, I had already used the CPU for the validation pipeline, and it still raised OOM after training several epochs. I will try the other suggestions you provided.

szkarpinski commented 1 year ago

@twmht It's surprising that even with the validation pipeline on the CPU you still got OOM. Can you please share with us the exact Python code that you're using? We'll try again to reproduce your problem and investigate it further.

twmht commented 1 year ago

@szkarpinski , I use the validation pipeline from mmpretrain (https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/datasets/imagenet_bs32.py#L21), and I just use the mmpretrain code to train ImageNet. Because mmpretrain does not support the DALI data loader, I wrote my own, and it is identical to your sample code (https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py).

You can refer to the code here (https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/transforms/processing.py#L781).

You can directly use their provided API to create a PyTorch DataLoader (https://mmengine.readthedocs.io/en/latest/tutorials/dataset.html).

But I believe the issue should not be with the validation dataloader because it is unrelated to DALI. What do you think?

To the best of my recollection, earlier versions of DALI didn't have this issue, but I can't recall the specific version.

szkarpinski commented 1 year ago

@twmht , thank you for your response. I agree that a CPU-only, DALI-independent validation pipeline shouldn't cause any problems.

If I understand correctly, to reproduce your setup I should:

  1. Convert imagenet to mxnet record format using the tool you linked above
  2. Take our resnet sample and: remove pytorch code, replace the reader with mxnet reader, remove validation pipeline.

Is that correct?

I'd also like to confirm with you that the screenshot with 6GB used on sixth GPU and 4GB on the rest was taken when running without pytorch and without DALI validation pipeline. In other words, when running the repro described above I should expect to see memory grow to at least 6GB. Is that right?

twmht commented 1 year ago

@szkarpinski

> Convert imagenet to mxnet record format using the tool you linked above

Yes

> Take our resnet sample and: remove pytorch code, replace the reader with mxnet reader, remove validation pipeline.

More precisely, remove the PyTorch training code and have it solely retrieve batches from the dataloader.

> I'd also like to confirm with you that the screenshot with 6GB used on sixth GPU and 4GB on the rest was taken when running without pytorch and without DALI validation pipeline. In other words, when running the repro described above I should expect to see memory grow to at least 6GB. Is that right?

I still use PyTorch because I have to use DALIGenericIterator to build the PyTorch dataloader.

By the way, now I use four 2080 Ti GPUs and the memory allocation issue has been significantly alleviated. I'm not sure if it's related to the number of GPUs being used.

szkarpinski commented 1 year ago

@twmht I confirm that the problem is still reproducible even with one pipeline. We'll continue our investigation and let you know once we have some more information.

szkarpinski commented 1 year ago

@twmht I realized that those sudden increases in memory consumption after tens of epochs are statistically explainable and are due to the usage of image_random_crop.

In the pipeline, we use image_random_crop(..., random_area=[0.1, 1]). This means that we want the cropped image to be between 10% and 100% of the original size. DALI samples the actual fraction uniformly from that range. This means that the memory used by each image has a uniform distribution.

Images are processed in batches of 256. The total memory used by a batch (which is just the sum of the memory used by each image) has an approximately normal distribution, by the central limit theorem. This means that medium-size batches are the most frequent, while really big batches are very rare. It might take tens of thousands of iterations to get such a really big batch, and once we get it, the memory usage grows suddenly. Our calculations and simulations based on the above description match the experimental results.
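
A quick illustrative simulation of this effect (the numbers are made up, not measurements): if the per-image memory is Uniform(0.1, 1.0) "units" and the pool only ever grows to the largest batch sum seen so far, the pool keeps jumping up occasionally for many epochs, even though the average batch never changes:

    import numpy as np

    rng = np.random.default_rng(0)
    batch_size, iters_per_epoch = 256, 800  # illustrative shard size

    pool = 0.0
    for epoch in range(90):
        per_image = rng.uniform(0.1, 1.0, size=(iters_per_epoch, batch_size))
        batch_sums = per_image.sum(axis=1)  # memory needed by each batch
        if batch_sums.max() > pool:
            print(f"epoch {epoch:2d}: pool grows {pool:6.1f} -> {batch_sums.max():6.1f}")
            pool = batch_sums.max()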

This means that the memory usage might grow from time to time (sometimes very rapidly) for tens of epochs, until for some batch the pessimistic case is reached and all sampled areas are close to 1. You can estimate this pessimistic memory usage by running the pipeline without random cropping (i.e. by using decoders.image instead of decoders.image_random_crop).

In summary, the growth in memory consumption that we observe is statistically explainable and expected (but definitely counterintuitive!). To reduce memory usage, we recommend using a smaller batch size or applying the other suggestions listed in my previous answer.

If you observe some behaviour of DALI that contradicts this explanation, please let us know!

twmht commented 1 year ago

@szkarpinski I find it difficult to adjust the batch size because it takes several tens of epochs of training before encountering an 'Out of Memory' situation, and training for many epochs can take several days.

And this issue occurs on a specific GPU, while other GPUs seem to not have this problem. This also results in a lot of unused memory on the other GPUs, which seems like a waste. This can be quite challenging for users.

szkarpinski commented 1 year ago

@twmht I understand that reaching OOM after so long makes the problem hard to tackle and is frustrating. This behaviour is an unfortunate consequence of randomized processing and is hard to avoid. However, to make adjusting the batch size easier, you can do the following:

  1. Replace decoders.image_random_crop with just decoders.image in your pipeline. This will eliminate the random cropping and simulate the most pessimistic case in which 100% of every image is decoded.
  2. Run the pipeline for a few epochs. As there's no random cropping present, the peak memory usage should be reached relatively quickly. You have 6 shards, so after at most 6 epochs the memory usage should stabilize.
  3. If during (2) you reach OOM or the memory usage is dangerously high, try reducing your batch size by 30-50% and try again. Repeat that until your memory usage is acceptable.
  4. Once the memory usage is satisfactory, use decoders.image_random_crop again. The memory usage of randomly cropped images should never exceed the memory usage of uncropped ones, so you shouldn't get out of memory anymore.
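
A minimal sketch of step 1 above (a hypothetical helper, meant to be called inside the pipeline definition, with operator names as in the resnet50 sample):

    from nvidia.dali import fn, types

    def decode(jpegs, worst_case=False):
        if worst_case:
            # Full decode: every image takes its maximum possible memory.
            return fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
        # Normal training path: crop covering 10%-100% of the image area.
        return fn.decoders.image_random_crop(jpegs, device="mixed",
                                             output_type=types.RGB,
                                             random_area=[0.1, 1.0])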

I hope this way you'll be able to adjust your batch size faster.

> this issue occurs on a specific GPU, while other GPUs seem to not have this problem

On our side the issue occurred on other GPUs too. I assume you're always observing it on the 6th GPU because DALI uses pseudorandom generators with the same seeds for each run, meaning that image_random_crop's "randomness" will behave the same way on each run. If you wish, you can provide us with the complete Python file that you are using. I'll then run it on our side to test if I can reproduce the unlucky 6th GPU problem.

x-CK-x commented 2 months ago

> @twmht We've reproduced the issue and observed the sudden increase in memory consumption after multiple epochs, just as you described. However, it turns out not to be a bug.
>
> There are two DALI pipelines (train and val) in the code and they both need to allocate some memory: our measurements show that in this case it's ~4GB per pipeline. Most of the time they run one after another, in which case each pipeline can reuse part of the memory of the other one. However, due to the asynchronous nature of the GPU it's possible that the pipelines overlap slightly, and in such a case the DALI memory pool has to grow significantly to have enough memory for both pipelines. The DALI memory pool hogs the memory, meaning that once it grows it doesn't return memory back to the GPU, and this is why we observe a sudden and permanent increase in memory usage that looks like a leak.
>
> To sum up: a single pipeline uses ~4GB of memory, so it's expected for two pipelines to use ~8GB. DALI is trying to reuse the memory so the initial memory usage might be lower for tens of epochs, but eventually it will increase and reach ~8GB.
>
> The best way to reduce the memory consumption is to reduce the batch size, but I can also suggest the following alternative ways to reduce memory usage of your code:
>
> * You can limit the size of the GPU prefetching queue (see `prefetch_queue_depth`: https://docs.nvidia.com/deeplearning/dali/archives/dali_0210_beta/dali-developer-guide/docs/pipeline.html?highlight=prefetch_queue_depth#nvidia.dali.pipeline.Pipeline)
>
> * You can try to run the validation pipeline on the CPU
>
> * You can reduce the number of CPU threads used. Each thread has its own instance of the image decoder, and each instance of the image decoder has its own memory allocated on the GPU.
>
> If you'll have any further questions, we'll be happy to help!

I would like to re-open this ticket. I am training on 2x 3090s in a DDP configuration with dali_decode_device = "mixed", but my image augmentation pipeline only contains the following, in contrast to some of the pipelines mentioned earlier in this thread. Data augmentation code:

    def apply_augmentations(self, images):
        if 'flip' in self.augment_list:
            images = fn.flip(images, horizontal=1)
        if 'rotate' in self.augment_list:
            angle = fn.random.uniform(range=(-15.0, 15.0))
            images = fn.rotate(images, angle=angle, fill_value=0)
        if 'color_jitter' in self.augment_list:
            images = fn.color_twist(
                images,
                brightness=fn.random.uniform(range=(0.8, 1.2)),
                contrast=fn.random.uniform(range=(0.8, 1.2)),
                saturation=fn.random.uniform(range=(0.8, 1.2)),
                hue=fn.random.uniform(range=(-0.1, 0.1))
            )
        images = fn.resize(images, resize_x=self.img_width, resize_y=self.img_height)
        return images

pipeline code:

    def get_dali_pipeline(self, file_paths, labels, batch_size, num_threads, device_id, training=True):
        pipeline = Pipeline(batch_size=batch_size, num_threads=num_threads, device_id=device_id, prefetch_queue_depth=self.prefetch_queue_size)

        file_paths = [str(fp) for fp in file_paths]
        labels = [int(lbl) for lbl in labels]

        with pipeline:
            inputs, labels = fn.readers.file(files=file_paths, labels=labels, random_shuffle=training, name="Reader")
            decode_device = "mixed" if self.device.type == 'cuda' else "cpu"
            images = fn.decoders.image(inputs, device=decode_device, output_type=types.RGB)

            if training and self.is_augmentation:
                images = self.apply_augmentations(images)
            else:
                images = fn.resize(images, resize_x=self.img_width, resize_y=self.img_height)

            images = fn.crop_mirror_normalize(
                images,
                dtype=types.FLOAT,
                output_layout="CHW",
                mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
            )

            pipeline.set_outputs(images, labels)
        return pipeline

I wanted to ask if there are any other alternatives or low-level locking mechanisms to prevent that overlap between my train & validation pipelines, which causes the memory on gpu:0 to gradually grow to the point of an OOM?

I've tried the 3 aforementioned suggestions without much improvement; they slow the eventual OOM down marginally, depending on which combination of the 3 is used. At the same time, there is also a slight performance hit of a few iterations/sec.

Am open to trying any additional suggestions. ~ty

mzient commented 2 months ago

Hello @x-CK-x. In DALI 1.42 we've introduced a new (still experimental) executor. Among other things, it uses dynamic memory allocation when possible, with the end goal of increasing memory reuse along the pipeline. You can enable it by passing experimental_exec_dynamic=True to your pipeline. Version 1.42 should be available for download within the next 24 hours or so (or you can download a nightly build if you want to have it sooner).

mzient commented 2 months ago

Hello again, @x-CK-x. DALI 1.42 is now available - you can `pip install --upgrade nvidia-dali-cuda120` and try experimental_exec_dynamic=True to see if it alleviates your memory consumption issues.
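
For reference, a minimal sketch of enabling it in a pipeline built like get_dali_pipeline above - the only change is the extra keyword argument:

    from nvidia.dali import Pipeline

    pipeline = Pipeline(batch_size=64, num_threads=4, device_id=0,
                        prefetch_queue_depth=2,
                        experimental_exec_dynamic=True)  # requires DALI >= 1.42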

x-CK-x commented 2 months ago

Hi @mzient , I just finished testing the same things I did before, with the recommended changes and the new version. I validated two observations:

  1. If I leave both the train & validation pipelines on DALI mixed decoding, then the memory leak reoccurs even with experimental_exec_dynamic=True.
  2. But this time, with experimental_exec_dynamic=True, if I change the validation pipeline to CPU only, then even with a large batch size and a decent number of dedicated CPU threads there is no observed memory leak.

Before this version, and without experimental_exec_dynamic=True, when I tested 2) it also turned into an OOM. This time, however, it did not.

In my pipeline I don't have very large validation batches, so offloading validation to the CPU works well as a solution for my use case, while I can still keep the batch size & number of CPU threads large.

Thank you for your help on this!

x-CK-x commented 3 weeks ago

> Hello again, @x-CK-x. DALI 1.42 is now available - you can `pip install --upgrade nvidia-dali-cuda120` and try experimental_exec_dynamic=True to see if it alleviates your memory consumption issues.

I ran some additional tests as I grew my dataset substantially. I noticed that the VRAM in use during training never seems to decrease at any point, despite the gradual increase. I would've expected that, as new batches are processed at various resolutions, I'd be able to observe a subtle decrease in VRAM usage at some point, but that didn't seem to be the case.

Is there a viable way for me to free up VRAM on a per-batch basis without emptying the pipeline iterator?

mzient commented 3 weeks ago

> I ran some additional tests as I grew my dataset substantially. I noticed that the VRAM in use during training never seems to decrease at any point, despite the gradual increase. I would've expected that, as new batches are processed at various resolutions, I'd be able to observe a subtle decrease in VRAM usage at some point, but that didn't seem to be the case.

This is true and by design. DALI maintains a memory pool which never shrinks. The memory may be internally "free", but it's still owned by DALI.

> Is there a viable way for me to free up VRAM on a per-batch basis without emptying the pipeline iterator?

You can call dali.backend.ReleaseUnusedMemory(), which will purge all unoccupied physical blocks, but the performance penalty will be severe. Alternatively, you can use cudaMallocAsync as the underlying allocation method by specifying the environment variable DALI_USE_CUDA_MALLOC_ASYNC=1. This allocator by default releases (some) memory to the operating system.
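
For reference, a rough sketch of both options (the helper name is made up; the environment variable must be set before DALI allocates any GPU memory, e.g. before the pipelines are built):

    import os
    os.environ["DALI_USE_CUDA_MALLOC_ASYNC"] = "1"  # use cudaMallocAsync under the hood

    import nvidia.dali.backend as dali_backend

    def release_dali_pool():
        # Purge unoccupied physical blocks from DALI's memory pool.
        # Calling this every batch is very costly; once per epoch or after the
        # validation phase is a more realistic cadence.
        dali_backend.ReleaseUnusedMemory()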