twmht opened this issue 1 year ago
Here are the related logs after training for over half of the epochs.
Now I have upgraded to the latest version of DALI, v1.30.0, to see if the problem is still there.
It seems like the memory consumption is much more normal than in v1.21.
Hi @twmht,
You can find all the changes introduced in the recent DALI releases here; we have fixed at least one detected memory leak. Regarding the uneven memory consumption: DALI uses memory pools, so when the memory usage on a given GPU crosses a given threshold, another chunk is allocated and that is why one GPU can use more than the others (the memory consumed by the randomly formed batches of samples could simply be higher). You can also consider reducing the batch size to reduce the consumed memory.
another chunk is allocated and that is why one GPU can use more than the others
@JanuszL
I really appreciate the explanation. So does this situation still exist in the latest version?
@twmht - if you are asking whether DALI can consume memory unequally between GPUs, then the answer is yes. However, as the training progresses, due to entropy the memory consumption peak should be reached on each GPU and the size of the allocated pools should equalize. When it comes to the overall consumption, the best you can do is reduce the batch size and/or the number of CPU threads used (the mixed image decoder uses the CPU and creates one decoder instance per thread, and each of them needs separate scratch-space memory).
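Both of these knobs are ordinary pipeline-construction arguments. Below is a minimal, hypothetical sketch (the op graph, paths and values are illustrative and not taken from this issue) showing where batch_size and num_threads are set:

```python
# Minimal sketch (illustrative values): batch_size controls per-batch GPU memory,
# num_threads controls how many mixed-decoder instances (and scratch buffers) exist.
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn
import nvidia.dali.types as types

def make_train_pipe(data_dir, device_id, batch_size=128, num_threads=2):
    pipe = Pipeline(batch_size=batch_size,    # e.g. halved from 256 to save memory
                    num_threads=num_threads,  # fewer threads -> fewer decoder instances
                    device_id=device_id)
    with pipe:
        jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True,
                                        name="Reader")
        images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
        images = fn.resize(images, resize_x=224, resize_y=224)
        pipe.set_outputs(images, labels)
    return pipe
```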
@JanuszL
It's not easy to determine the batch size, because memory usage is almost equal across GPUs in the beginning; everything is fine until we have trained for tens of epochs, like this:
However, as the training proceeds to tens of epochs, the memory usage of the 6th GPU increases a lot, like this:
The training is still in progress, but I guess it will eventually still raise the exception.
Is this still normal? I have upgraded to v1.30.0.
@twmht that is not expected from the DALI side. Could you run the DALI pipeline alone without the training and see if that memory growth still occurs to rule out the DL FW itself?
That sounds like a good thing to try.
The training is still in progress, but I guess it will eventually still raise the exception.
It did.
@JanuszL
Even after detaching the training part, I still observed abnormal memory allocation after several epochs.
@twmht, that is a good lead. Can you tell us which data set you are using for the test so we can reproduce it on our side? Is it ImageNet?
@JanuszL
I use ImageNet for training and convert it to MXNet Record format (using https://github.com/apache/mxnet/blob/master/tools/im2rec.py). I'm utilizing the DALIGenericIterator (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/plugins/pytorch_plugin_api.html#nvidia.dali.plugin.pytorch.DALIGenericIterator), and for the pipeline, I reference everything except the Reader from https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py.
And I'm using six 1080 Ti GPUs. Abnormal memory allocation issues always occur on the sixth GPU.
Here is the RecordIO file reader:
device_id, num_gpus = get_dist_info()
images, labels = fn.readers.mxnet(
    path=[db_path + '/train.rec'],
    index_path=[db_path + '/train.idx'],
    random_shuffle=True,
    initial_fill=32768,
    pad_last_batch=True,
    shard_id=device_id,
    num_shards=num_gpus,
    name='Reader')
The PyTorch version is 1.12.1 and CUDA is 11.3.
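For context, here is a hedged sketch of how a reader like the one above could be wired into a sharded training pipeline and a DALIGenericIterator, mirroring the resnet50 example linked earlier; db_path, the crop size, the batch size and the hard-coded device_id/num_gpus (standing in for get_dist_info()) are assumptions for illustration, not the exact code used in this issue:

```python
# Hedged sketch mirroring the resnet50 example; values and helper names are assumptions.
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch import DALIGenericIterator, LastBatchPolicy

@pipeline_def
def train_pipeline(db_path, shard_id, num_shards):
    jpegs, labels = fn.readers.mxnet(
        path=[db_path + '/train.rec'], index_path=[db_path + '/train.idx'],
        random_shuffle=True, initial_fill=32768, pad_last_batch=True,
        shard_id=shard_id, num_shards=num_shards, name='Reader')
    images = fn.decoders.image_random_crop(jpegs, device='mixed',
                                           output_type=types.RGB,
                                           random_area=[0.1, 1.0])
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(images, dtype=types.FLOAT, output_layout='CHW',
                                      mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                      std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
                                      mirror=fn.random.coin_flip())
    return images, labels

device_id, num_gpus = 0, 6          # placeholders for get_dist_info()
pipe = train_pipeline('/data/imagenet', shard_id=device_id, num_shards=num_gpus,
                      batch_size=256, num_threads=4, device_id=device_id)
pipe.build()
loader = DALIGenericIterator(pipe, ['data', 'label'], reader_name='Reader',
                             last_batch_policy=LastBatchPolicy.PARTIAL)
for batch in loader:                # iterates one epoch thanks to reader_name
    images, labels = batch[0]['data'], batch[0]['label']
```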
@twmht, thank you for your answer. We'll try to reproduce the issue.
@twmht We've reproduced the issue and observed the sudden increase in memory consumption after multiple epochs, just as you described. However, it turns out not to be a bug.
There are two DALI pipelines (train and val) in the code and they both need to allocate some memory: our measurements show that in this case it's ~4GB per pipeline. Most of the time they run one after another, in which case each pipeline can reuse part of the memory of the other one. However, due to the asynchronous nature of the GPU it's possible that the pipelines overlap slightly, and in such a case the DALI memory pool has to grow significantly to have enough memory for both pipelines. The DALI memory pool hogs the memory, meaning that once it grows it doesn't return memory back to the GPU, and this is why we observe a sudden and permanent increase in memory usage that looks like a leak.
To sum up: a single pipeline uses ~4GB of memory, so it's expected for two pipelines to use ~8GB. DALI is trying to reuse the memory so the initial memory usage might be lower for tens of epochs, but eventually it will increase and reach ~8GB.
The best way to reduce the memory consumption is to reduce the batch size, but I can also suggest the following alternative ways to reduce memory usage of your code:
* You can limit the size of the GPU prefetching queue (see `prefetch_queue_depth`: https://docs.nvidia.com/deeplearning/dali/archives/dali_0210_beta/dali-developer-guide/docs/pipeline.html?highlight=prefetch_queue_depth#nvidia.dali.pipeline.Pipeline); a sketch follows after this list
* You can try to run the validation pipeline on the CPU
* You can reduce the number of CPU threads used. Each thread has its own instance of the image decoder and each instance of the image decoder has its own memory allocated on the GPU.

If you'll have any further questions, we'll be happy to help!
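To illustrate the first suggestion: prefetch_queue_depth is just another pipeline-construction argument. The sketch below reuses the hypothetical train_pipeline from the earlier sketch; the value 1 is illustrative (the default is 2):

```python
# Sketch: a shallower prefetch queue keeps fewer batches resident on the GPU.
# Reuses the hypothetical train_pipeline defined in the earlier sketch.
pipe = train_pipeline('/data/imagenet', shard_id=0, num_shards=6,
                      batch_size=256, num_threads=3, device_id=0,
                      prefetch_queue_depth=1)   # default is 2
```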
@JanuszL
Thank you!
In fact, I had already used the CPU for the validation pipeline and it still raised OOM after training several epochs. I will try the other suggestions you provided.
@twmht It's surprising that even with validation pipeline on the CPU you still got OOM. Can you please share with us the exact python code that you're using? We'll try again to reproduce your problem and investigate it further.
@szkarpinski, I use the validation pipeline from mmpretrain (https://github.com/open-mmlab/mmpretrain/blob/main/configs/_base_/datasets/imagenet_bs32.py#L21), and I just use mmpretrain code to train ImageNet. Because mmpretrain does not support the DALI data loader, I wrote my own, and it is identical to your sample code (https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py).
You can refer to the code here (https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/datasets/transforms/processing.py#L781).
You can directly use their provided API to create a PyTorch DataLoader(https://mmengine.readthedocs.io/en/latest/tutorials/dataset.html).
But I believe the issue should not be with the validation dataloader because it is unrelated to DALI. What do you think?
To the best of my recollection, earlier versions of DALI didn't have this issue, but I can't recall the specific version.
@twmht , thank you for your response. I agree that CPU-only DALI-independent validation pipeline shouldn't cause any problems.
If I understand correctly, to reproduce your setup I should:
1. Convert imagenet to mxnet record format using the tool you linked above
2. Take our resnet sample and: remove pytorch code, replace the reader with mxnet reader, remove validation pipeline.
Is that correct?
I'd also like to confirm with you that the screenshot with 6GB used on sixth GPU and 4GB on the rest was taken when running without pytorch and without DALI validation pipeline. In other words, when running the repro described above I should expect to see memory grow to at least 6GB. Is that right?
@szkarpinski
Convert imagenet to mxnet record format using the tool you linked above
Yes
Take our resnet sample and: remove pytorch code, replace the reader with mxnet reader, remove validation pipeline.
More precisely, remove the PyTorch training code and have it solely retrieve batches from the dataloader.
I'd also like to confirm with you that the screenshot with 6GB used on sixth GPU and 4GB on the rest was taken when running without pytorch and without DALI validation pipeline. In other words, when running the repro described above I should expect to see memory grow to at least 6GB. Is that right?
I still use PyTorch because I have to use DALIGenericIterator to build the PyTorch dataloader.
By the way, now that I'm using four 2080 Ti GPUs, the memory allocation issue has been significantly alleviated. I'm not sure if it's related to the number of GPUs being used.
@twmht I confirm that the problem is still reproducible even with one pipeline. We'll continue our investigation and let you know once we have some more information.
@twmht I realized that those sudden increases in memory consumption after tens of epochs are statistically explainable and are due to the usage of image_random_crop.
In the pipeline, we use image_random_crop(..., random_area=[0.1, 1]). This means that we want the cropped image to be between 10% and 100% of the original size. DALI samples the actual fraction uniformly from that range, so the memory used by each image has a uniform distribution.
Images are processed in batches of 256. The total memory used by a batch (which is just the sum of the memory used by each image) has an approximately normal distribution. This means that medium-size batches are the most frequent, while really big batches are very rare. It might take tens of thousands of iterations to get such a really big batch, and once we get it, the memory usage grows suddenly. Our calculations and simulations based on the above description match the experimental results.
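This argument can be checked with a small stand-alone simulation (plain NumPy, not DALI; the iteration counts are illustrative): per-image memory is drawn from Uniform(0.1, 1), a batch sums 256 such draws, and a grow-only pool tracks the running maximum, which still jumps occasionally even after many epochs:

```python
# Toy simulation of the statistics described above (not DALI code).
import numpy as np

rng = np.random.default_rng(0)
batch_size, iters_per_epoch, epochs = 256, 5000, 60
pool = 0.0                                   # grow-only "memory pool", in image-sized units
for epoch in range(epochs):
    areas = rng.uniform(0.1, 1.0, size=(iters_per_epoch, batch_size))
    peak = areas.sum(axis=1).max()           # largest batch seen this epoch
    if peak > pool:                          # the pool only ever grows
        pool = peak
        print(f"epoch {epoch:3d}: pool grows to {pool:6.1f} "
              f"(worst case would be {batch_size:.1f})")
```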
This means that the memory usage might grow from time to time (sometimes very rapidly) for tens of epochs until, for some batch, the pessimistic case is reached and all sampled areas are close to 1. You can estimate this pessimistic memory usage by running the pipeline without random cropping (i.e. by using decoders.image instead of decoders.image_random_crop).
In summary, the growth in memory consumption that we observe is statistically explainable and expected (but definitely counterintuitive!). To reduce memory usage, we recommend using a smaller batch size or applying the other suggestions listed in my previous answer.
If you observe some behaviour of DALI that contradicts this explanation, please let us know!
@szkarpinski I find it difficult to adjust the batch size because it takes several tens of epochs of training before encountering an 'Out of Memory' situation, and training for many epochs can take several days.
And this issue occurs on a specific GPU, while other GPUs seem to not have this problem. This also results in a lot of unused memory on the other GPUs, which seems like a waste. This can be quite challenging for users.
@twmht I understand that reaching OOM after so long makes the problem hard to tackle and is frustrating. This behaviour is an unfortunate consequence of randomized processing and is hard to avoid. However, to make adjusting the batch size easier, you can do the following:
1. Replace decoders.image_random_crop with just decoders.image in your pipeline (a minimal sketch of this switch follows below). This will eliminate the random cropping and simulate the most pessimistic case in which 100% of every image is decoded.
2. Once you have found a batch size that fits this pessimistic case, use decoders.image_random_crop again. The memory usage of randomly cropped images should never exceed the memory usage of uncropped ones, so you shouldn't get out of memory anymore.

I hope this way you'll be able to adjust your batch size faster.
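A minimal sketch of the switch described in step 1, assuming the usual fn/types imports; the pessimistic flag is a made-up name for illustration:

```python
import nvidia.dali.fn as fn
import nvidia.dali.types as types

def decode(jpegs, pessimistic=False):
    if pessimistic:
        # full decode: upper bound on per-batch memory, useful while tuning batch size
        return fn.decoders.image(jpegs, device='mixed', output_type=types.RGB)
    # normal training path with random cropping
    return fn.decoders.image_random_crop(jpegs, device='mixed', output_type=types.RGB,
                                         random_area=[0.1, 1.0])
```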
this issue occurs on a specific GPU, while other GPUs seem to not have this problem
On our side the issue occurred on other GPUs too. I assume you're always observing it on the 6th GPU due to the fact that DALI uses pseudorandom generators with seeds that are the same for each run, meaning that image_random_crop's "randomness" will behave the same way on each run. If you wish, you can provide us with the complete Python file that you are using. I'll then run it on our side to test whether I can reproduce the unlucky 6th GPU problem.
@twmht We've reproduced the issue and observed the sudden increase in memory consumption after multiple epochs, just as you described. However, it turns out not to be a bug.
There are two DALI pipelines (train and val) in the code and they both need to allocate some memory: our measurements show that in this case it's ~4GB per pipeline. Most of the time they run one after another, in which case each pipeline can reuse part of the memory of the other one. However, due to the asynchronous nature of the GPU it's possible that the pipelines overlap slightly, and in such a case the DALI memory pool has to grow significantly to have enough memory for both pipelines. The DALI memory pool hogs the memory, meaning that once it grows it doesn't return memory back to the GPU, and this is why we observe a sudden and permanent increase in memory usage that looks like a leak.
To sum up: a single pipeline uses ~4GB of memory, so it's expected for two pipelines to use ~8GB. DALI is trying to reuse the memory so the initial memory usage might be lower for tens of epochs, but eventually it will increase and reach ~8GB.
The best way to reduce the memory consumption is to reduce the batch size, but I can also suggest the following alternative ways to reduce memory usage of your code:
* You can limit the size of the GPU prefetching queue (see `prefetch_queue_depth`: https://docs.nvidia.com/deeplearning/dali/archives/dali_0210_beta/dali-developer-guide/docs/pipeline.html?highlight=prefetch_queue_depth#nvidia.dali.pipeline.Pipeline)
* You can try to run the validation pipeline on the CPU
* You can reduce the number of CPU threads used. Each thread has its own instance of the image decoder and each instance of the image decoder has its own memory allocated on the GPU.
If you'll have any further questions, we'll be happy to help!
I would like to re-open this ticket. I am training on 2x 3090s in a DDP configuration with dali_decode_device = "mixed", but my image augmentation pipeline only contains the following, in contrast to one of the later-mentioned specifications in this thread. Data augmentation code:
def apply_augmentations(self, images):
    if 'flip' in self.augment_list:
        images = fn.flip(images, horizontal=1)
    if 'rotate' in self.augment_list:
        angle = fn.random.uniform(range=(-15.0, 15.0))
        images = fn.rotate(images, angle=angle, fill_value=0)
    if 'color_jitter' in self.augment_list:
        images = fn.color_twist(
            images,
            brightness=fn.random.uniform(range=(0.8, 1.2)),
            contrast=fn.random.uniform(range=(0.8, 1.2)),
            saturation=fn.random.uniform(range=(0.8, 1.2)),
            hue=fn.random.uniform(range=(-0.1, 0.1))
        )
    images = fn.resize(images, resize_x=self.img_width, resize_y=self.img_height)
    return images
Pipeline code:
def get_dali_pipeline(self, file_paths, labels, batch_size, num_threads, device_id, training=True):
    pipeline = Pipeline(batch_size=batch_size, num_threads=num_threads, device_id=device_id,
                        prefetch_queue_depth=self.prefetch_queue_size)
    file_paths = [str(fp) for fp in file_paths]
    labels = [int(lbl) for lbl in labels]
    with pipeline:
        inputs, labels = fn.readers.file(files=file_paths, labels=labels, random_shuffle=training, name="Reader")
        decode_device = "mixed" if self.device.type == 'cuda' else "cpu"
        images = fn.decoders.image(inputs, device=decode_device, output_type=types.RGB)
        if training and self.is_augmentation:
            images = self.apply_augmentations(images)
        else:
            images = fn.resize(images, resize_x=self.img_width, resize_y=self.img_height)
        images = fn.crop_mirror_normalize(
            images,
            dtype=types.FLOAT,
            output_layout="CHW",
            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
        )
        pipeline.set_outputs(images, labels)
    return pipeline
I wanted to ask if there are any other alternatives or low-level locking mechanisms to prevent that overlap between my train & validation pipelines, which causes the memory on gpu:0 to gradually grow to the extent of an OOM?
I've tried the 3 aforementioned suggestions without much improvement; they slow the eventual OOM down marginally, depending on which combination of the 3 is used. At the same time, they also take a slight performance hit of a few iterations/sec.
Am open to trying any additional suggestions. ~ty
Hello @x-CK-x
In DALI 1.42 we've introduced a new (still experimental) executor. Among other things, it uses dynamic memory allocation when possible, with the end goal of increasing memory reuse along the pipeline. You can enable it by passing experimental_exec_dynamic=True to your pipeline.
Version 1.42 should be available for download within the next 24 hours or so (or you can download a nightly build, if you want to have it sooner).
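For a pipeline built like the get_dali_pipeline shown earlier in this thread, this amounts to a single extra constructor argument. A self-contained sketch (assuming DALI >= 1.42; the reader and decoder are placeholders):

```python
# Sketch (assumes DALI >= 1.42): the dynamic executor is enabled per pipeline
# at construction time; everything else in the pipeline stays unchanged.
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.fn as fn
import nvidia.dali.types as types

def build_pipeline(file_paths, labels, batch_size, num_threads, device_id):
    pipe = Pipeline(batch_size=batch_size, num_threads=num_threads, device_id=device_id,
                    experimental_exec_dynamic=True)   # opt in to the new executor
    with pipe:
        jpegs, lbls = fn.readers.file(files=file_paths, labels=labels, name="Reader")
        images = fn.decoders.image(jpegs, device="mixed", output_type=types.RGB)
        pipe.set_outputs(images, lbls)
    return pipe
```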
Hello again, @x-CK-x
DALI 1.42 is now available - you can pip install --upgrade nvidia-dali-cuda120 and try experimental_exec_dynamic=True to see if it alleviates your memory consumption issues.
Hi @mzient ,
I just finished testing the same things I did before with the changes recommended & with the new version.
I validated two observations:
1) If I leave both the train & validation pipelines using DALI mixed decoding, then the memory leak reoccurs even with experimental_exec_dynamic=True
2) But this time, using experimental_exec_dynamic=True, if I change the validation pipeline to CPU only, then even with a large batch size and a decent amount of dedicated CPU threads there is no observed memory leak
Before this version and without experimental_exec_dynamic=True, when I tested 2) it also turned into an OOM. This time, however, it did not.
In my pipeline I don't have very large validation batches, so offloading validation to the CPU works well as a solution for my use case, while also letting me keep the batch & cpu_thread sizes large.
Thank you for your help on this!
Hello again, @x-CK-x DALI 1.42 is now available - you can pip install --upgrade nvidia-dali-cuda120 and try experimental_exec_dynamic=True to see if it alleviates your memory consumption issues.
I ran some additional tests as I grew my dataset substantially. I did notice that the VRAM in use during training never seems to decrease at any point despite the gradual increase. I would have expected that, as new batches are processed at various resolutions, I would be able to observe at some point a subtle decrease in VRAM usage, but that didn't seem to be the case.
Is there a viable way for me to free up VRAM on a per-batch basis without emptying the pipeline iterator?
I ran some additional tests as I grew my dataset substantially. I did notice that the VRAM in use during training never seems to decrease at any point despite the gradual increase. I would have expected that, as new batches are processed at various resolutions, I would be able to observe at some point a subtle decrease in VRAM usage, but that didn't seem to be the case.
This is true and by design. DALI maintains a memory pool which never shrinks. The memory may be internally "free", but it's still owned by DALI.
Is there a viable way for me to free up VRAM on a per-batch basis without emptying the pipeline iterator?
You can call dali.backend.ReleaseUnusedMemory(), which will purge all unoccupied physical blocks, but the performance penalty will be severe.
Alternatively, you can use cudaMallocAsync as the underlying allocation method by specifying the environment variable DALI_USE_CUDA_MALLOC_ASYNC=1. This allocator by default releases (some) memory to the operating system.
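Both options sketched together (illustrative wiring only: the environment variable has to be visible before DALI allocates anything, so it is safest to set it before DALI is used, and calling the release function frequently will hurt performance as noted above):

```python
# Sketch of the two options mentioned above (illustrative wiring, not a
# recommendation to release memory every batch, given the performance penalty).
import os
os.environ.setdefault("DALI_USE_CUDA_MALLOC_ASYNC", "1")   # option 2: cudaMallocAsync pool

import nvidia.dali.backend as backend

def release_dali_pool():
    # option 1: explicitly return unoccupied blocks from DALI's memory pool
    backend.ReleaseUnusedMemory()
```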
Version
1.21
Describe the bug.
I used the code from the tutorial to train ImageNet (https://github.com/NVIDIA/DALI/blob/main/docs/examples/use_cases/pytorch/resnet50/main.py) on six 1080 Ti GPUs.
However, the memory consumption of the 6th GPU was always larger and kept increasing during training, and it would throw an OOM exception in the middle of training.
For example, here is my nvidia-smi output; the memory consumption of the 6th GPU was larger compared to the others.
Any idea?