BTW: I've been doing several tests and reinstantiating the iterator epoch-wise seems to be what makes it crash. Is there any alternative to that?
Hi @JuanFMontesinos, the memory consumption can also depend on the data that you are processing. Are you running out of Host or Device memory?
I think it's due to an OOM error. I was wondering whether the memory used by DALI is freed if I reset it epoch-wise.
How are you doing the free and reset of DALI?
Do you mean calling reset() on the iterator? It doesn't free the memory.
We tend not to free the GPU memory, as it is expensive to deallocate and allocate again, and if we went over the dataset once in an epoch, there is a good chance of not needing to allocate again in the next epoch. The more epochs you go through, the less probable new allocations become.
If you need a different pipeline, you can force the pipeline to deallocate (you can't do this through the iterator).
Removing all references to pipe or calling del on it should allow you to free the memory; I'm not sure if you actually need to force the garbage collection. Some examples are in:
If the memory usage patterns of DALI don't suit your use case, you can alter them and allow DALI to reallocate memory when it currently needs smaller buffers. You can read more on that topic in the advanced section of our docs: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/advanced_topics.html#memory-consumption
Just mentioning this to check whether raising StopIteration in iter_setup is the proper way to go, and whether PREFETCH_QUEUE_DEPTH = 3 can cause the OOM for the given setup.
The PREFETCH_QUEUE_DEPTH will have some impact on occupied memory (the bigger the queue, the more memory is used), but it should stabilize at some point.
It works by creating a queue of buffers between stages (CPU/Mixed/GPU) and at the outputs of your pipeline. I think we tend to use PREFETCH_QUEUE_DEPTH=2, but the optimal configuration depends on how fast your data is processed vs how fast it can be consumed.
Between the operators in a given stage, for example VideoReader -> Normalize in your case, the buffers are not duplicated, as a given stage processes one batch at a time and outputs to the mentioned queue of PREFETCH_QUEUE_DEPTH buffers.
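For reference, the queue depth is the prefetch_queue_depth argument of the Pipeline constructor. A minimal sketch of where the knob goes (the class name is a placeholder and the operators/define_graph are omitted):

from nvidia.dali.pipeline import Pipeline

class MyPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        # prefetch_queue_depth controls how many batches are buffered
        # ahead of the consumer; 2 is the default
        super(MyPipeline, self).__init__(batch_size, num_threads, device_id,
                                         seed=42, prefetch_queue_depth=2)
    # ... operators and define_graph as usual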
Hi, after doing some tests, these are my insights.
So, concretely, what (I think) is happening is the following:
I'm working with 2 GPUs, one for training and another one for preprocessing.
I need to load paired audiovisual files, therefore I made a static DALI pipe (no shuffle), precomputing which samples to load in an epoch. This allows me to load paired audio and video by making use of file_list_frame_num together with a list of audio files to load.
Preprocessed files are converted into PyTorch tensors by the DALI iterator (so they live on cuda1). I move them to the CPU and then to the training GPU (GPU 0). In theory cuda0 and cuda1 aren't connected in any way (at least from the side of my code), since I also made sure to re-create the tensor once it is copied to cuda0.
I think there is a conflict in memory allocation/access between PyTorch and DALI, because the exception is raised while running a PyTorch convolution on cuda0. Is it possible that NVIDIA DALI initializes something (cuDNN, CUDA) on GPU 0 even if that GPU is not explicitly used? I see it takes 12 MB from all the GPUs no matter what.
These are my observations about memory usage. Code:
for trainer.epoch in range(trainer.start_epoch, trainer.EPOCHS):
    if DALI:
        try:
            # drop the previous epoch's pipeline and iterator before rebuilding
            del pipe
            del train_loader
            torch.cuda.empty_cache()
        except NameError:
            pass  # first epoch: nothing to delete yet
        pipe = VnBSS.get_dali_pipeline(batch_size=ex.hyptrs.batch_size,
                                       num_threads=cpu_count(),
                                       device_id=1,
                                       dataset=train_ds,
                                       seed=-1,
                                       debug=DEBUG,
                                       resize=ex.hyptrs.resize)
        print('Building DALI pipeline... (This may take a while)')
        pipe.build()
        print('Done!')
        train_loader = dali_processor(
            DALIGenericIterator([pipe],
                                output_map=['sp1', 'sp2', 'spm', 'sk', 'vd', 'ad1', 'ad2', 'index'],
                                size=-1))
                                # size=pipe.epoch_size('video')))
    # for i, _ in enumerate(train_loader):
    #     if i == 5:
    #         break
    # torch.backends.cudnn.enabled = False
    trainer.run_epoch(train_loader, 'train', metrics=['loss'], send=send)
    torch.cuda.empty_cache()
With regard to your question:
How are you doing the free and reset of DALI?
I was assuming that by overwriting the variables pipe and train_loader everything would be OK.
So it's not an OOM issue. The code runs for 3 epochs and then crashes.
Modifying the code according to those posts:
pipe._pipe = None
del pipe
del train_loader
torch.cuda.empty_cache()
gc.collect()
This frees more memory on cuda1.
Is there a way to tell DALI to ignore cuda0 (without using CUDA_VISIBLE_DEVICES, since I need PyTorch to be aware of it)?
Some more weird data: if I reduce the batch size by 1/2, it happens one epoch later. If I reduce it by 1/4, it happens at epoch 16.
So is there a way to pass CUDA_VISIBLE_DEVICES only to DALI? (Setting it via os.environ doesn't work for either torch or DALI.)
Another hint is that setting torch.backends.cudnn.enabled = False solves the issue (but harms the speed).
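If disabling cuDNN globally is too costly, a narrower check could be to turn it off only around the suspect call (just a sketch to confirm that the cuDNN convolution path is what triggers it; model and batch stand in for your own objects):

import torch

# run only the forward pass without cuDNN and see whether the crash follows it
with torch.backends.cudnn.flags(enabled=False):
    output = model(batch)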
BTW: running everything on GPU 0 solves the issue, but I don't really know why.
DALI should stick to using the device that you provided; we set the current device and reset it back with the DeviceGuard (https://github.com/NVIDIA/DALI/blob/master/include/dali/core/device_guard.h).
Theoretically, all calls should use the device that you provided, so there should be no need to hide any devices from DALI.
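If you want to double-check that from the Python side, a quick sanity check (just a sketch; it only verifies that the device PyTorch sees is left untouched after building your pipeline with device_id=1) could be:

import torch

torch.cuda.set_device(0)              # the training device
print(torch.cuda.current_device())    # 0

pipe.build()                          # 'pipe' is your DALI pipeline created with device_id=1

print(torch.cuda.current_device())    # should still print 0 if the device is restored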
Some more weird data: if I reduce the batch size by 1/2, it happens one epoch later. If I reduce it by 1/4, it happens at epoch 16.
This looks like it's scaling with the batch size and would indicate some OOM. Are you sure you're not accumulating the data somewhere? If you somehow keep the tensors obtained from DALI they might not be freed.
Can you check your GPU memory occupancy? Something like:
nvidia-smi --query-gpu=name,memory.used,utilization.memory --format=csv -l 1
can be helpful to log the memory used by each GPU.
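If you prefer to log it from inside the training loop, something along these lines should also work (a sketch using the pynvml bindings, which need to be installed separately):

import pynvml

pynvml.nvmlInit()

def log_gpu_memory(tag=''):
    # print the used memory (in MiB) of every visible GPU
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print('{} GPU{}: {} MiB used'.format(tag, i, info.used // (1024 ** 2)))

# e.g. call log_gpu_memory('epoch start') and log_gpu_memory('epoch end')
# at the beginning and end of every epoch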
Can you try to prepare some minimal example that reproduces the issue, so we could try analyzing it? The best would be if you could repro it just by passing some dummy data through a basic DALI pipeline and copying it to some simple torch nn for consumption (that probably needs to utilize cuDNN somehow).
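Something roughly along these lines could be a starting point (a sketch, not a tested reproduction; it assumes the class-based Pipeline API with ExternalSource feeding random data, DALI on GPU 1 and the convolution on GPU 0):

import numpy as np
import torch
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
from nvidia.dali.plugin.pytorch import DALIGenericIterator

class DummyPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(DummyPipeline, self).__init__(batch_size, num_threads, device_id, seed=42)
        self.source = ops.ExternalSource()

    def define_graph(self):
        self.data = self.source()
        return self.data.gpu()

    def iter_setup(self):
        # feed a random batch of dummy "images" every iteration
        batch = [np.random.rand(3, 224, 224).astype(np.float32)
                 for _ in range(self.batch_size)]
        self.feed_input(self.data, batch)

pipe = DummyPipeline(batch_size=4, num_threads=2, device_id=1)
pipe.build()
loader = DALIGenericIterator([pipe], output_map=['data'], size=40)

# a tiny torch module on GPU 0 that goes through cuDNN
net = torch.nn.Conv2d(3, 8, kernel_size=3).cuda(0)

for it in loader:
    x = it[0]['data']     # comes out on cuda:1 (the DALI device)
    x = x.cpu().cuda(0)   # the cuda1 --> cpu --> cuda0 route from the workaround
    y = net(x)            # cuDNN convolution on cuda:0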
I'm gonna give up.
I tried to reproduce it at the very beginning, but it seems to involve cuDNN.
It's a bit strange that it runs perfectly (without extra paraphernalia like gc and the memory tricks) if I use cuda0 for both DALI and PyTorch.
The GPU log is roughly what is reported in that table (that's GPU usage according to nvidia-smi at different times, obtained by setting breakpoints).
I've been looking at the GPU usage the whole time (nvidia-smi -n 0) and it doesn't look like an OOM (it's never close to the GPU's max memory). I think there is some sort of bug with this hardware. I realized some time ago that on this computer, copying tensors cuda1 -> cuda0 was making the tensors allocated on cuda:0 come out zeroed.
I was using a bypass by doing cuda1 --> cpu --> cuda0 here, but I'm wondering if that can be wrong too.
Anyway, I can confirm that building the pipe takes 10 MB from the unused GPU (nvidia-smi shows memory usage going from 2 MB --> 12 MB), and this doesn't occur if that GPU is disabled via CUDA_VISIBLE_DEVICES.
It's not a DALI question, but is there any test I can run to check whether allocation between devices is OK?
Thank you for your time anyway. If I discover something else, I will let you know.
Hi, I think the problem may come from memory fragmentation (but this is my guess).
Anyway, I can confirm that building the pipe takes 10 MB from the unused GPU (nvidia-smi shows memory usage going from 2 MB --> 12 MB), and this doesn't occur if that GPU is disabled via CUDA_VISIBLE_DEVICES.
Can you provide a minimal repro for that? Does it happen when you build the pipeline or when you create the iterator (when you create the iterator, PyTorch may create some context on GPU 0, as we import it there)? Another thing that comes to my mind is that maybe some operator is misbehaving and allocates something on GPU 0.
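To narrow it down, a check like this could help pinpoint the step (a sketch; pipe_kwargs and output_names stand in for the arguments from your snippet, and it just shells out to nvidia-smi):

import subprocess

def used_mb():
    # memory.used per GPU, in MiB, as reported by nvidia-smi
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'])
    return [int(v) for v in out.decode().split()]

print('baseline              :', used_mb())
pipe = VnBSS.get_dali_pipeline(**pipe_kwargs)
print('after pipe creation   :', used_mb())
pipe.build()
print('after pipe.build()    :', used_mb())
train_loader = DALIGenericIterator([pipe], output_map=output_names, size=-1)
print('after iterator created:', used_mb())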
I think there is some sort of bug with this hardware. I realized some time ago that on this computer, copying tensors cuda1 -> cuda0 was making the tensors allocated on cuda:0 come out zeroed. I was using a bypass by doing cuda1 --> cpu --> cuda0 here, but I'm wondering if that can be wrong too.
According to https://discuss.pytorch.org/t/how-does-pytorch-transfer-data-between-gpus/83954
.to('cuda:0') does a device-to-device copy of the tensor. Did you try that, and does it produce any weird results?
It's not a concrete idea, but maybe there is some race between copying D2D and the data being produced on some PyTorch stream (which the zeros might suggest).
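One more thing that could be worth checking (a sketch, assuming both GPUs are visible to PyTorch) is whether peer access is reported at all and whether a plain round-trip copy survives:

import torch

# does the driver report peer-to-peer access between the two devices?
print(torch.cuda.can_device_access_peer(0, 1))
print(torch.cuda.can_device_access_peer(1, 0))

# round-trip copy cuda:1 -> cuda:0, compared on the CPU against the source
src = torch.arange(16, dtype=torch.float32, device='cuda:1')
dst = src.to('cuda:0')
torch.cuda.synchronize(1)
torch.cuda.synchronize(0)
print(torch.equal(src.cpu(), dst.cpu()))   # expected to be True on healthy hardware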
Hi, Thank you very much for helping.
import torch

a = torch.rand(7).cuda(1)   # allocate a random tensor on cuda:1
b = a.cuda()                # copy to cuda:0; same effect with a.to('cuda:0')
torch.cuda.synchronize()
c = b.cpu()
print(a)
print(b)                    # expected to match a, but comes out as zeros
print(b.sum())
So this seems to produce an issue:
tensor([0.8375, 0.2387, 0.0349, 0.9849, 0.6205, 0.1031, 0.8084],
device='cuda:1')
tensor([0., 0., 0., 0., 0., 0., 0.], device='cuda:0')
tensor(0., device='cuda:0')
Process finished with exit code 0
I asked about it when I realized, and they think it can be a hardware defect. The issue doesn't occur the other way around (from GPU 0 to GPU 1).
Anyway, I've been trying to reproduce it and I cannot. I used roughly the same operators and none of them seems to be the problem. I'll let you know if I discover anything else.
Thank you very much
As it is not DALI related, you can try to report it to the PyTorch developers and see how that goes.
Yep, I know. Just mentioning it since you may have heard of it. They have no clue and assume a hardware error. Anyway, thanks for your help.
Hi, I've got this error and I think it can be related to DALI.
I'm training a network with the following code:
The important point here is that I have to redefine DALI's pipeline epoch-wise. I've realized that memory usage increases through training and I end up getting this error.
I think it's due to an OOM error. I was wondering whether the memory used by DALI is freed if I reset it epoch-wise.
In case it's necessary, the pipeline is defined as:
Where