NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

CUDNN_STATUS_MAPPING_ERROR #2225

Closed JuanFMontesinos closed 4 years ago

JuanFMontesinos commented 4 years ago

Hi, I've got this error and I think it can be related to DALI.

I'm training a network with the following code:

    with ex.autoconfig(trainer) as trainer:
        for trainer.epoch in range(trainer.start_epoch,trainer.EPOCHS):
            if DALI:
                pipe = VnBSS.get_dali_pipeline(batch_size=ex.hyptrs.batch_size,
                                               num_threads=cpu_count(),
                                               device_id=1,
                                               dataset=train_ds,
                                               seed=5,
                                               debug=DEBUG,
                                               resize=ex.hyptrs.resize)
                print('Building DALI pipeline... (This may take  a while)')
                pipe.build()
                print('Done!')
                train_loader = dali_processor(
                    DALIGenericIterator([pipe], output_map=['sp1', 'sp2', 'spm', 'sk', 'vd', 'ad1', 'ad2', 'index'],
                                        size=-1))
                                        # size=pipe.epoch_size('video')))
            trainer.run_epoch(train_loader, 'train', metrics=['loss'], send=send)

            with torch.no_grad():
                trainer.run_epoch(valid_loader, 'val',
                                  metrics=['loss', 'sdr'],
                                  checkpoint=trainer.checkpoint(metric='loss', freq=2),
                                  send=send)
                if ex.hyptrs.validate_roch:
                    trainer.run_epoch(urmp__loader, 'urmp', metrics=['loss', 'sdr'], send=send)

The important key point here is that I have to redefine DALI's pipeline epoch-wise. I've realized that memory usage increases throughout training and I end up getting this error.

  File "/home/jfm/.local/lib/python3.6/site-packages/flerken/framework/framework.py", line 206, in backpropagate
    self.loss.backward()
  File "/home/jfm/.local/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/jfm/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
cuDNN error: CUDNN_STATUS_MAPPING_ERROR
Exception raised from operator() at /pytorch/aten/src/ATen/native/cudnn/Conv.cpp:980 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f7e2106d1e2 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xebae82 (0x7f7e22390e82 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xebcdb5 (0x7f7e22392db5 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xeb800e (0x7f7e2238e00e in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xeb9bfb (0x7f7e2238fbfb in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #5: at::native::cudnn_convolution_backward_input(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0xb2 (0x7f7e22390152 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xf1f35b (0x7f7e223f535b in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xf4f178 (0x7f7e22425178 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #8: at::cudnn_convolution_backward_input(c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool) + 0x1ad (0x7f7e5d2cd88d in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #9: at::native::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x223 (0x7f7e2238e823 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xf1f445 (0x7f7e223f5445 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #11: <unknown function> + 0xf4f1d4 (0x7f7e224251d4 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cuda.so)
frame #12: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7f7e5d2dc242 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x2ec9c62 (0x7f7e5ef9fc62 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2ede224 (0x7f7e5efb4224 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #15: at::cudnn_convolution_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::ArrayRef<long>, c10::ArrayRef<long>, long, bool, bool, std::array<bool, 2ul>) + 0x1e2 (0x7f7e5d2dc242 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::generated::CudnnConvolutionBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x258 (0x7f7e5ee26c38 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x3375bb7 (0x7f7e5f44bbb7 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #18: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1400 (0x7f7e5f447400 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #19: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) + 0x451 (0x7f7e5f447fa1 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #20: torch::autograd::Engine::thread_init(int,std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x89 (0x7f7e5f440119 in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
frame #21: torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) + 0x4a (0x7f7e6cbe04ba in /home/jfm/.local/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #22: <unknown function> + 0xbd6df (0x7f7e6dd3c6df in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #23: <unknown function> + 0x76db (0x7f7e701f56db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #24: clone + 0x3f (0x7f7e7052ea3f in /lib/x86_64-linux-gnu/libc.so.6)

I think it's due to an OOM error. I was wondering whether the memory used by DALI is freed if I reset it epoch-wise.

In case it's necessary, the pipeline is defined as:

class AudioVisualPipe(Pipeline):
    def __init__(self, batch_size: int, num_threads: int, device_id: int, dataset: BSSDataset, seed: int, debug: bool,
                 resize: bool):
        super(AudioVisualPipe, self).__init__(batch_size, num_threads, device_id, seed=seed,
                                              prefetch_queue_depth=PREFETCH_QUEUE_DEPTH,
                                              exec_pipelined=EXEC_PIPELINED)
        self.dataset = dataset
        self.debug = debug
        if resize:
            self.input = ops.VideoReaderResize(device="gpu", file_list=DALI_VIDEO_DATASET_PATH,
                                               sequence_length=N_VIDEO_FRAMES,
                                               shard_id=0, num_shards=1, file_list_frame_num=True,
                                               random_shuffle=False, skip_vfr_check=True,
                                               resize_x=VID_RESIZE_X, resize_y=VID_RESIZE_Y)
        else:
            self.input = ops.VideoReader(device="gpu", file_list=DALI_VIDEO_DATASET_PATH,
                                         sequence_length=N_VIDEO_FRAMES,
                                         shard_id=0, num_shards=1, file_list_frame_num=True,
                                         random_shuffle=False, skip_vfr_check=True)
        self.normalize = ops.Normalize(device='gpu', batch=False, axes=[0, 1, 2])
        # self.normalize = ops.Normalize(device='gpu', batch=False)
        # Axes are the dims over which you don't want to normalize. The batch dim doesn't count in the indexing.
        # This means that (mean, std) = (0, 1) in the missing dimension, 3.

        if STFT_WINDOW.__name__ == 'hann_window':
            window_fn = []
            # window_fn = STFT_WINDOW(N_FFT).tolist()
        else:
            window_fn = STFT_WINDOW(N_FFT).tolist()
        self.spectrogram = ops.Spectrogram(device="gpu",
                                           nfft=N_FFT,
                                           window_length=N_FFT,
                                           window_step=HOP_LENGTH,
                                           window_fn=window_fn,
                                           power=1)  # power=1 matches torch/librosa
        # power=2 would give the squared (power) spectrogram of the magnitude
        self.index_data = ops.ExternalSource()
        self.audio_main_data = ops.ExternalSource()
        self.audio_slave_data = ops.ExternalSource()
        self.skeleton_data = ops.ExternalSource()
        if self.debug:
            self.traces = self.dataset.prepare_data_for_dali(DALI_VIDEO_DATASET_PATH, 5)
        else:
            self.traces = self.dataset.prepare_data_for_dali(DALI_VIDEO_DATASET_PATH, 40)
        # Divide the list in chunks of size=batch_size
        self.idx = 0
        print('Precomputing samples in the batch')
        self.traces = [self.traces[i * batch_size:(i + 1) * batch_size] for i in
                       range((len(self.traces) + batch_size - 1) // batch_size)]
        print('Done!')

    def define_graph(self):
        video = self.input(name='video')
        video = self.normalize(video[0])
        self.audio_main = self.audio_main_data(name='audio_main').gpu()
        self.audio_slave = self.audio_slave_data(name='audio_slave').gpu()
        self.index = self.index_data()
        sp_main = self.spectrogram(self.audio_main)
        sp_slave = self.spectrogram(self.audio_slave)
        sp_mix = self.spectrogram(self.audio_main + self.audio_slave)
        self.skeleton = self.skeleton_data(name='skeleton')

        # Expected output by pytorch ['sp1', 'sp2', 'spm', 'sk', 'vd']
        return sp_main, sp_slave, sp_mix, self.skeleton, video, self.audio_main, self.audio_slave, self.index

    @staticmethod
    def audio_aug(audio_list: List[np.ndarray]):
        tmp = []
        for audio in audio_list:
            if audio.sum() > AUDIO_LENGTH * 0.15:
                audio /= audio.max()
            # audio *= float(randint(7, 10) / 10)
            tmp.append(audio)
        return tmp

    def iter_setup(self):
        trace_main = reformat_trace(self.traces[self.idx], 0)
        trace_slave = reformat_trace(self.traces[self.idx], 1)
        audio_main, skeleton = self.dataset.getitem(trace_main, 2, ['audio', 'skeleton_npy'])
        # Converting int16 into -1,1 float
        audio_main = [(x / 2 ** 15).astype(np.float32) if x.dtype == np.int16 else x for x in
                      audio_main]
        # Data Augmentation
        audio_main = self.audio_aug(audio_main)
        if self.debug:
            for audio_i in audio_main:
                if audio_i.max() > 1:
                    raise Exception(f'Audio is being read as {audio_i.dtype}')
            for sk_i in skeleton:
                if sk_i.min() < 0:
                    warn(f'Skeleton values are below 0. Dtype: {sk_i.dtype}')
        audio_slave = self.dataset.getitem(trace_slave, 2, ['audio'])

        audio_slave = [(x / 2 ** 15).astype(np.float32) if x.dtype == np.int16 else x for x in
                       audio_slave[0]]  # Converting int16 into -1,1 float
        audio_slave = self.audio_aug(audio_slave)
        if self.debug:
            for audio_i in audio_slave:
                if audio_i.max() > 1:
                    raise Exception(f'Audio is being read as {audio_i.dtype}')
        self.feed_input(self.audio_main, audio_main)
        self.feed_input(self.skeleton, skeleton)
        self.feed_input(self.audio_slave, audio_slave)
        self.feed_input(self.index, [np.array(self.idx) for _ in range(len(audio_main))])
        self.idx += 1
        if self.idx >= len(self.traces):
            raise StopIteration

Where

PREFETCH_QUEUE_DEPTH = 3
EXEC_PIPELINED = True

I'm also mentioning this to check whether raising StopIteration in iter_setup is the proper way to go, and whether PREFETCH_QUEUE_DEPTH = 3 can cause the OOM for the given setup.

JuanFMontesinos commented 4 years ago

BTW: I've been doing several tests, and re-instantiating the iterator epoch-wise seems to be what makes it crash. Is there any alternative to that?

klecki commented 4 years ago

Hi @JuanFMontesinos, the memory consumption can also depend on the data that you are processing. Are you running out of Host or Device memory?

I think it's due to an OOM error. I was wondering whether the memory used by DALI is freed if I reset it epoch-wise.

How are you doing the free and reset of DALI? Do you mean calling reset() on the iterator? That doesn't free the memory.

We tend not to free the GPU memory, as it is expensive to deallocate and allocate again, and if we went over the dataset once in an epoch there is a good chance of not needing to allocate again in the next epoch. The more epochs you go through, the less probable new allocations become.

If you need a different pipeline, you can force the pipeline to deallocate (you can't do this through the iterator). Removing all references to pipe, or calling del on it, should allow you to free the memory; I'm not sure if you actually need to force garbage collection. Some examples are in:

#2070 and #1842.
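For reference, a minimal sketch of such a teardown between epochs could look like the following (the variable names pipe and train_loader are taken from the snippet above, and the _pipe trick comes from the linked issues; whether you actually need gc.collect() may depend on your setup):

import gc
import torch

# Drop every reference to the pipeline and iterator so their buffers can be reclaimed.
pipe._pipe = None          # release the backend object held by the Python wrapper (as in the linked issues)
del pipe
del train_loader
gc.collect()               # optionally force collection of lingering references
torch.cuda.empty_cache()   # and let PyTorch return its cached blocks as well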

If the memory usage patterns of DALI don't suit your use case, you can alter them and allow DALI to reallocate memory when it currently needs smaller buffers. You can read more on that topic in the advanced section of our docs: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/advanced_topics.html#memory-consumption

I'm also mentioning this to check whether raising StopIteration in iter_setup is the proper way to go, and whether PREFETCH_QUEUE_DEPTH = 3 can cause the OOM for the given setup.

The PREFETCH_QUEUE_DEPTH will have some impact on occupied memory (the bigger the queue, the more memory will be used), but it should stabilize at some point. It works by creating a queue of buffers between stages (CPU/Mixed/GPU) and at the outputs of your pipeline. I think we tend to use PREFETCH_QUEUE_DEPTH=2, but the optimal configuration can depend on how fast your data is processed vs how fast it can be consumed. Between the operators in a given stage, for example VideoReader -> Normalize in your case, the buffers are not duplicated, as a given stage processes one batch at a time and outputs to the mentioned queue of PREFETCH_QUEUE_DEPTH buffers.

JuanFMontesinos commented 4 years ago

Hi, after doing some tests I will explain my insights:

In concrete terms, what (I think) is happening is the following: I'm working with 2 GPUs, one for training and another one for preprocessing. I need to load paired audiovisual files, so I made a static DALI pipe (no shuffle) that precomputes which samples to load in an epoch. This allows me to load paired audio and video by making use of file_list_frame_num and keeping a list of audio files to load.

Preprocessed files are converted into PyTorch tensors by the DALI iterator (so they live on cuda:1). I move them to the CPU and then to the training GPU (GPU 0). In theory cuda:0 and cuda:1 aren't connected in any way (at least from the side of my code), since I also made sure to recreate the tensor once it's copied to cuda:0.

I think there is a conflict in memory allocation/access between PyTorch and DALI, because the exception is raised while running a PyTorch convolution on cuda:0. Is it possible that NVIDIA DALI is initializing some package (cuDNN, CUDA) on GPU 0 even if that GPU is not explicitly used? I see it takes 12 MB from all the GPUs no matter what.

These are my observations about memory usage. Code:

        for trainer.epoch in range(trainer.start_epoch,trainer.EPOCHS):
            if DALI:
                try:
                    del pipe
                    del train_loader
                    torch.cuda.empty_cache()
                except:
                    pass

                pipe = VnBSS.get_dali_pipeline(batch_size=ex.hyptrs.batch_size,
                                               num_threads=cpu_count(),
                                               device_id=1,
                                               dataset=train_ds,
                                               seed=-1,
                                               debug=DEBUG,
                                               resize=ex.hyptrs.resize)
                print('Building DALI pipeline... (This may take  a while)')
                pipe.build()
                print('Done!')
                train_loader = dali_processor(
                    DALIGenericIterator([pipe], output_map=['sp1', 'sp2', 'spm', 'sk', 'vd', 'ad1', 'ad2', 'index'],
                                        size=-1))
                                        # size=pipe.epoch_size('video')))
                # for i,_ in enumerate(train_loader):
                #     if i == 5:
                #         break
            # torch.backends.cudnn.enabled = False
            trainer.run_epoch(train_loader, 'train', metrics=['loss'], send=send)
            torch.cuda.empty_cache()

[screenshot: table of GPU memory usage, from nvidia-smi, recorded at different points during training]

With regard to your question "How are you doing the free and reset of DALI?": I was assuming that by overwriting the variables pipe and train_loader everything would be OK. So it's not an OOM issue. The code runs for 3 epochs and then crashes. Modifying the code according to those posts:

                    pipe._pipe = None
                    del pipe
                    del train_loader
                    torch.cuda.empty_cache()
                    gc.collect()

frees more memory on cuda:1. Is there a way to tell DALI to ignore cuda:0 (without using CUDA_VISIBLE_DEVICES, since I need PyTorch to be aware of it)?

Some more weird data: if I reduce the batch size by 1/2, it happens one epoch later. If I reduce it by 1/4, it happens in epoch 16.

So is there a way to pass CUDA_VISIBLE_DEVICES only to DALI? (os.environ doesn't work for either torch or DALI.)

JuanFMontesinos commented 4 years ago

Another hint is that setting torch.backends.cudnn.enabled = False solves the issue (but hurts the speed).

BTW: Running everything on GPU 0 solves the issue, but I don't really know why.

klecki commented 4 years ago

DALI should stick to using the device that you provided; we set the current device and reset it back with https://github.com/NVIDIA/DALI/blob/master/include/dali/core/device_guard.h
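As a rough Python analogue of that RAII guard (illustrative only; the real implementation is the C++ class linked above):

from contextlib import contextmanager
import torch

@contextmanager
def device_guard(device_id):
    # Remember whichever device is current, switch for the scoped work, then restore it.
    prev = torch.cuda.current_device()
    torch.cuda.set_device(device_id)
    try:
        yield
    finally:
        torch.cuda.set_device(prev)

# e.g. everything inside this block runs with GPU 1 as the current device:
# with device_guard(1):
#     ...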

Theoretically all calls should be using the device that you provided, so there should be no need to hide any devices from DALI.

Some more weird data: if I reduce the batch size by 1/2, it happens one epoch later. If I reduce it by 1/4, it happens in epoch 16.

This looks like it's scaling with the batch size and would indicate some OOM. Are you sure you're not accumulating the data somewhere? If you somehow keep the tensors obtained from DALI, they might not be freed.

Can you check your GPU memory occupancy? Something like:

nvidia-smi --query-gpu=name,memory.used,utilization.memory --format=csv -l 1

can be helpful for logging the memory used by each GPU.

Can you try to prepare some minimal example that reproduces the issue, so we could try analyzing it? The best would be if you could repro it just by passing some dummy data through a basic DALI pipeline and copying it to some simple torch nn for consumption (one that probably needs to use cuDNN somehow).
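For reference, the skeleton of such a repro might look roughly like this (a sketch only, not the code from this issue: DummyPipe, the shapes, and the epoch loop are placeholders):

import numpy as np
import torch
import torch.nn as nn
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
from nvidia.dali.plugin.pytorch import DALIGenericIterator

BATCH_SIZE = 4
ITERS_PER_EPOCH = 16


class DummyPipe(Pipeline):
    """Feeds random host arrays through ExternalSource and copies them to the GPU."""

    def __init__(self, batch_size, num_threads, device_id):
        super(DummyPipe, self).__init__(batch_size, num_threads, device_id, seed=42)
        self._bs = batch_size
        self.source = ops.ExternalSource()

    def define_graph(self):
        self.images = self.source()
        return self.images.gpu()

    def iter_setup(self):
        batch = [np.random.rand(3, 64, 64).astype(np.float32) for _ in range(self._bs)]
        self.feed_input(self.images, batch)


net = nn.Conv2d(3, 8, 3).cuda(0)  # cuDNN convolution lives on GPU 0

for epoch in range(10):
    # rebuild the pipeline every epoch, as in the training loop above
    pipe = DummyPipe(BATCH_SIZE, num_threads=2, device_id=1)  # DALI on GPU 1
    pipe.build()
    loader = DALIGenericIterator([pipe], output_map=['img'],
                                 size=BATCH_SIZE * ITERS_PER_EPOCH)
    for data in loader:
        x = data[0]['img']       # tensor produced on cuda:1
        x = x.cpu().cuda(0)      # hop through the host, as in the original setup
        net(x).sum().backward()  # exercise the cuDNN backward pass
    pipe._pipe = None            # drop the backend before rebuilding next epoch
    del pipe, loader
    torch.cuda.empty_cache()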

JuanFMontesinos commented 4 years ago

I'm going to give up. I tried to reproduce it at the very beginning, but it seems to involve cuDNN. It's a bit strange that it runs perfectly (without strange paraphernalia like gc and the memory workarounds) if I use cuda:0 for both DALI and PyTorch. The GPU log is reported in the table above (that's GPU usage according to nvidia-smi at different times, obtained by setting breakpoints). I've been watching the GPU usage the whole time with nvidia-smi -n 0 and it doesn't look like an OOM (it's never close to the GPU's max memory). I think there is some sort of bug with this hardware. I realized some time ago that, on this computer, copying tensors cuda:1 -> cuda:0 was making the tensors allocated on cuda:0 come out zeroed. I was using a bypass by going cuda:1 --> cpu --> cuda:0 here, but I'm wondering if that can be wrong too.

Anyway, I can confirm that building the pipe takes about 10 MB from the unused GPU (nvidia-smi shows memory usage going from 2 MB to 12 MB), and this doesn't occur if that GPU is hidden via CUDA_VISIBLE_DEVICES.

It's not a DALI question, but is there any test I can run to check whether allocation between devices is OK?

Thank you for your time anyway. If I discover something else I will let you know.

JanuszL commented 4 years ago

Hi, I think the problem may come from memory fragmentation (but this is just my guess).

Anyway, I can confirm that building the pipe takes about 10 MB from the unused GPU (nvidia-smi shows memory usage going from 2 MB to 12 MB), and this doesn't occur if that GPU is hidden via CUDA_VISIBLE_DEVICES.

Can you provide a minimal repro for that? Does it happen when you build the pipeline or when you create the iterator (when you create the iterator, PyTorch may create some context on GPU 0, as we import it there)? Another thing that comes to my mind is that maybe some operator is misbehaving and allocates something on GPU 0.
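One way to narrow that down would be to read GPU 0's memory counter around each step, e.g. with the pynvml package (a sketch only; pynvml is an extra dependency, and pipe plus the output_map keys are assumed to come from the earlier snippets):

import pynvml
from nvidia.dali.plugin.pytorch import DALIGenericIterator

pynvml.nvmlInit()
gpu0 = pynvml.nvmlDeviceGetHandleByIndex(0)

def gpu0_used_mb():
    # memory currently used on GPU 0, in MiB
    return pynvml.nvmlDeviceGetMemoryInfo(gpu0).used / 2**20

print('baseline:       ', gpu0_used_mb())
pipe.build()                                    # `pipe` constructed as in the snippets above
print('after build():  ', gpu0_used_mb())
loader = DALIGenericIterator([pipe],
                             output_map=['sp1', 'sp2', 'spm', 'sk', 'vd', 'ad1', 'ad2', 'index'],
                             size=-1)
print('after iterator: ', gpu0_used_mb())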

klecki commented 4 years ago

I think there is some sort of bug with this hardware. I realized some time ago that, on this computer, copying tensors cuda:1 -> cuda:0 was making the tensors allocated on cuda:0 come out zeroed. I was using a bypass by going cuda:1 --> cpu --> cuda:0 here, but I'm wondering if that can be wrong too.

According to https://discuss.pytorch.org/t/how-does-pytorch-transfer-data-between-gpus/83954, .to('cuda:0') does a device-to-device copy of the tensor. Did you try that, and does it produce any weird results? It's not a concrete idea, but maybe there is some race between the D2D copy and the data being produced on some PyTorch stream (which the zeros might suggest).
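To rule out a missing synchronization, one thing worth trying (just a suggestion, not a confirmed fix) is to synchronize the producing device before the direct copy:

import torch

a = torch.rand(7, device='cuda:1')
torch.cuda.synchronize(1)   # make sure the data is fully materialized on cuda:1 first
b = a.to('cuda:0')          # direct device-to-device copy
torch.cuda.synchronize(0)   # wait for the copy to complete before reading
print(a)
print(b)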

JuanFMontesinos commented 4 years ago

Hi, thank you very much for helping.

import torch

a = torch.rand(7).cuda(1)
b = a.cuda()  # same effect with a.to('cuda:0')
torch.cuda.synchronize()
c = b.cpu()

print(a)
print(b)
print(b.sum())

So this seems to produce an issue:

tensor([0.8375, 0.2387, 0.0349, 0.9849, 0.6205, 0.1031, 0.8084],
       device='cuda:1')
tensor([0., 0., 0., 0., 0., 0., 0.], device='cuda:0')
tensor(0., device='cuda:0')

Process finished with exit code 0

I asked about it when I realized, and they think it can be a hardware defect. The issue doesn't occur the other way around (from GPU 0 to GPU 1).

Anyway, I've been trying to reproduce it and I cannot. I used roughly the same operators and none of them seem to be the problem. I'll let you know if I discover anything else.

Thank you very much

JanuszL commented 4 years ago

As it is not DALI-related, you can try to report it to the PyTorch developers and see how that goes.

JuanFMontesinos commented 4 years ago

Yep, I know. I'm just mentioning it since you may have heard of it. They have no clue and assume a hardware error. Anyway, thanks for your help.