NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Inplace operator support #4533

Open appearancefnp opened 1 year ago

appearancefnp commented 1 year ago

Hello, I wanted to ask whether it is possible to create in-place operations. I have a pretty big DALI pipeline (in terms of image size) and I have to preprocess data, but each operation creates a copy of the data, which results in a DALI preprocessing pipeline with around 8 GB of memory consumption.

DALI version: 1.22.0dev

My neural network takes 3 images as input, with an input shape of batch×3×5000×10000.

The pipeline consists of these steps:

  1. 3 encoded 16-bit TIFF images (900 MB)
  2. nvidia.dali.fn.experimental.decoders.image (900 MB)
  3. nvidia.dali.fn.transpose (900 MB)
  4. nvidia.dali.fn.cast (1,800 MB)
  5. division operator (1,800 MB)
  6. nvidia.dali.fn.stack (1,800 MB)

Altogether this takes around 8.1 GB of GPU memory just for pre-processing.
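
For reference, a minimal sketch of what such a pipeline might look like (the input names, decoder settings, and normalization constant are my assumptions for illustration, not the exact production setup):

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=1, num_threads=4, device_id=0)
def preprocess():
    # Three separately encoded 16-bit TIFF inputs (hypothetical source names).
    encoded = [fn.external_source(name=f"image_{i}", device="cpu") for i in range(3)]
    # GPU decoding; dtype=UINT16 assumed here to keep the 16-bit range.
    decoded = [fn.experimental.decoders.image(e, device="mixed", dtype=types.UINT16)
               for e in encoded]
    # HWC -> CHW; each step below materializes a new buffer.
    transposed = [fn.transpose(d, perm=[2, 0, 1]) for d in decoded]
    # Cast to float32 (doubles the element size) and apply the division operator.
    normalized = [fn.cast(t, dtype=types.FLOAT) / 65535.0 for t in transposed]
    # Stack the three images into a single output tensor.
    return fn.stack(*normalized)
```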

I am using DALI with Triton Inference Server, and this is an issue because the TensorRT model only takes around 1 GB of memory while the pre-processing takes 8× as much. If some of the operations were in-place, it would greatly improve the memory usage server-side. Is there a plan or a way to enable this?

Thanks in advance

JanuszL commented 1 year ago

Hi @appearancefnp,

Thank you for raising this topic. Currently, the only in-place operators that are supported are the so-called pass-through operators, which change only the metadata but not the underlying memory (like reshape). We plan to reduce the memory usage inside the pipeline by reusing memory that is no longer needed by operators that have already been executed. @mzient can provide more details.
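
To make "pass-through" concrete, here is a rough sketch (my illustration, not from the thread): fn.reshape only rewrites shape/layout metadata, so its output aliases the input buffer instead of copying it.

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def pass_through_example():
    # Hypothetical GPU input fed via external_source.
    data = fn.external_source(name="input", device="gpu")
    # Metadata-only change: no new buffer is allocated for the output.
    relabeled = fn.reshape(data, layout="HWC")
    return relabeled
```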

mzient commented 1 year ago

Hello @appearancefnp

There's no immediate plan to support in-place operators. This has been considered, but even if they are ever supported, most of the operators you mentioned are not amenable to in-place execution. We've only ever considered it for pointwise operations which do not change the element size - things like arithmetic operators, color space conversion (assuming that the number of channels is preserved), brightness/contrast adjustment, affine transforms of point clouds (without projection / immersion), etc.
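
For concreteness, a rough sketch of the kind of pointwise, element-size-preserving operations meant here (illustrative only; none of these currently run in place):

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def pointwise_ops():
    # Hypothetical float32 image fed via external_source.
    img = fn.external_source(name="image", device="gpu")
    # Arithmetic operators: the output has the same shape and element size as the input.
    scaled = img * 2.0 - 1.0
    # Brightness/contrast adjustment: also pointwise and size-preserving.
    adjusted = fn.brightness_contrast(scaled, brightness=1.2, contrast=1.1)
    return adjusted
```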

Having said that, we do plan to have memory reuse along the pipeline - that is, when tensors are no longer used, they will be returned to the memory pool for immediate reuse. In this case you'd get something like:

| buffer (MB)  | step 1 | step 2 | step 3 | step 4 |
|--------------|-------:|-------:|-------:|-------:|
| input 1      | 300    |        |        |        |
| input 2      | 300    |        |        |        |
| input 3      | 300    |        |        |        |
| decoded 1    | 300    | 300    |        |        |
| decoded 2    | 300    | 300    |        |        |
| decoded 3    | 300    | 300    |        |        |
| transpose 1  |        | 300    | 300    |        |
| transpose 2  |        | 300    | 300    |        |
| transpose 3  |        | 300    | 300    |        |
| cast 1       |        |        | 600    | 600    |
| cast 2       |        |        | 600    | 600    |
| cast 3       |        |        | 600    | 600    |
| stack        |        |        |        | 1800   |
| total        | 1800   | 1800   | 2700   | 3600   |

So, the maximum amount of memory required would be 3.6 GB - or possibly 4.5 GB if we don't own the input buffers.

BTW - what do you need the transpose for? If I understand correctly, there are three channels stored separately - in that case, you can simply reinterpret the data as channel-first (or channel-less).
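
For example, assuming each decoded image is a single-channel H×W×1 plane (an assumption on my part, not confirmed in the thread), the transpose could be dropped and the channel dimension removed with a metadata-only squeeze; stacking the three planes then yields a channel-first tensor directly:

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=1, num_threads=4, device_id=0)
def preprocess_without_transpose():
    # Hypothetical source names; one single-channel 16-bit TIFF per channel.
    encoded = [fn.external_source(name=f"image_{i}", device="cpu") for i in range(3)]
    decoded = [fn.experimental.decoders.image(e, device="mixed",
                                              output_type=types.GRAY,
                                              dtype=types.UINT16)
               for e in encoded]
    # fn.squeeze is a pass-through operator: dropping the trailing channel
    # dimension (extent 1) changes only metadata, not the underlying memory.
    planes = [fn.squeeze(d, axes=[2]) for d in decoded]
    normalized = [fn.cast(p, dtype=types.FLOAT) / 65535.0 for p in planes]
    # Stacking the three H x W planes produces a 3 x H x W (channel-first) tensor.
    return fn.stack(*normalized)
```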