appearancefnp opened this issue 1 year ago
Hi @appearancefnp,
Thank you for raising this topic.
Currently, the only in-place operators supported are the so-called pass-through operators, which change only the metadata but not the underlying memory (like `reshape`).
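For illustration, a minimal sketch of what pass-through means (the `external_source` feed and the arguments are placeholders, not taken from this thread):

```python
from nvidia.dali import pipeline_def, fn

# Minimal sketch of a pass-through operator: fn.reshape rewrites the
# tensor's shape metadata without copying the underlying memory.
@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def reshape_pipe():
    img = fn.external_source(name="images", device="gpu")  # placeholder input
    flat = fn.reshape(img, shape=[-1])  # metadata-only change, no data copy
    return flat
```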
We plan to reduce the memory usage inside the pipeline by reusing memory that is no longer needed by operators that have already executed. @mzient can provide more details.
Hello @appearancefnp
There's no immediate plan to support in-place operators. This has been considered, but even if they are ever supported, most of the operators you mention are not amenable to in-place execution - we've only ever considered it for pointwise operations that don't change the element size: arithmetic operators, color space conversion (assuming the number of channels is preserved), brightness/contrast adjustment, affine transforms of point clouds (without projection/immersion), and the like.
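A sketch of those operation categories, with illustrative arguments (to be clear, none of these run in place today):

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def pointwise_pipe():
    img = fn.external_source(name="images", device="gpu")  # placeholder input
    # color space conversion that preserves the number of channels (RGB -> BGR)
    bgr = fn.color_space_conversion(img, image_type=types.RGB, output_type=types.BGR)
    # elementwise brightness/contrast adjustment
    out = fn.brightness_contrast(bgr, brightness=1.1, contrast=0.9)
    return out
```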
Having said that, we do plan to have memory reuse along the pipeline - that is, when tensors are no longer used, they will be returned to the memory pool for immediate reuse. In this case you'd get something like:
(all figures in MB)

| step | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| input 1 | 300 | | | |
| input 2 | 300 | | | |
| input 3 | 300 | | | |
| decoded 1 | 300 | 300 | | |
| decoded 2 | 300 | 300 | | |
| decoded 3 | 300 | 300 | | |
| transpose 1 | | 300 | 300 | |
| transpose 2 | | 300 | 300 | |
| transpose 3 | | 300 | 300 | |
| cast 1 | | | 600 | 600 |
| cast 2 | | | 600 | 600 |
| cast 3 | | | 600 | 600 |
| stack | | | | 1800 |
| total | 1800 | 1800 | 2700 | 3600 |
So, the maximum amount of memory required would be 3.6 GB - or possibly 4.5 GB if we don't own the input buffers.
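(For scale: the row values are consistent with 16-bit 3×5000×10000 images - an assumption, since the bit depth isn't stated in the thread - giving 3 × 5000 × 10000 × 2 B ≈ 300 MB per decoded image, 600 MB after casting to a 32-bit type, and 3 × 600 MB = 1800 MB for the stacked output.)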
BTW - what do you need the `transpose` for? If I understand correctly, there are three channels stored separately - in that case, you can simply reinterpret the data as channel-first (or channel-less).
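A minimal sketch of that reinterpretation, assuming the three planes are already contiguous in memory (the layout string is illustrative):

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def reinterpret_pipe():
    # placeholder: three channel planes assumed contiguous in memory
    img = fn.external_source(name="images", device="gpu")
    # relabel the layout instead of transposing: a metadata-only pass-through,
    # unlike fn.transpose, which physically rearranges the data
    chw = fn.reshape(img, layout="CHW")
    return chw
```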
Hello, I wanted to ask whether it is possible to create in-place operations. I have a pretty big DALI pipeline (in terms of image size) and I have to preprocess data, but each operation creates a copy of the data, which results in a DALI preprocessing pipeline with around 8 GB of memory consumption.
DALI version: 1.22.0dev
My neural network takes 3 images as input, each with shape batch×3×5000×10000.
The pipeline consists of these steps: decode, transpose, cast, and stack (a rough sketch follows below). This takes around 8.1 GB of GPU memory just for pre-processing.
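For reference, a rough, hypothetical sketch of such a pipeline - the operator arguments and input names are illustrative, not taken from the actual deployment:

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=1, num_threads=4, device_id=0)
def preprocess_pipe():
    # three encoded inputs, fed e.g. from Triton (names are illustrative)
    encoded = [fn.external_source(name=f"image_{i}") for i in range(3)]
    # decode on the GPU ("mixed" = CPU parsing + GPU decoding)
    decoded = [fn.decoders.image(e, device="mixed") for e in encoded]
    # HWC -> CHW for the network; this materializes a full copy
    chw = [fn.transpose(d, perm=[2, 0, 1]) for d in decoded]
    # cast to floating point; another full-size copy
    floats = [fn.cast(c, dtype=types.FLOAT) for c in chw]
    # stack the three images along a new axis; yet another copy
    return fn.stack(*floats, axis=0)
```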
I am using DALI with Triton Inference Server, and this is an issue because the TensorRT model takes only around 1 GB of memory while the pre-processing takes 8× as much. If some of the operations were in-place, it would greatly improve the memory usage server-side. Is there a plan or a way to enable this?
Thanks in advance