appearancefnp opened this issue 1 year ago
Hi @appearancefnp,
Thank you for raising this topic.
Currently, the only in-place operators supported are the so-called pass-through operators, which change only the metadata but not the underlying memory (like `reshape`).
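For illustration, a minimal sketch of what pass-through means (the `external_source` feed and the arguments are placeholders, not taken from this thread):

```python
from nvidia.dali import pipeline_def, fn

# Minimal sketch of a pass-through operator: fn.reshape rewrites the
# tensor's shape metadata without copying the underlying memory.
@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def reshape_pipe():
    img = fn.external_source(name="images", device="gpu")  # placeholder input
    flat = fn.reshape(img, shape=[-1])  # metadata-only change, no data copy
    return flat
```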
We plan to reduce the memory usage inside the pipeline by reusing memory that is no longer needed by operators that have already executed. @mzient can provide more details.
Hello @appearancefnp
There's no immediate plan to support in-place operators. This has been considered, but even if they are ever supported, most of the operators you mention are not amenable to in-place execution - we've only ever considered it for pointwise operations that don't change the element size: arithmetic operators, color space conversion (assuming the number of channels is preserved), brightness/contrast adjustment, affine transforms of point clouds (without projection/immersion), and the like.
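A sketch of those operation categories, with illustrative arguments (to be clear, none of these run in place today):

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def pointwise_pipe():
    img = fn.external_source(name="images", device="gpu")  # placeholder input
    # color space conversion that preserves the number of channels (RGB -> BGR)
    bgr = fn.color_space_conversion(img, image_type=types.RGB, output_type=types.BGR)
    # elementwise brightness/contrast adjustment
    out = fn.brightness_contrast(bgr, brightness=1.1, contrast=0.9)
    return out
```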
Having said that, we do plan to have memory reuse along the pipeline - that is, when tensors are no longer used, they will be returned to the memory pool for immediate reuse. In this case you'd get something like:
(all figures in MB)

| step | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| input 1 | 300 | | | |
| input 2 | 300 | | | |
| input 3 | 300 | | | |
| decoded 1 | 300 | 300 | | |
| decoded 2 | 300 | 300 | | |
| decoded 3 | 300 | 300 | | |
| transpose 1 | | 300 | 300 | |
| transpose 2 | | 300 | 300 | |
| transpose 3 | | 300 | 300 | |
| cast 1 | | | 600 | 600 |
| cast 2 | | | 600 | 600 |
| cast 3 | | | 600 | 600 |
| stack | | | | 1800 |
| total | 1800 | 1800 | 2700 | 3600 |
So, the maximum amount of memory required would be 3.6 GB - or possibly 4.5 GB if we don't own the input buffers.
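(For scale: the row values are consistent with 16-bit 3×5000×10000 images - an assumption, since the bit depth isn't stated in the thread - giving 3 × 5000 × 10000 × 2 B ≈ 300 MB per decoded image, 600 MB after casting to a 32-bit type, and 3 × 600 MB = 1800 MB for the stacked output.)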
BTW - what do you need the `transpose` for? If I understand correctly, there are three channels stored separately - in that case, you can simply reinterpret the data as channel-first (or channel-less).
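A minimal sketch of that reinterpretation, assuming the three planes are already contiguous in memory (the layout string is illustrative):

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def reinterpret_pipe():
    # placeholder: three channel planes assumed contiguous in memory
    img = fn.external_source(name="images", device="gpu")
    # relabel the layout instead of transposing: a metadata-only pass-through,
    # unlike fn.transpose, which physically rearranges the data
    chw = fn.reshape(img, layout="CHW")
    return chw
```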
Hello, I wanted to ask whether it is possible to create in-place operations. I have a pretty big DALI pipeline (in terms of image size) and I have to preprocess data, but each operation creates a copy of the data, which results in a DALI preprocessing pipeline with around 8 GB of memory consumption.
DALI version: 1.22.0dev
My neural network takes 3 images as input, each with shape batch×3×5000×10000.
The pipeline consists of these steps: decode, transpose, cast, and stack (a rough sketch follows below). This takes around 8.1 GB of GPU memory just for pre-processing.
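For reference, a rough, hypothetical sketch of such a pipeline - the operator arguments and input names are illustrative, not taken from the actual deployment:

```python
from nvidia.dali import pipeline_def, fn, types

@pipeline_def(batch_size=1, num_threads=4, device_id=0)
def preprocess_pipe():
    # three encoded inputs, fed e.g. from Triton (names are illustrative)
    encoded = [fn.external_source(name=f"image_{i}") for i in range(3)]
    # decode on the GPU ("mixed" = CPU parsing + GPU decoding)
    decoded = [fn.decoders.image(e, device="mixed") for e in encoded]
    # HWC -> CHW for the network; this materializes a full copy
    chw = [fn.transpose(d, perm=[2, 0, 1]) for d in decoded]
    # cast to floating point; another full-size copy
    floats = [fn.cast(c, dtype=types.FLOAT) for c in chw]
    # stack the three images along a new axis; yet another copy
    return fn.stack(*floats, axis=0)
```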
I am using DALI with Triton Inference Server, and this is an issue because the TensorRT model takes only around 1 GB of memory while the pre-processing takes 8× as much. If some of the operations were in-place, it would greatly improve the memory usage server-side. Is there a plan or a way to enable this?
Thanks in advance