NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

DALI Numba Plugin #2176

Open gartnera opened 4 years ago

gartnera commented 4 years ago

I've hacked together a quick prototype which demonstrates how you could use numba cfunc to process data.

Numba is a JIT compiler that translates Python into native code. Its cfunc feature lets you compile a function, get a pointer to it, and share that pointer with C/C++.

Here's an overview:

Change the function definition and see how min/max/std change.

gartnera/dali-numba-plugin

I plan on developing this further, but I'm curious whether this is something you'd be interested in merging into the DALI repo once it is more mature.

klecki commented 4 years ago

Hi @gartnera, this looks really cool, and we would be happy to review and merge such a contribution. It has great potential to bridge the gap between custom Python Ops and native C++ Operators. Regards, Krzysztof.

JanuszL commented 4 years ago

I agree, it looks super cool and would be a fine extension to the Python function functionality.

banasraf commented 4 years ago

Hello @gartnera! That's an excellent idea for a contribution to DALI. Our current solutions for custom operators lack either simplicity (writing a plugin) or performance (all kinds of PythonFunction operators). I imagine that in some cases, when a user requires some very specific data augmentations in their pipelines, this feature might be game-changing.

It's great to hear you are interested in developing it further. I thought about some things that should be addressed to make it a full-blown DALI operator:

Universality: Such an operator should handle a varying number of inputs and outputs with different data types and shapes. It's not straightforward to extend your prototype to support that. One thing is handling multiple inputs - that can probably be done by passing the data as void** (an array of pointers to input samples). The same goes for passing shapes and dtypes. All of these could perhaps be packed into some convenient structure.
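To make the "convenient structure" idea concrete, here is a minimal sketch using Python's ctypes, assuming one descriptor per input sample. The `SampleDesc` struct, its field names, and the dtype codes are all hypothetical; nothing here is an existing DALI type, and the real operator may pack things differently.

```python
import ctypes

# Made-up dtype codes, for illustration only (not DALI's enum).
DTYPE_UINT8, DTYPE_INT32 = 0, 1


class SampleDesc(ctypes.Structure):
    """Hypothetical descriptor for a single input sample."""
    _fields_ = [
        ("data", ctypes.c_void_p),                  # raw address of the sample
        ("ndim", ctypes.c_int32),                   # number of dimensions
        ("shape", ctypes.POINTER(ctypes.c_int64)),  # extent of each dimension
        ("dtype", ctypes.c_int32),                  # dtype code (see above)
    ]


def pack_inputs(samples):
    """Pack (buffer, shape, dtype_code) triples into an array of SampleDesc.

    Returns the descriptor array plus a list that keeps the underlying
    buffers and shape arrays alive while the descriptors are in use.
    """
    keepalive = []
    descs = (SampleDesc * len(samples))()
    for i, (buf, shape, dtype) in enumerate(samples):
        shape_arr = (ctypes.c_int64 * len(shape))(*shape)
        keepalive.append((buf, shape_arr))
        descs[i].data = ctypes.addressof(buf)
        descs[i].ndim = len(shape)
        descs[i].shape = ctypes.cast(shape_arr, ctypes.POINTER(ctypes.c_int64))
        descs[i].dtype = dtype
    return descs, keepalive


# Example: an "image" and a "label" passed together as two inputs.
image = (ctypes.c_uint8 * 6)(*range(6))
label = (ctypes.c_int32 * 1)(7)
descs, keep = pack_inputs([
    (image, (2, 3), DTYPE_UINT8),
    (label, (1,), DTYPE_INT32),
])
```

A user function would then receive a pointer to the first descriptor and the number of inputs, and read each sample's pointer, shape, and dtype from it.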

Receiving outputs from the user-defined cfunc is another story, though. In the prototype, an output is assumed to have the same size as an input and is preallocated. We would probably like to lift that assumption, so the question arises - how much memory should we allocate for the outputs of a custom cfunc? We could force a user to pass an argument that says how big the outputs are, but perhaps we can avoid that somehow. Maybe you have some ideas on how to approach this? Let us know.

That said, we don't have to start with an operator that has all the features we can imagine. We can have an operator that covers only some of the use cases but with a design that allows extending it in the future.

Simplicity: This ought to be an operator that gives quite good performance, so we can sacrifice some of the straightforwardness of the PythonFunction operators, but we can still make it as user-friendly as possible. For example, if we pass that many parameters to the custom function (data, shapes, dtypes), then maybe we can provide some helper functions that extract carrays from the raw arguments.
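As a rough illustration of what such a helper could look like, the sketch below turns a raw address, a shape, and a dtype code into a typed view. It uses ctypes as a stand-in rather than Numba itself, and the names `view_sample`, `flat_index`, and the dtype codes are hypothetical.

```python
import ctypes
from math import prod

# Hypothetical mapping from dtype codes to element types.
CTYPES_BY_CODE = {0: ctypes.c_uint8, 1: ctypes.c_int32, 2: ctypes.c_float}


def view_sample(address, shape, dtype_code):
    """Return a flat typed array that aliases the memory at `address`.

    No data is copied: writes through the view modify the original buffer.
    """
    ctype = CTYPES_BY_CODE[dtype_code]
    return (ctype * prod(shape)).from_address(address)


def flat_index(shape, index):
    """Convert a multi-dimensional index to a row-major flat offset."""
    offset = 0
    for extent, i in zip(shape, index):
        offset = offset * extent + i
    return offset


# Example: treat 6 raw bytes as a 2x3 uint8 sample and write element (1, 2).
pixels = (ctypes.c_uint8 * 6)(*range(6))
view = view_sample(ctypes.addressof(pixels), (2, 3), 0)
view[flat_index((2, 3), (1, 2))] = 99
```

With helpers like these, the user function deals with indexed arrays instead of raw pointers and shape arithmetic.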

And again, making it super easy to use might not be the first priority, and it's OK to start with something more cumbersome to show that it's feasible to have such a feature in DALI.

Anyway, even though it still needs further development before we can be sure how it fits into DALI, it seems very promising. I will be happy to help you with making this operator (it should have a name - you can propose something) a part of DALI. Feel free to share any thoughts or questions.

Regards, Rafał

gartnera commented 4 years ago

Thanks for the thoughts/feedback. There's lots of stuff I'm not sure how to do, so it will take a bit of fiddling. Ultimately I'd like to see arbitrary dtype + shape input and output.

One thing is handling multiple inputs - that can probably be done by passing the data as void** (an array of pointers to input samples).

Yeah, I specifically want to reference both the data and the label in the function, so I'll be trying to figure this out.

Receiving outputs from the user-defined cfunc is another story, though. In the prototype, an output is assumed to have the same size as an input and is preallocated. We would probably like to lift that assumption, so the question arises - how much memory should we allocate for the outputs of a custom cfunc? We could force a user to pass an argument that says how big the outputs are, but perhaps we can avoid that somehow. Maybe you have some ideas on how to approach this? Let us know.

It would be helpful if/when you want to convert precision. Maybe require the user to provide their c_sig to the operator for inspection. But that doesn't help if the num_elements changes. Maybe just some static factor (growth_multiplier=8 when converting from uint8 to uint64, growth_multiplier=3 when going from grayscale to RGB). But I'd also like the ability to reduce the size, maybe with a negative growth_multiplier (growth_multiplier=-3 when going from RGB to grayscale) or maybe with a different variable (reduction_multiplier).
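One way to sketch the static-factor idea without a negative multiplier or a second variable is a single rational scale factor. This is only an assumption about how it could be expressed; the `output_nbytes` helper and its `scale` parameter are hypothetical names, not part of the prototype.

```python
from fractions import Fraction


def output_nbytes(input_nbytes, scale):
    """Compute the output buffer size from the input size and a scale factor.

    `scale` may be an int (growth) or a Fraction below 1 (reduction).
    """
    scale = Fraction(scale)
    if scale <= 0:
        raise ValueError("scale must be positive")
    result = input_nbytes * scale
    if result.denominator != 1:
        raise ValueError("input size is not evenly divisible by the factor")
    return int(result)


# uint8 -> a 4-byte element type: every element grows 4x.
widened = output_nbytes(100, 4)
# grayscale -> RGB: three channels per pixel.
rgb = output_nbytes(100, 3)
# RGB -> grayscale: one third of the input size.
gray = output_nbytes(300, Fraction(1, 3))
```

A fixed factor like this covers dtype and channel changes, but not outputs whose size depends on the data, which is where a user-provided size function comes in.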

I don't see any way to malloc inside a cfunc, which would probably be the most robust option since the user could just calculate the size themselves. Another thought is to have the user provide another cfunc that calculates the expected size.

it should have a name - you can propose something

Not sure if it should really be called NumbaOp because all it really does is call an arbitrary function pointer. But maybe if it becomes more dependent on numba features.

banasraf commented 4 years ago

I like the idea of a separate cfunc to determine output sizes. It's actually what our operators do in SetupImpl. Such a cfunc could set the output shapes and dtypes and actually be called in SetupImpl. Also, if we want an output that preserves the input shape but has a different type, an output_dtype argument would be enough. It's probably a good idea to offer multiple ways to set the output size - from the simple and quite specific (output shape and/or output dtype parameters) to the more convoluted but generic, like a separate function.
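The two-function flow could look roughly like the sketch below: a setup callback reports the output shape, the caller allocates accordingly, and a run callback fills the buffer. It uses ctypes callbacks in place of Numba cfuncs, and the `SetupFn` and `RunFn` signatures are assumptions made for this example, not the operator's actual API.

```python
import ctypes

Int64Ptr = ctypes.POINTER(ctypes.c_int64)
UInt8Ptr = ctypes.POINTER(ctypes.c_uint8)

# Hypothetical callback signatures.
# setup(out_shape, in_shape, ndim): writes the output shape.
SetupFn = ctypes.CFUNCTYPE(None, Int64Ptr, Int64Ptr, ctypes.c_int32)
# run(out_data, in_data, num_in): fills the preallocated output.
RunFn = ctypes.CFUNCTYPE(None, UInt8Ptr, UInt8Ptr, ctypes.c_int64)


@SetupFn
def rgb_setup(out_shape, in_shape, ndim):
    # Same shape as the input, except the channel dimension becomes 3.
    for d in range(ndim):
        out_shape[d] = in_shape[d]
    out_shape[ndim - 1] = 3


@RunFn
def rgb_run(out_data, in_data, num_in):
    # Replicate each grayscale value into three channels.
    for p in range(num_in):
        for c in range(3):
            out_data[3 * p + c] = in_data[p]


def run_operator(setup, run, in_buf, in_shape):
    ndim = len(in_shape)
    in_shape_arr = (ctypes.c_int64 * ndim)(*in_shape)
    out_shape_arr = (ctypes.c_int64 * ndim)()
    setup(out_shape_arr, in_shape_arr, ndim)   # 1. ask for the output shape
    out_size = 1
    for extent in out_shape_arr:
        out_size *= extent
    out_buf = (ctypes.c_uint8 * out_size)()    # 2. allocate the output
    run(out_buf, in_buf, len(in_buf))          # 3. run the user function
    return list(out_shape_arr), list(out_buf)


gray = (ctypes.c_uint8 * 4)(10, 20, 30, 40)    # a 2x2x1 grayscale sample
out_shape, out_data = run_operator(rgb_setup, rgb_run, gray, (2, 2, 1))
```

The simpler options mentioned above (a fixed output shape or an output_dtype argument) would amount to the caller supplying a built-in setup step instead of a user-written one.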

Also, as you say, this operator's implementation is actually very generic because it just calls a function through the pointer it was given. If you define an API that such a function should conform to, the use of this operator might go beyond just Numba cfunc.
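The point that only the address and an agreed signature matter can be shown with a small sketch. Here a ctypes callback stands in for a compiled function (a Numba cfunc exposes its pointer similarly through its `address` attribute); the `SampleFn` signature and `call_by_address` helper are hypothetical.

```python
import ctypes

FloatPtr = ctypes.POINTER(ctypes.c_float)

# Hypothetical API that every user function must conform to:
# fn(out, inp, n) processes n float elements.
SampleFn = ctypes.CFUNCTYPE(None, FloatPtr, FloatPtr, ctypes.c_int64)


@SampleFn
def double_it(out, inp, n):
    for i in range(n):
        out[i] = inp[i] * 2.0


# All the "operator" receives is this plain integer address.
address = ctypes.cast(double_it, ctypes.c_void_p).value


def call_by_address(address, values):
    """Rebuild a callable from a raw address and invoke it."""
    fn = SampleFn(address)
    n = len(values)
    inp = (ctypes.c_float * n)(*values)
    out = (ctypes.c_float * n)()
    fn(out, inp, n)
    return list(out)


result = call_by_address(address, [1.0, 2.5, -0.5])
```

Any producer of a conforming function pointer (Numba, a C library, another JIT) could be plugged in the same way, as long as the original function object stays alive while the address is in use.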

jantonguirao commented 3 years ago

Hi @gartnera.

We wanted to let you know that we think this feature will be very useful to our users, and we have decided to start working on it. You can take a look at the work in progress in this PR, which is based on your original proposal. If you have any comments or suggestions, we'd be very interested to hear them.