Samsung / ONE

On-device Neural Engine

[onert] Hidden switching mechanism for on-device compiler #13288

Open hseok-oh opened 4 months ago

hseok-oh commented 4 months ago

What?

We can consider a hidden switching mechanism: the runtime allocates a backend automatically to meet a user requirement such as best performance or lowest memory usage. For that, the runtime needs a mechanism to infer which backend to allocate based on the target machine environment. The runtime should then quantize the model and run code generation automatically according to the backend allocation.
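A minimal sketch of the backend choice described above, written in Python only for brevity. Every name here (`BackendProfile`, `choose_backend`, the preference strings) and the cost numbers are hypothetical placeholders for illustration, not part of the onert API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackendProfile:
    """Hypothetical description of one backend on the target machine."""
    name: str
    relative_latency: float  # lower means faster (illustrative numbers only)
    relative_memory: float   # lower means less memory (illustrative numbers only)


def choose_backend(available, preference):
    """Pick the backend that best matches the user's stated preference."""
    if not available:
        raise ValueError("no backend available on this target")
    if preference == "performance":
        return min(available, key=lambda b: b.relative_latency)
    if preference == "memory":
        return min(available, key=lambda b: b.relative_memory)
    raise ValueError(f"unknown preference: {preference!r}")


# Illustrative profiles for a target that exposes a CPU and an NPU backend.
PROFILES = [
    BackendProfile("cpu", relative_latency=1.0, relative_memory=1.0),
    BackendProfile("npu", relative_latency=0.2, relative_memory=1.5),
]
```

With these made-up profiles, `choose_backend(PROFILES, "performance")` selects the NPU entry and `choose_backend(PROFILES, "memory")` selects the CPU entry. How the real runtime would estimate such costs is left open here.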

As a first step, we can consider a hidden switching mechanism for the on-device compiler: the model is assigned to the NPU backend by the user's API call, and that NPU requires specific quantization and code generation. The runtime should therefore use the on-device compiler's quantizer and code generator automatically to produce the NPU binary.
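The first step could look roughly like the sketch below: once a backend is assigned, the runtime looks up what that backend requires and runs the quantizer and code generator without further user calls. The requirement table, the `"uint8"` and `"npu-binary"` labels, and the stub functions are assumptions made for illustration; they stand in for the on-device compiler and do not reflect its real interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackendRequirement:
    """Hypothetical per-backend requirements; None means the step is skipped."""
    quantization: Optional[str]
    codegen_target: Optional[str]


# Illustrative table: the CPU backend runs the model as-is, the NPU backend
# needs a quantized model and generated code. The labels are placeholders.
REQUIREMENTS = {
    "cpu": BackendRequirement(quantization=None, codegen_target=None),
    "npu": BackendRequirement(quantization="uint8", codegen_target="npu-binary"),
}


def stub_quantize(model_path, qtype):
    """Stand-in for the on-device quantizer; returns a derived path only."""
    return f"{model_path}.{qtype}"


def stub_codegen(model_path, target):
    """Stand-in for the on-device code generator; returns a derived path only."""
    return f"{model_path}.{target}"


def prepare_model(model_path, backend, quantize=stub_quantize, codegen=stub_codegen):
    """Run only the steps the assigned backend requires, in order."""
    req = REQUIREMENTS[backend]
    steps = []
    if req.quantization is not None:
        model_path = quantize(model_path, req.quantization)
        steps.append("quantize")
    if req.codegen_target is not None:
        model_path = codegen(model_path, req.codegen_target)
        steps.append("codegen")
    return model_path, steps
```

The point of the sketch is that the user only names the backend; whether quantization and code generation run is decided from the backend's requirements.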

Why?

It will let users share the same simple API usage across many different target environments.


I looked at your draft. If I understand correctly, it fully covers the hidden switching mechanism for odc? Could you please clarify whether there are any uncovered tasks that I could take on?

Originally posted by @Torrero in https://github.com/Samsung/ONE/issues/12903#issuecomment-2179225952

Torrero commented 4 months ago

In this case, should the runtime know the preferred quantization options for the target backend, or will the user provide this information? I think there should be an option to enable the hidden switching mechanism (maybe it should be added to ExecutionOptions) for when the runtime tries to get an NPU binary automatically.
There should also be a default count for minmax statistics collection, i.e. the number of inferences required for quantization. After quantization, the accuracy of the fcircle and the qcircle should be compared automatically to detect accuracy degradation. If the degradation is within the acceptable limit (a default value or one provided by the user), we can continue to code generation. Otherwise we should report the failed quantization to the user.
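A rough sketch of the check I have in mind, in Python for brevity. The option names, the default values, and the use of mean absolute error as the accuracy metric are all assumptions for illustration; none of them are existing fields of ExecutionOptions or existing runtime behavior.

```python
from dataclasses import dataclass


@dataclass
class AutoCompileOptions:
    """Hypothetical options for the hidden switching mechanism."""
    enabled: bool = True
    minmax_records: int = 10  # placeholder default for inferences to record
    max_error: float = 0.05   # placeholder acceptable accuracy degradation


class QuantizationRejected(Exception):
    """Raised so the user is told why code generation was not started."""


def ready_to_quantize(recorded_inferences, options):
    """True once enough minmax statistics have been collected."""
    return options.enabled and recorded_inferences >= options.minmax_records


def mean_abs_error(reference, candidate):
    """Mean absolute difference between two equally sized output lists."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("outputs must be non-empty and the same length")
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


def accuracy_gate(float_outputs, quant_outputs, options):
    """Compare fcircle and qcircle outputs; raise if degradation is too large."""
    error = mean_abs_error(float_outputs, quant_outputs)
    if error > options.max_error:
        raise QuantizationRejected(
            f"accuracy degradation {error:.4f} exceeds limit {options.max_error:.4f}"
        )
    return error
```

Code generation would start only if `accuracy_gate` returns; on `QuantizationRejected` the message is what gets reported back to the user.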

Torrero commented 3 months ago

@hseok-oh Hello,

I prepared a preliminary draft of the odc hidden switching mechanism; could you review it, please? It covers the fcircle-to-qcircle step.
I tested it with a conv2d model and also with mobilenet.