Samsung / ONE

On-device Neural Engine

PR to upstream Compute Library about newly added ArmComputeEx kernels #4579

Open chunseoklee opened 4 years ago

chunseoklee commented 4 years ago

How about making PRs for the newly added ArmComputeEx kernels to the upstream Compute Library? I don't know much about today's ArmComputeEx kernels. Let's discuss here.

chunseoklee commented 4 years ago

cc @ragmani

daeinki commented 4 years ago

One item to upstream is CLReductionOperation (path: compute/ARMComputeEx/src/runtime/CL/functions/CLReduceOperation.cpp).

The upstream version of ARMCL doesn't support multiple axis values for this operation, so we cannot run inference for an in-house model (hand gesture) on the GPU without modification. I planned to fix this operation and upstream it, but I found out that ONE has already fixed it so that multiple axis values are handled.

I'm not sure whether the update handles all of the operations covered by CLReductionOperation correctly, such as ARG_IDX_MAX/MIN, MEAN_SUM, PROD, SUM_SQUARE, SUM, MIN, and MAX, but if this update could be upstreamed to the ARMCL mainline it would be very helpful for reducing maintenance cost.
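For reference, here is a minimal sketch of the basic idea (plain C++ over a flat row-major buffer, with made-up names, not the actual ONE or ARMCL code): a reduce-sum over multiple axes can be expressed as a chain of single-axis reductions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reduce-sum a row-major tensor along a single axis; shape is updated in
// place so the call can be chained.
std::vector<float> reduce_sum_axis(const std::vector<float> &in,
                                   std::vector<std::size_t> &shape,
                                   std::size_t axis)
{
  std::size_t outer = 1, inner = 1;
  for (std::size_t i = 0; i < axis; ++i) outer *= shape[i];
  for (std::size_t i = axis + 1; i < shape.size(); ++i) inner *= shape[i];
  const std::size_t dim = shape[axis];

  std::vector<float> out(outer * inner, 0.0f);
  for (std::size_t o = 0; o < outer; ++o)
    for (std::size_t d = 0; d < dim; ++d)
      for (std::size_t i = 0; i < inner; ++i)
        out[o * inner + i] += in[(o * dim + d) * inner + i];

  shape.erase(shape.begin() + axis); // the reduced axis disappears
  return out;
}

// Multi-axis reduction as repeated single-axis reduction.
std::vector<float> reduce_sum(std::vector<float> data,
                              std::vector<std::size_t> shape,
                              std::vector<std::size_t> axes)
{
  // Process axes from highest to lowest so erasing an axis does not
  // invalidate the indices of the axes still to be reduced.
  std::sort(axes.rbegin(), axes.rend());
  for (std::size_t axis : axes)
    data = reduce_sum_axis(data, shape, axis);
  return data;
}
```

Index-producing reductions such as ARG_IDX_MAX/MIN cannot be chained this simply, which is exactly the kind of case worth double-checking before upstreaming.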

Thanks.

ragmani commented 4 years ago

Here is a brief explanation of the ArmComputeEx kernels:

| Layer | Why it was introduced (v20.05) | Resolved on upstream Compute Library? | Additional explanation | Author |
| --- | --- | --- | --- | --- |
| CLArgMinMaxLayerEx | To fix the output type being set incorrectly | X | | @dr-venkman |
| CLBinaryLogicalOp, NEBinaryLogicalOp | To support broadcasting; CLLogicalAnd and NELogicalAnd don't support broadcasting | X | | sec@prasannar |
| CLCastBool, NECastBool | To fix results that differ from the TFLite interpreter | X | The TFLite interpreter converts bool values to 1, but CLCast and NECast of Compute Library convert them to 255 | @ragmani |
| CLEmbeddingLookup, CLHashtableLookup, NEEmbeddingLookup, NEHashtableLookup | There are no kernels for these operations | X | I don't know where these operations are needed | @ragmani |
| CLFullyConnectedLayerEx, NEFullyConnectedLayerEx | CLFullyConnectedLayer and NEFullyConnectedLayer don't support weights as inputs | X | CLFullyConnectedLayerEx and NEFullyConnectedLayerEx just disable weight caching | @ragmani |
| CLFullyConnectedHybridLayer, NEFullyConnectedHybridLayer | To support hybrid quantization | X | | @ragmani |
| CLGatherEx, NEGatherEx | To support 2D/3D indices | X | | CLGatherEx: @jyoungyun, NEGatherEx: @ragmani |
| CLInstanceNormalizationLayerKernelEx, NEInstanceNormalizationLayerKernelEx | To support the InstanceNormalization operation | O | | @ragmani |
| CLNeg | To support the int32 type; CLElementWiseUnaryLayer does not support int32 | X | | sec@prasannar |
| CLOneHot, NEOneHot | To support the OneHot operation | X | | @ragmani |
| CLReduceOperation, NEReduceOperation, NEReduceSum | To support multiple axis values | X | | @ragmani |
| CLSplitVEx | To support the SplitV operation | X | | @dr-venkman |
| CLTopKV2 | To support the TopKV2 operation | X | This is not a CL kernel | - |
| CLTransposeConvLayer, NETransposeConvLayer | To support invalid padding | X | CLDeconvolutionLayer and NEDeconvolutionLayer do not support invalid padding, e.g. negative padding | @hseok-oh, @ragmani |
| NEActivationLayerEx | To fix a bug in the Logistic operation | O (v20.05) | NEActivationLayer can produce erroneous results caused by vexpq_f32(): the NEON function returns NaN instead of INF for values outside the representable float range, and the NaN then corrupts this op's result | @ragmani |
| CLReduceOperationKernel | To fix a bug in the ReduceSum operation | - | | @ragmani |
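To make one of the rows above concrete, the CLCastBool/NECastBool difference is only about which integer a true value maps to. A rough scalar illustration (hypothetical helper names, not the actual kernels, and assuming bool tensors are backed by 8-bit storage):

```cpp
#include <cstdint>

// The TFLite interpreter maps a true value to 1, while a plain byte-wise
// cast keeps whatever non-zero pattern is stored (e.g. 255), so any
// arithmetic done on the resulting int32 tensor will differ.
int32_t cast_bool_tflite_style(uint8_t b) { return b != 0 ? 1 : 0; }
int32_t cast_bool_bytewise(uint8_t b) { return static_cast<int32_t>(b); }

// cast_bool_bytewise(255) == 255, but cast_bool_tflite_style(255) == 1
```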
daeinki commented 4 years ago

@ragmani Thanks for sharing this.

As of now, ArmComputeEx in ONE has a big diff compared to mainline, and the diff will grow as time goes on, which in turn will result in a high update cost. So we may need to reduce the diff by settling these changes on mainline from now on.

ragmani commented 4 years ago

> As of now, ArmComputeEx in ONE has a big diff compared to mainline, and the diff will grow as time goes on, which in turn will result in a high update cost. So we may need to reduce the diff by settling these changes on mainline from now on.

@daeinki The reason ArmComputeEx in ONE has big diffs is that clang-format was applied to ArmComputeEx. I'm not sure, but how about trying to apply https://github.com/ARM-software/ComputeLibrary/blob/master/.clang-format? It may reduce the diffs. Anyway, I will try to contribute some of these kernels to the upstream repo, though I'm not sure whether I will succeed in getting the changes merged upstream.

daeinki commented 4 years ago

> > As of now, ArmComputeEx in ONE has a big diff compared to mainline, and the diff will grow as time goes on, which in turn will result in a high update cost. So we may need to reduce the diff by settling these changes on mainline from now on.
>
> @daeinki The reason ArmComputeEx in ONE has big diffs is that clang-format was applied to ArmComputeEx. I'm not sure, but how about trying to apply https://github.com/ARM-software/ComputeLibrary/blob/master/.clang-format? It may reduce the diffs. Anyway, I will try to contribute some of these kernels to the upstream repo, though I'm not sure whether I will succeed in getting the changes merged upstream.

What I meant is not just a coding style difference; the changes in ArmComputeEx cannot be reused upstream as they are. Actually, I looked into the reduction layer of ArmComputeEx, and I think you may need to re-implement this layer from scratch on top of mainline ARMCL, because the two have totally different code, not just different coding style.

Just to share my own upstream work: I have already started upstreaming support for several ops to ARMNN, and int32 and int64 ArgMinMax op support has been merged to the ARMNN mainline as of now. I'm also preparing another patch set that adds support for a new CL layer to ARMCL.

Upstreaming won't be easy and will take a lot of time, but it is needed. TPL strongly recommends upstreaming changes whenever possible, because in the past we decided to clone several open source frameworks (Enlightenment, SystemD, and so on) due to upstream overhead. We didn't upstream our changes in time, and this is one of the big stains Tizen carries.

ragmani commented 4 years ago

@daeinki Thanks for sharing. That is useful information for upstreaming. Could you explain why you think I may need to re-implement the reduction layer? I would like to address the shortcomings of that layer. And could you share the upstream process and the upstream overhead of Compute Library based on your experience? It would reduce the overhead when I try to upstream.

daeinki commented 4 years ago

@ragmani

The reason is that the changes touch the mainline code too much, so I'm pretty sure they will be rejected as-is. If you want to upstream them as-is, you would also have to post another patch for ARMNN, because ARMNN is the primary user of ARMCL and would no longer work with the as-is patch. Even for upstreaming, you would need to discuss the patch with both sides, the ARMCL and ARMNN engineers. Simply put, your patch breaks compatibility.

I'd recommend keeping the original interface (function parameters, class members) and coding style as much as you can when you make a patch for upstream.
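As a hypothetical sketch of what keeping the original interface can look like in practice (made-up class and method names, not actual ARMCL code): new behaviour is added as an overload rather than by changing an existing signature, so existing callers such as ARMNN keep compiling and behaving the same way.

```cpp
#include <vector>

// Hypothetical layer: the single-axis configure() that existing callers
// depend on stays untouched; multi-axis support arrives as a new overload.
class SomeReductionLayer
{
public:
  void configure(int axis) { configure(std::vector<int>{axis}); } // original entry point
  void configure(const std::vector<int> &axes) { _axes = axes; }  // added capability

private:
  std::vector<int> _axes;
};
```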

The upstream process for ARMCL is pretty simple because they use a Gerrit review system, the same as Tizen. First of all, sign up for the Gerrit review system: https://review.mlplatform.org.

Then just git clone https://git.mlplatform.org/ml/ComputeLibrary.git, make a new patch, and post it to the Gerrit review system. To post a new patch, I use the commands below:

1. Add review.mlplatform.org as a remote repo:

       git remote add gerrithost https://review.mlplatform.org/ml/ComputeLibrary

2. Make the patch you want to post.
3. Post it when you are ready:

       git push gerrithost HEAD:refs/for/master

As for review, I will back you up.

Please feel free to ask me anytime if you need help.

ragmani commented 4 years ago

@daeinki Thanks for the information. I will start by making a simple patch after completing our process.