Samsung / ONE

On-device Neural Engine

PR to upstream Compute Library about newly added ArmComputeEx kernels #4579

Open chunseoklee opened 4 years ago

chunseoklee commented 4 years ago

How about making PRs for the newly added ArmComputeEx kernels to the upstream Compute Library? I don't know much about today's ArmComputeEx kernels. Let's discuss here.

chunseoklee commented 4 years ago

cc @ragmani

daeinki commented 4 years ago

One item to upstream is CLReductionOperation (path: compute/ARMComputeEx/src/runtime/CL/functions/CLReduceOperation.cpp).

The upstream version of ARMCL doesn't support multiple axis values for this operation, so we cannot run inference for an in-house model (hand gesture) on the GPU without modification. I planned to fix this operation and upstream it, but I found out that ONE has already fixed it so that multiple axis values are handled.

I'm not sure whether the update handles all of the operations covered by CLReductionOperation correctly, such as ARG_IDX_MAX/MIN, MEAN_SUM, PROD, SUM_SQUARE, SUM, MIN, and MAX, but if this update could be upstreamed to the ARMCL mainline it would be very helpful for reducing maintenance cost.
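For reference, here is a minimal sketch of the basic idea (plain C++ over a flat row-major buffer, with made-up names, not the actual ONE or ARMCL code): a reduce-sum over multiple axes can be expressed as a chain of single-axis reductions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reduce-sum a row-major tensor along a single axis; shape is updated in
// place so the call can be chained.
std::vector<float> reduce_sum_axis(const std::vector<float> &in,
                                   std::vector<std::size_t> &shape,
                                   std::size_t axis)
{
  std::size_t outer = 1, inner = 1;
  for (std::size_t i = 0; i < axis; ++i) outer *= shape[i];
  for (std::size_t i = axis + 1; i < shape.size(); ++i) inner *= shape[i];
  const std::size_t dim = shape[axis];

  std::vector<float> out(outer * inner, 0.0f);
  for (std::size_t o = 0; o < outer; ++o)
    for (std::size_t d = 0; d < dim; ++d)
      for (std::size_t i = 0; i < inner; ++i)
        out[o * inner + i] += in[(o * dim + d) * inner + i];

  shape.erase(shape.begin() + axis); // the reduced axis disappears
  return out;
}

// Multi-axis reduction as repeated single-axis reduction.
std::vector<float> reduce_sum(std::vector<float> data,
                              std::vector<std::size_t> shape,
                              std::vector<std::size_t> axes)
{
  // Process axes from highest to lowest so erasing an axis does not
  // invalidate the indices of the axes still to be reduced.
  std::sort(axes.rbegin(), axes.rend());
  for (std::size_t axis : axes)
    data = reduce_sum_axis(data, shape, axis);
  return data;
}
```

Index-producing reductions such as ARG_IDX_MAX/MIN cannot be chained this simply, which is exactly the kind of case worth double-checking before upstreaming.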

Thanks.

ragmani commented 4 years ago

Here is a brief explanation of the ArmComputeEx kernels:

| Layer | Why it was introduced (v20.05) | Resolved on upstream Compute Library? | Additional explanation | Author |
| --- | --- | --- | --- | --- |
| CLArgMinMaxLayerEx | To fix the output type being set incorrectly | X | | @dr-venkman |
| CLBinaryLogicalOp, NEBinaryLogicalOp | To support broadcasting; CLLogicalAnd and NELogicalAnd don't support broadcasting | X | | sec@prasannar |
| CLCastBool, NECastBool | To fix results that differ from the TFLite interpreter | X | The TFLite interpreter converts bool values to 1, but CLCast and NECast of Compute Library convert them to 255 | @ragmani |
| CLEmbeddingLookup, CLHashtableLookup, NEEmbeddingLookup, NEHashtableLookup | There are no kernels for these operations | X | I don't know where these operations are needed | @ragmani |
| CLFullyConnectedLayerEx, NEFullyConnectedLayerEx | CLFullyConnectedLayer and NEFullyConnectedLayer don't support weights as inputs | X | CLFullyConnectedLayerEx and NEFullyConnectedLayerEx just disable weight caching | @ragmani |
| CLFullyConnectedHybridLayer, NEFullyConnectedHybridLayer | To support hybrid quantization | X | | @ragmani |
| CLGatherEx, NEGatherEx | To support 2D/3D indices | X | | CLGatherEx: @jyoungyun, NEGatherEx: @ragmani |
| CLInstanceNormalizationLayerKernelEx, NEInstanceNormalizationLayerKernelEx | To support the InstanceNormalization operation | O | | @ragmani |
| CLNeg | To support the int32 type; CLElementWiseUnaryLayer does not support int32 | X | | sec@prasannar |
| CLOneHot, NEOneHot | To support the OneHot operation | X | | @ragmani |
| CLReduceOperation, NEReduceOperation, NEReduceSum | To support multiple axis values | X | | @ragmani |
| CLSplitVEx | To support the SplitV operation | X | | @dr-venkman |
| CLTopKV2 | To support the TopKV2 operation | X | This is not a CL kernel | - |
| CLTransposeConvLayer, NETransposeConvLayer | To support invalid padding | X | CLDeconvolutionLayer and NEDeconvolutionLayer do not support invalid padding, e.g. negative padding | @hseok-oh, @ragmani |
| NEActivationLayerEx | To fix a bug in the Logistic operation | O (v20.05) | NEActivationLayer can produce erroneous results caused by vexpq_f32(): the NEON function returns NaN instead of INF for values outside the representable float range, and the NaN then corrupts this op's result | @ragmani |
| CLReduceOperationKernel | To fix a bug in the ReduceSum operation | - | | @ragmani |
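To make one of the rows above concrete, the CLCastBool/NECastBool difference is only about which integer a true value maps to. A rough scalar illustration (hypothetical helper names, not the actual kernels, and assuming bool tensors are backed by 8-bit storage):

```cpp
#include <cstdint>

// The TFLite interpreter maps a true value to 1, while a plain byte-wise
// cast keeps whatever non-zero pattern is stored (e.g. 255), so any
// arithmetic done on the resulting int32 tensor will differ.
int32_t cast_bool_tflite_style(uint8_t b) { return b != 0 ? 1 : 0; }
int32_t cast_bool_bytewise(uint8_t b) { return static_cast<int32_t>(b); }

// cast_bool_bytewise(255) == 255, but cast_bool_tflite_style(255) == 1
```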
daeinki commented 4 years ago

@ragmani Thanks for sharing this.

As of now, ArmComputeEx in ONE has a big diff compared to mainline, and the diff will grow as time goes on, which in turn will result in a high update cost. So we may need to reduce the diff by settling these changes on mainline from now on.

ragmani commented 4 years ago

> As of now, ArmComputeEx in ONE has a big diff compared to mainline, and the diff will grow as time goes on, which in turn will result in a high update cost. So we may need to reduce the diff by settling these changes on mainline from now on.

@daeinki The reason ArmComputeEx in ONE has big diffs is that clang-format was applied to ArmComputeEx. I'm not sure, but how about trying to apply https://github.com/ARM-software/ComputeLibrary/blob/master/.clang-format? It may reduce the diffs. Anyway, I will try to contribute some of these kernels to the upstream repo, though I'm not sure whether I will succeed in getting the changes merged upstream.

daeinki commented 4 years ago

> > As of now, ArmComputeEx in ONE has a big diff compared to mainline, and the diff will grow as time goes on, which in turn will result in a high update cost. So we may need to reduce the diff by settling these changes on mainline from now on.
>
> @daeinki The reason ArmComputeEx in ONE has big diffs is that clang-format was applied to ArmComputeEx. I'm not sure, but how about trying to apply https://github.com/ARM-software/ComputeLibrary/blob/master/.clang-format? It may reduce the diffs. Anyway, I will try to contribute some of these kernels to the upstream repo, though I'm not sure whether I will succeed in getting the changes merged upstream.

What I meant is not just a coding style difference; the changes in ArmComputeEx cannot be reused upstream as they are. Actually, I looked into the reduction layer of ArmComputeEx, and I think you may need to re-implement this layer from scratch on top of mainline ARMCL, because the two have totally different code, not just different coding style.

Just to share my own upstream work: I have already started upstreaming support for several ops to ARMNN, and int32 and int64 ArgMinMax op support has been merged to the ARMNN mainline as of now. I'm also preparing another patch set that adds support for a new CL layer to ARMCL.

Upstreaming won't be easy and will take a lot of time, but it is needed. TPL strongly recommends upstreaming changes whenever possible, because in the past we decided to clone several open source frameworks (Enlightenment, SystemD, and so on) due to upstream overhead. We didn't upstream our changes in time, and this is one of the big stains Tizen carries.

ragmani commented 4 years ago

@daeinki Thanks for sharing. That is useful information for upstreaming. Could you explain why you think I may need to re-implement the reduction layer? I would like to address the shortcomings of that layer. And could you share the upstream process and the upstream overhead of Compute Library based on your experience? It would reduce the overhead when I try to upstream.

daeinki commented 4 years ago

@ragmani

The reason is that the changes touch the mainline code too much, so I'm pretty sure they will be rejected as-is. If you want to upstream them as-is, you would also have to post another patch for ARMNN, because ARMNN is the primary user of ARMCL and would no longer work with the as-is patch. Even for upstreaming, you would need to discuss the patch with both sides, the ARMCL and ARMNN engineers. Simply put, your patch breaks compatibility.

I'd recommend keeping the original interface (function parameters, class members) and coding style as much as you can when you make a patch for upstream.
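As a hypothetical sketch of what keeping the original interface can look like in practice (made-up class and method names, not actual ARMCL code): new behaviour is added as an overload rather than by changing an existing signature, so existing callers such as ARMNN keep compiling and behaving the same way.

```cpp
#include <vector>

// Hypothetical layer: the single-axis configure() that existing callers
// depend on stays untouched; multi-axis support arrives as a new overload.
class SomeReductionLayer
{
public:
  void configure(int axis) { configure(std::vector<int>{axis}); } // original entry point
  void configure(const std::vector<int> &axes) { _axes = axes; }  // added capability

private:
  std::vector<int> _axes;
};
```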

The upstream process for ARMCL is pretty simple because they use a Gerrit review system, the same as Tizen. First of all, sign up for the Gerrit review system: https://review.mlplatform.org.

Then just git clone https://git.mlplatform.org/ml/ComputeLibrary.git, make a new patch, and post it to the Gerrit review system. To post a new patch, I use the commands below:

1. Add review.mlplatform.org as a remote repo:

       git remote add gerrithost https://review.mlplatform.org/ml/ComputeLibrary

2. Make the patch you want to post.
3. Post it when you are ready:

       git push gerrithost HEAD:refs/for/master

As for review, I will back you up.

Please feel free to ask me anytime if you need help.

ragmani commented 4 years ago

@daeinki Thanks for the information. I will start by making a simple patch after completing our process.