lilh9598 opened this issue 1 year ago
Hi @lilh9598,
Yes, you're entirely correct - there is currently no ACL acceleration for `dnnl_sgemm`, and it could be added.
At present it should be possible to use the Arm Performance Libraries CBLAS interface, via the oneDNN build option `ONEDNN_BLAS_VENDOR`, to provide optimised kernels for this route (see https://github.com/oneapi-src/oneDNN/blob/3db871474283976ef5e80f6fecbb10e8856100c7/doc/build/build_options.md).
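For reference, a build configuration along these lines should pick up that route - a sketch only: the `ONEDNN_BLAS_VENDOR` option name comes from the linked build-options doc, but the `ARMPL` value and the rest of the invocation are illustrative assumptions, so check the doc for your oneDNN version.

```shell
# Sketch: configure oneDNN to route its BLAS calls to
# Arm Performance Libraries (value/paths are assumptions -
# consult doc/build/build_options.md for your version).
cmake .. \
  -DONEDNN_BLAS_VENDOR=ARMPL \
  -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
```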
The integration of matmul support via ACL was motivated by the ops used in a small number of use cases which we had been exposed to - the BERT-large network from MLCommons, for example. This led to the integration of ACL matmul support via `mkl_batch_matmul_op.cc`.
Integration into `dnnl_sgemm` is on our 'to do' list, but I don't have a timescale at present. It will most likely appear as an experimental feature here, in Tool-Solutions, ahead of being available upstream though.
One issue with using ACL for sgemm support, though, is that ACL is not a BLAS library, and does not provide all the functionality expected of one. In particular, there's no support for transposed operands at present. So supporting `dnnl_sgemm` via ACL would require a workaround - possibly rejecting the transpose case, or adding a re-order, via oneDNN's re-order primitive, before passing to ACL. In the longer term, there's a case for adding transpose support to ACL, but there are no firm plans to do this at present.
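To make the re-order workaround concrete, here's a minimal numpy sketch of the idea - not oneDNN or ACL code, just an illustration: when the caller requests a transposed operand but the backend only handles plain row-major matmul, the transpose is materialised ("re-ordered") into contiguous memory first, then the plain path is used.

```python
import numpy as np

def sgemm_no_trans_backend(a, b):
    """Stand-in for a backend (like ACL) that only supports
    plain, non-transposed row-major matmul."""
    return a @ b

def sgemm_with_reorder(transa, transb, a, b):
    """Sketch of the workaround: materialise ('re-order') any
    transposed operand into plain contiguous layout, then
    dispatch to the transpose-free backend."""
    if transa:
        a = np.ascontiguousarray(a.T)  # explicit re-order: extra copy
    if transb:
        b = np.ascontiguousarray(b.T)
    return sgemm_no_trans_backend(a, b)
```

In oneDNN, the re-order primitive would play the role of `np.ascontiguousarray` here; the cost of the workaround is the extra memory traffic for the copy.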
Could I ask what models you're interested in, where you think `dnnl_sgemm` support would bring tangible benefits? This would help us to understand the potential requirement for `dnnl_sgemm` support, and provide valuable test/benchmark cases.
Hi @nSircombe ,
Thanks for your detailed reply!
> Could I ask what models you're interested in, where you think the dnnl_sgemm support would bring tangible benefits? This would help us to understand the potential requirement for dnnl_sgemm support, and provide valuable test/benchmark cases.
I think the same problem occurs in the transformer block. In its structure, the node following a matmul (with constant weights) is not always a BiasAdd, so it cannot be converted into `_FusedMatMul` (which has the ACL optimisation). Maybe the unoptimised matmul could be handled through graph optimisation in TensorFlow?
Hi @lilh9598,
Actually, one of the team has been looking at transformer models, and managed to spend a little time looking at the 'missing' gemm API support. We have an approach not entirely dissimilar to your suggestion - diverting the call to batched matmul, where there's existing support via ACL. We're hoping to get a proof-of-concept patch into the build here soon, and will look to raise an upstream PR with TensorFlow if all looks good.
I'll let you know once we've got something up you can experiment with, any feedback would be very welcome.
Hi @nSircombe ,
Looks really good! I am looking forward to applying your new experiments to my cases.
Hi @lilh9598,
@fadara01 has just upstreamed a patch here - https://github.com/ARM-software/Tool-Solutions/pull/174
This will pass calls heading into oneDNN's `gemm_api` into `BatchMatMulMkl`, which has ACL support. We've seen some significant performance improvements from this approach, and would be very interested to know if you see any gains (or problems) when you try it with your workflows.
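For anyone following along, the idea behind the patch can be sketched in a few lines of numpy - this is an illustration of the dispatch trick, not the actual TensorFlow/oneDNN code: a single 2-D GEMM is re-expressed as a batched matmul with batch size 1, so the existing ACL-backed batched path can serve it. `batch_matmul` below is a hypothetical stand-in for `BatchMatMulMkl`.

```python
import numpy as np

def batch_matmul(a, b):
    """Stand-in for BatchMatMulMkl, which already has an
    ACL-accelerated path for 3-D batched matmul inputs."""
    assert a.ndim == 3 and b.ndim == 3
    return np.matmul(a, b)

def sgemm_via_batch_matmul(a, b):
    """Sketch of the patch's approach: wrap a 2-D GEMM as a
    batch-size-1 batched matmul, then unwrap the result."""
    out = batch_matmul(a[np.newaxis, ...], b[np.newaxis, ...])
    return out[0]
```

The appeal of this route is that no new kernels are needed - the 2-D case simply rides on the batched path that is already accelerated.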
Hi @nSircombe , I'm sorry to report that there is no performance improvement with the patch https://github.com/ARM-software/Tool-Solutions/pull/174 compared to Eigen in my C++ application.
Also, is this patch enabled in the image "armswdev/tensorflow-arm-neoverse:latest"? I would like to write a Python program for further performance profiling.
Could you give more details about what you're running? oneDNN verbose logs would be handy - they might help us spot missed opportunities to call ACL, or highlight shapes where ACL performance is worse than expected.
This patch is not yet in the dockerhub images - they're updated from this repo once a month. The next update is due next week.
Our application is very complex, so I need to extract the ML part to analyse first. The bottleneck there may be bandwidth. If any performance issues are found, we would be glad to sync with Arm.
Ok @lilh9598 - would be interested to hear what you find.
On the matmul front though, just a redacted version of the stdout log generated with the environment variable `ONEDNN_VERBOSE=1` set, with and without the patch from https://github.com/ARM-software/Tool-Solutions/pull/174 included in the image build, would give us valuable insight into whether the matmul changes in that patch are being exercised by your workflow. If they're not, we may be able to identify more cases we could support; if they are, then we can look at the bare performance for the shapes in question and see if we're leaving anything on the table.
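As a rough illustration of what we'd do with such a log, here's a small Python sketch that aggregates execution time per primitive/implementation from `ONEDNN_VERBOSE=1` output. The comma-separated field layout assumed here (primitive in the fourth field, implementation in the fifth, time in ms in the last) follows the documented verbose format, but the sample lines below are purely illustrative, not real output.

```python
from collections import defaultdict

def summarise_verbose(lines):
    """Aggregate total execution time (ms) per (primitive, implementation)
    pair from ONEDNN_VERBOSE=1 output. Assumes the comma-separated
    verbose layout: marker,exec,engine,primitive,impl,...,time_ms."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 5 or fields[1] != "exec":
            continue  # skip 'create' events and non-verbose lines
        primitive, impl = fields[3], fields[4]
        try:
            totals[(primitive, impl)] += float(fields[-1])
        except ValueError:
            continue  # last field wasn't a time - ignore the line
    return dict(totals)

# Illustrative (not real) verbose lines:
sample = [
    "onednn_verbose,exec,cpu,matmul,gemm:acl,undef,...,128x768:768x768,0.45",
    "onednn_verbose,exec,cpu,matmul,gemm:jit,undef,...,128x768:768x3072,1.20",
    "onednn_verbose,exec,cpu,matmul,gemm:acl,undef,...,128x768:768x768,0.55",
]
```

Comparing the per-implementation totals with and without the patch would show directly whether the ACL-backed matmul path is being hit, and for which shapes.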
Look forward to hearing from you.
...I'll leave this issue open for now.
The code for calling oneDNN's matmul in TensorFlow:
The code for `dnnl_sgemm` in oneDNN:
From the above code, we can see that if CBLAS is not enabled, an unoptimised matmul will be called by TensorFlow on AArch64, which causes performance degradation. So I think a fully optimised ACL matmul should be added to `dnnl_sgemm`, to make full use of AArch64's ISA and improve the performance of `mkl_matmul` (a TF op) on AArch64.