lilh9598 opened this issue 1 year ago
Hi @lilh9598,
Yes, you're entirely correct - there is currently no ACL acceleration for `dnnl_sgemm`, and it could be added.
At present it should be possible to use the Arm Performance Libraries CBLAS interface, via the oneDNN build option `ONEDNN_BLAS_VENDOR`, to provide optimised kernels for this route (see https://github.com/oneapi-src/oneDNN/blob/3db871474283976ef5e80f6fecbb10e8856100c7/doc/build/build_options.md).
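For reference, a build configuration along these lines should pick up that route - a sketch only: the `ONEDNN_BLAS_VENDOR` option name comes from the linked build-options doc, but the `ARMPL` value and the rest of the invocation are illustrative assumptions, so check the doc for your oneDNN version.

```shell
# Sketch: configure oneDNN to route its BLAS calls to
# Arm Performance Libraries (value/paths are assumptions -
# consult doc/build/build_options.md for your version).
cmake .. \
  -DONEDNN_BLAS_VENDOR=ARMPL \
  -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)"
```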
The integration of matmul support via ACL was motivated by the ops used in a small number of use cases which we had been exposed to - the BERT-large network from MLCommons, for example. This led to the integration of ACL matmul support via `mkl_batch_matmul_op.cc`.
Integration into `dnnl_sgemm` is on our 'to do' list, but I don't have a timescale at present. It will most likely appear as an experimental feature here, in Tool-Solutions, ahead of being available upstream though.
One issue with using ACL for sgemm support, though, is that ACL is not a BLAS library, and does not provide all the functionality expected of one. In particular, there's no support for transposed operands at present. So supporting `dnnl_sgemm` via ACL would require a workaround - possibly rejecting the transpose case, or adding a re-order, via oneDNN's re-order primitive, before passing to ACL. In the longer term, there's a case for adding transpose support to ACL, but there are no firm plans to do this at present.
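To make the re-order workaround concrete, here's a minimal numpy sketch of the idea - not oneDNN or ACL code, just an illustration: when the caller requests a transposed operand but the backend only handles plain row-major matmul, the transpose is materialised ("re-ordered") into contiguous memory first, then the plain path is used.

```python
import numpy as np

def sgemm_no_trans_backend(a, b):
    """Stand-in for a backend (like ACL) that only supports
    plain, non-transposed row-major matmul."""
    return a @ b

def sgemm_with_reorder(transa, transb, a, b):
    """Sketch of the workaround: materialise ('re-order') any
    transposed operand into plain contiguous layout, then
    dispatch to the transpose-free backend."""
    if transa:
        a = np.ascontiguousarray(a.T)  # explicit re-order: extra copy
    if transb:
        b = np.ascontiguousarray(b.T)
    return sgemm_no_trans_backend(a, b)
```

In oneDNN, the re-order primitive would play the role of `np.ascontiguousarray` here; the cost of the workaround is the extra memory traffic for the copy.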
Could I ask what models you're interested in, where you think `dnnl_sgemm` support would bring tangible benefits? This would help us to understand the potential requirement for `dnnl_sgemm` support, and provide valuable test/benchmark cases.
Hi @nSircombe ,
Thanks for your detailed reply!
> Could I ask what models you're interested in, where you think the dnnl_sgemm support would bring tangible benefits? This would help us to understand the potential requirement for dnnl_sgemm support, and provide valuable test/benchmark cases.
I think the same problem occurs in the transformer block. In its structure, the node following a matmul (with constant weights) is not always a BiasAdd, so it cannot be converted into `_FusedMatMul` (which has the ACL optimisation). Maybe the unoptimised matmul could be handled through graph optimisation in TensorFlow?
Hi @lilh9598,
Actually, one of the team has been looking at transformer models, and managed to spend a little time looking at the 'missing' gemm API support. We have an approach not entirely dissimilar to your suggestion - diverting the call to batched matmul, where there's existing support via ACL. We're hoping to get a proof-of-concept patch into the build here soon, and will look to raise an upstream PR with TensorFlow if all looks good.
I'll let you know once we've got something up you can experiment with, any feedback would be very welcome.
Hi @nSircombe ,
Looks really good! I am looking forward to applying your new experiments to my cases.
Hi @lilh9598,
@fadara01 has just upstreamed a patch here - https://github.com/ARM-software/Tool-Solutions/pull/174
This will pass calls heading into oneDNN's `gemm_api` into `BatchMatMulMkl`, which has ACL support. We've seen some significant performance improvements from this approach, and would be very interested to know if you see any gains (or problems) when you try it with your workflows.
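For anyone following along, the idea behind the patch can be sketched in a few lines of numpy - this is an illustration of the dispatch trick, not the actual TensorFlow/oneDNN code: a single 2-D GEMM is re-expressed as a batched matmul with batch size 1, so the existing ACL-backed batched path can serve it. `batch_matmul` below is a hypothetical stand-in for `BatchMatMulMkl`.

```python
import numpy as np

def batch_matmul(a, b):
    """Stand-in for BatchMatMulMkl, which already has an
    ACL-accelerated path for 3-D batched matmul inputs."""
    assert a.ndim == 3 and b.ndim == 3
    return np.matmul(a, b)

def sgemm_via_batch_matmul(a, b):
    """Sketch of the patch's approach: wrap a 2-D GEMM as a
    batch-size-1 batched matmul, then unwrap the result."""
    out = batch_matmul(a[np.newaxis, ...], b[np.newaxis, ...])
    return out[0]
```

The appeal of this route is that no new kernels are needed - the 2-D case simply rides on the batched path that is already accelerated.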
Hi @nSircombe , I'm sorry to report that there is no performance improvement with the patch https://github.com/ARM-software/Tool-Solutions/pull/174 compared to Eigen in my C++ application.
Also, is this patch enabled in the image "armswdev/tensorflow-arm-neoverse:latest"? I would like to write a Python program for further performance profiling.
Could you give more details about what you're running? oneDNN verbose logs would be handy - they might help us spot missed opportunities to call ACL, or highlight shapes where ACL performance is worse than expected.
This patch is not yet in the dockerhub images - they're updated from this repo once a month. The next update is due next week.
Our application is very complex, so I need to extract the ML part to analyse first. The bottleneck there may be bandwidth. If any performance issues are found, we would be glad to sync with Arm.
Ok @lilh9598 - would be interested to hear what you find.
On the matmul front though, just a redacted version of the stdout log generated with the environment variable `ONEDNN_VERBOSE=1` set, with and without the patch from https://github.com/ARM-software/Tool-Solutions/pull/174 included in the image build, would give us valuable insight into whether the matmul changes in that patch are being exercised by your workflow. If they're not, we may be able to identify more cases we could support; if they are, then we can look at the bare performance for the shapes in question and see if we're leaving anything on the table.
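As a rough illustration of what we'd do with such a log, here's a small Python sketch that aggregates execution time per primitive/implementation from `ONEDNN_VERBOSE=1` output. The comma-separated field layout assumed here (primitive in the fourth field, implementation in the fifth, time in ms in the last) follows the documented verbose format, but the sample lines below are purely illustrative, not real output.

```python
from collections import defaultdict

def summarise_verbose(lines):
    """Aggregate total execution time (ms) per (primitive, implementation)
    pair from ONEDNN_VERBOSE=1 output. Assumes the comma-separated
    verbose layout: marker,exec,engine,primitive,impl,...,time_ms."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 5 or fields[1] != "exec":
            continue  # skip 'create' events and non-verbose lines
        primitive, impl = fields[3], fields[4]
        try:
            totals[(primitive, impl)] += float(fields[-1])
        except ValueError:
            continue  # last field wasn't a time - ignore the line
    return dict(totals)

# Illustrative (not real) verbose lines:
sample = [
    "onednn_verbose,exec,cpu,matmul,gemm:acl,undef,...,128x768:768x768,0.45",
    "onednn_verbose,exec,cpu,matmul,gemm:jit,undef,...,128x768:768x3072,1.20",
    "onednn_verbose,exec,cpu,matmul,gemm:acl,undef,...,128x768:768x768,0.55",
]
```

Comparing the per-implementation totals with and without the patch would show directly whether the ACL-backed matmul path is being hit, and for which shapes.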
Look forward to hearing from you.
...I'll leave this issue open for now.
The code for calling oneDNN's matmul in TensorFlow:
The code for `dnnl_sgemm` in oneDNN:
From the above code, we can see that if CBLAS is not enabled, an unoptimised matmul will be called by TensorFlow on AArch64, which causes performance degradation. So I think a fully optimised ACL matmul should be added to `dnnl_sgemm`, to make full use of AArch64's ISA and improve the performance of `mkl_matmul` (a TF op) on AArch64.