Hi @morgolock, can you please let me know when ACL 23.08 is coming? It would be great if you could share any details on the BF16 reorder feature. Thank you!
Hi @snadampal
23.08 is going to be out in the next few days.
We are going to look into your request to provide a function to convert and reorder the weights.
Thanks for the update, @morgolock. Does that mean the BF16 reorder feature is not going to be part of ACL 23.08, but the next release?
Hi @snadampal
That's correct, it won't be present in 23.08.
Hi @morgolock, can I expect it in ACL 23.11?
Hi @snadampal
We need to look at this and get a clear idea of the performance gains this work will bring. It's likely to be present in 24.02.
Regarding the figures you shared above, could you please let us know on what system you ran the model? What unit is used for the last value in each row? ms?
Hope this helps
Hi @morgolock, the platform was c7g.8xl, and the execution latency is in milliseconds. The dnnl logs also have the tensor shape info.
Platform: AWS Graviton3 based c7g.8xl
Operating System: Ubuntu 20.04
Hi @snadampal,
We have a PoC implementation of FP32->BF16 reorders in ACL, which you can find here; it is under review.
We have tested with the NLP models we have access to, and for deberta-large we saw a 10% performance improvement on 8 threads on AWS Graviton3, where the two reorders (from ab to BA8b4a and from ab to BA4b4a) for size 1024x1024 are 2.4 times faster than the jitted version from oneDNN.
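In case it helps while reviewing, here is a minimal scalar sketch of what an ab -> BA4b4a reorder computes, assuming oneDNN's usual tag convention (uppercase letters give the outer block order, `4b4a` the inner blocking). The function name is illustrative; the actual PR is vectorized and performs the FP32->BF16 conversion as part of the reorder.

```cpp
#include <cstdint>

// Scalar reference for the ab -> BA4b4a weight reorder, assuming
// oneDNN's tag convention: dims are (a, b), outer blocks are laid
// out B-major then A, and each block holds 4 elements along b by
// 4 along a. A and B are assumed multiples of 4 (true for the
// 1024x1024 sizes quoted above); a real kernel also handles tails.
void reorder_ab_to_BA4b4a(const uint16_t *src, uint16_t *dst,
                          int64_t A, int64_t B)
{
    const int64_t A_blks = A / 4;
    for (int64_t b = 0; b < B; ++b)
        for (int64_t a = 0; a < A; ++a)
        {
            const int64_t blk = (b / 4) * A_blks + (a / 4); // outer: B, then A
            const int64_t off = (b % 4) * 4 + (a % 4);      // inner: 4b, then 4a
            dst[blk * 16 + off] = src[a * B + b];           // src is row-major "ab"
        }
}
```

BA8b4a is the same idea with an 8-wide inner block along b: blk = (b / 8) * A_blks + (a / 4), off = (b % 8) * 4 + (a % 4), with 32 elements per block.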
Could you please check this PR on your side?
Thanks for the update, @renato-arantes. I will give it a try.
Hi @renato-arantes, sorry for the delay; I will try to provide feedback by the end of this week. By the way, these changes look different from the patches you've been maintaining in arm-tool-solutions (here). Are you planning to update them so there is an Arm docker build for validation? I will try to take the merged PRs directly to the latest TensorFlow and see if they are compatible.
Hi @renato-arantes, please go ahead and close this issue. I will add my observations once these are part of TF. Thank you!
Output of 'strings libarm_compute.so | grep arm_compute_version':
Used the Compute Library from the TensorFlow 2.12 release: ComputeLibrary-22.11. But TensorFlow carried ACL patches for fixed-format kernels, so the ACL codebase is close to ACL 23.05.
Platform: AWS Graviton3 based c7g.8xl
Operating System: Ubuntu 20.04
Expected behavior: ACL provides a Neon- or SVE-optimized weights reorder function for fastmath kernels (fp32->bf16 accelerated kernels).
Problem description: Currently there is no ACL implementation of an fp32-to-bf16 weights conversion and reordering kernel, so for fastmath kernels we are using the jitted reorder functions from frameworks like oneDNN. The overhead these reorders add outweighs the performance advantage of the ACL matmul GEMM kernels. The request is to provide a Neon- or SVE-optimized fp32-to-bf16 conversion and reordering kernel.
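As a sketch of the conversion half of the requested kernel: fp32 -> bf16 is a round-to-nearest-even truncation to the top 16 bits of the fp32 encoding. A minimal scalar reference follows (the function name is illustrative; a real kernel would vectorize this with Neon/SVE bf16 instructions and fuse it with the reorder pass).

```cpp
#include <cstdint>
#include <cstring>

// Scalar reference: convert one fp32 value to bf16 (stored as
// uint16_t) with round-to-nearest-even. A production kernel would
// also special-case NaN inputs, which the bias addition below can
// turn into Inf.
static inline uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    // Round to nearest even: add 0x7FFF plus the lowest surviving
    // bit, then keep the top 16 bits.
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>(bits >> 16);
}
```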
The following dnnl logs are from TensorFlow inference for a BERT sentiment analysis model. They show that the jitted reorder latencies (especially for the fully connected layer weights) are almost double the latency of the actual matmul GEMM kernels.
To reproduce the behavior, please run inference with any TF BERT sentiment analysis model from Hugging Face or TF Hub (with oneDNN verbose logging, e.g. ONEDNN_VERBOSE=1, enabled to capture the reorder timings).