ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

ACL implementation for fp32 to bf16 conversion and weights pre-packing (reordering) is missing #1060

Closed: snadampal closed this issue 4 months ago

snadampal commented 11 months ago

Output of 'strings libarm_compute.so | grep arm_compute_version':

Used the Compute Library from the TensorFlow 2.12 release (ComputeLibrary 22.11). However, TensorFlow carried ACL patches for fixed-format kernels, so the ACL codebase is close to ACL 23.05.

Platform: AWS Graviton3 based c7g.8xl

Operating System: Ubuntu 20.04

Expected behavior: ACL should provide a Neon- or SVE-optimized weights-reorder function for fastmath kernels (fp32->bf16 accelerated kernels).

Problem description: ACL currently has no implementation of an fp32-to-bf16 weights conversion and reordering kernel. For fastmath kernels we therefore fall back to the jitted reorder functions from frameworks such as oneDNN, but the reorder overhead is significant and outweighs the performance advantage of the ACL matmul GEMM kernels. The request is for a Neon- or SVE-optimized version of the fp32-to-bf16 conversion and reordering kernel.
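To make the request concrete, here is a minimal scalar sketch (not ACL or oneDNN code) of what such a kernel computes: round-to-nearest-even fp32-to-bf16 conversion combined with a reorder from row-major `ab` into oneDNN's `BA4b4a` blocked layout. It assumes the matrix dimensions are multiples of 4 and ignores NaN handling for brevity.

```cpp
#include <cstdint>
#include <cstring>

static inline uint16_t f32_to_bf16(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    // Round to nearest even: add 0x7FFF plus the LSB of the kept mantissa.
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>(bits >> 16);
}

// src: row-major (ab) K x N fp32 weights; dst: bf16 in BA4b4a blocked order.
void reorder_ab_to_BA4b4a(const float *src, uint16_t *dst, int K, int N)
{
    const int KB = K / 4, NB = N / 4;          // K, N assumed multiples of 4
    for (int nb = 0; nb < NB; ++nb)            // B: outer loop over column blocks
        for (int kb = 0; kb < KB; ++kb)        // A: then over row blocks
            for (int ni = 0; ni < 4; ++ni)     // 4b: columns within a block
                for (int ki = 0; ki < 4; ++ki) // 4a: rows within a block
                {
                    const int k = kb * 4 + ki, n = nb * 4 + ni;
                    *dst++ = f32_to_bf16(src[k * N + n]);
                }
}
```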

The following oneDNN verbose logs are from TensorFlow inference on a BERT sentiment-analysis model. They show that the jitted reorder latencies (especially for the fully connected layer weights) are almost double the latency of the actual matmul GEMM kernels.

onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab:f0 dst_bf16::blocked:BA4b4a:f0,attr-fpmath:bf16 ,,1024x1024,0.843018
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab:f0 dst_bf16::blocked:BA4b4a:f0,attr-fpmath:bf16 ,,1024x1024,0.850098
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab:f0 dst_bf16::blocked:BA4b4a:f0,attr-fpmath:bf16 ,,1024x1024,0.845947
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:ab:f0 wei_bf16::blocked:BA4b4a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-fpmath:bf16 ,,128x1024:1024x1024:128x1024,0.729004
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:ab:f0 wei_bf16::blocked:BA4b4a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-fpmath:bf16 ,,128x1024:1024x1024:128x1024,0.718018
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:ab:f0 wei_bf16::blocked:BA4b4a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-fpmath:bf16 ,,128x1024:1024x1024:128x1024,0.73999
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab:f0 dst_bf16::blocked:BA4b4a:f0,attr-fpmath:bf16 ,,1024x4096,3.78198
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abdc:f0 dst_bf16::blocked:abDC8d4c:f0,attr-fpmath:bf16 ,,1x16x64x128,0.0649414
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:abcd:f0 wei_bf16::blocked:abDC8d4c:f0 dst_f32::blocked:abcd:f0,attr-scratchpad:user attr-fpmath:bf16 ,,1x16x128x64:1x16x64x128:1x16x128x128,0.132812
onednn_verbose,exec,cpu,softmax_v2,acl,forward_inference,src_f32::blocked:abcd:f0 dst_f32::blocked:abcd:f0,attr-fpmath:bf16 ,alg:softmax_accurate axis:3,1x16x128x128,0.403076
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abdc:f0 dst_bf16::blocked:abDC4d4c:f0,attr-fpmath:bf16 ,,1x16x128x64,0.0610352
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:abcd:f0 wei_bf16::blocked:abDC4d4c:f0 dst_f32::blocked:abcd:f0,attr-scratchpad:user attr-fpmath:bf16 ,,1x16x128x128:1x16x128x64:1x16x128x64,0.145996
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab:f0 dst_bf16::blocked:BA4b4a:f0,attr-fpmath:bf16 ,,1024x1024,0.796875
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:ab:f0 wei_bf16::blocked:BA4b4a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-fpmath:bf16 ,,128x1024:1024x4096:128x4096,2.62402
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:ab:f0 wei_bf16::blocked:BA4b4a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-fpmath:bf16 ,,128x1024:1024x1024:128x1024,0.680908
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab:f0 dst_bf16::blocked:BA4b4a:f0,attr-fpmath:bf16 ,,1024x4096,3.87988
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:ab:f0 dst_bf16::blocked:BA8b4a:f0,attr-fpmath:bf16 ,,4096x1024,5.1709
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:ab:f0 wei_bf16::blocked:BA4b4a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-fpmath:bf16 ,,128x1024:1024x4096:128x4096,2.77515
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32::blocked:ab:f0 wei_bf16::blocked:BA8b4a:f0 dst_f32::blocked:ab:f0,attr-scratchpad:user attr-fpmath:bf16 ,,128x4096:4096x1024:128x1024,2.35107
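For context, a minimal sketch (assuming the oneDNN v3.x C++ API) of the flow that produces log lines like the above: the weights arrive as plain `ab` f32, the matmul primitive is created with fpmath mode bf16 and is allowed to pick its preferred weights layout, and a reorder bridges the two. That reorder is the `jit:uni` line in the logs and is what this issue asks ACL to accelerate. Shapes follow the 128x1024 * 1024x1024 case.

```cpp
#include <unordered_map>
#include <vector>
#include "oneapi/dnnl/dnnl.hpp"

int main()
{
    using namespace dnnl;
    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim M = 128, K = 1024, N = 1024;
    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);
    // Let the primitive pick the weights layout (e.g. a bf16 blocked format on ACL).
    memory::desc wei_any({K, N}, memory::data_type::bf16, memory::format_tag::any);

    primitive_attr attr;
    attr.set_fpmath_mode(fpmath_mode::bf16); // allow f32 -> bf16 downconversion

    matmul::primitive_desc pd(eng, src_md, wei_any, dst_md, attr);

    std::vector<float> wei_f32(K * N, 1.0f);
    memory wei_plain({{K, N}, memory::data_type::f32, memory::format_tag::ab},
                     eng, wei_f32.data());
    memory wei_packed(pd.weights_desc(), eng);
    // The f32 -> bf16 blocked reorder: the expensive "jit:uni" step in the logs.
    reorder(wei_plain, wei_packed).execute(strm, wei_plain, wei_packed);

    memory src(src_md, eng), dst(dst_md, eng);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC, src},
                              {DNNL_ARG_WEIGHTS, wei_packed},
                              {DNNL_ARG_DST, dst}});
    strm.wait();
}
```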

To reproduce the behavior, run inference for any TF BERT sentiment-analysis model from Hugging Face or TF Hub.

snadampal commented 11 months ago

Hi @morgolock, can you please let me know when ACL 23.08 is coming? It would be great if you could share any details on this bf16 reorder feature. Thank you!

morgolock commented 10 months ago

Hi @snadampal

23.08 is going to be out in the next few days.

We are going to look into your request to provide a function to convert and reorder the weights.

snadampal commented 10 months ago

Thanks for the update, @morgolock. Does that mean the bf16 reorder feature is not going to be part of ACL 23.08, but of the next release?

morgolock commented 10 months ago

Hi @snadampal

That's correct, it won't be present in 23.08.

snadampal commented 10 months ago

Hi @morgolock , can I expect it in ACL 23.11?

morgolock commented 9 months ago

Hi @snadampal

We need to look at this and get a clear idea of the performance gains this work will bring. It's likely to be present in 24.02.

In the figures you shared above, could you please let us know what system you ran the model on? And what unit is used for the last value in each row, ms?

Hope this helps

snadampal commented 9 months ago

Hi @morgolock, the platform was a c7g.8xl and the execution latency is in milliseconds. The dnnl logs also include the tensor shape info.

Platform: AWS Graviton3 based c7g.8xl

Operating System: Ubuntu 20.04

renato-arantes commented 7 months ago

Hi @snadampal,

We have a PoC implementation of FP32->BF16 reorders in ACL, which you can find here; it is under review.

We have tested with the NLP models we have access to, and for deberta-large we saw a 10% performance improvement with 8 threads on AWS Graviton3, where the two reorders (from ab to BA8b4a and from ab to BA4b4a) for size 1024x1024 are 2.4 times faster than the jitted version from oneDNN.
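For illustration, here is a hedged sketch (not the PR's code) of the kind of Neon inner loop such a kernel can use on Armv8.6-class cores like Graviton3, where the BFCVT instructions convert fp32 to bf16 in hardware; `vcvtq_low_bf16_f32`, `vcvtq_high_bf16_f32`, and `vcvth_bf16_f32` are the ACLE intrinsics for them. Compile with e.g. `-march=armv8.6-a+bf16`.

```cpp
#include <arm_neon.h>

void convert_f32_to_bf16_neon(const float *src, bfloat16_t *dst, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8)
    {
        float32x4_t lo = vld1q_f32(src + i);
        float32x4_t hi = vld1q_f32(src + i + 4);
        // Convert four lanes into the low half of the vector, then four
        // more into the high half, and store eight bf16 values at once.
        bfloat16x8_t v = vcvtq_low_bf16_f32(lo);
        v = vcvtq_high_bf16_f32(v, hi);
        vst1q_bf16(dst + i, v);
    }
    for (; i < n; ++i) // scalar tail for lengths not divisible by 8
        dst[i] = vcvth_bf16_f32(src[i]);
}
```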

Could you please check this PR on your side?

snadampal commented 7 months ago

Thanks for the update, @renato-arantes , I will give it a try.

renato-arantes commented 5 months ago

Hi @snadampal,

Our work on the FP32->BF16 reorder has been merged in ACL here and in oneDNN here. Do you have any feedback to provide? Can we close this issue?

snadampal commented 5 months ago

Hi @renato-arantes, sorry for the delay; I will try to provide feedback by the end of this week. By the way, these changes look different from the patches you've been maintaining in arm-tool-solutions (here); are you planning to update them to have an Arm docker build for validation? I will try to take the merged PRs directly to the latest TensorFlow and see if they are compatible.

snadampal commented 4 months ago

Hi @renato-arantes, please go ahead and close this issue. I will add my observations once these changes are part of TF. Thank you!