PyTorch Hugging Face Models do not have ACL calls for Docker versions > 23.05

ARM-software / Tool-Solutions

Tutorials & examples for Arm software development tools.

Apache License 2.0


PyTorch Hugging Face Models do not have ACL calls for Docker versions > 23.05 #200

Open abhishek-rn opened 1 year ago

abhishek-rn commented 1 year ago

Hi,

Docker tags: r23.09-torch-2.0.0-onednn-acl, r23.05-torch-2.0.0-onednn-acl

I am unable to get ACL calls in Docker versions higher than 23.05 for PyTorch Hugging Face models.

I am attaching the oneDNN verbose logs for the BERT model here: 23.05_Bert_Verbose.txt, 23.09_Bert_Verbose.txt

The code to reproduce this is attached below:
PyT_Bert_Training.txt: use this for the first run to generate the necessary inference checkpoints and files.
PyT_Bert_Inf.txt: use this for subsequent runs to generate the oneDNN logs.
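For anyone reproducing this without the attachments, the log capture step can be sketched roughly as follows. This is a minimal sketch, not the attached script: the helper name `run_with_onednn_verbose` is hypothetical, and it assumes the oneDNN build in the image honours the `ONEDNN_VERBOSE` environment variable (the older `DNNL_VERBOSE` name is set as well, in case the bundled oneDNN only reads that one).

```python
import os
import subprocess


def run_with_onednn_verbose(command, log_path):
    """Run `command` with oneDNN verbose mode enabled and save its output.

    `command` is a list such as [sys.executable, "some_script.py"].
    Returns the exit code of the child process.
    """
    env = dict(os.environ)
    # Assumption: the oneDNN build reads one of these two variables.
    env["ONEDNN_VERBOSE"] = "1"
    env["DNNL_VERBOSE"] = "1"  # legacy name used by older oneDNN releases

    # oneDNN writes its verbose lines to stdout; stderr is merged in so
    # that warnings end up in the same log file.
    result = subprocess.run(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    with open(log_path, "w") as log_file:
        log_file.write(result.stdout)
    return result.returncode
```

A call might look like `run_with_onednn_verbose([sys.executable, "bert_inf.py"], "bert_verbose.txt")`, where `bert_inf.py` is a placeholder for the inference script saved from the attachment.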

As a result, the oneDNN verbose output from the later image shows gemm:jit calls for the matmuls instead of gemm:acl calls, which results in poorer inference performance.
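A quick way to compare two verbose logs is to count which implementation each matmul was dispatched to. The snippet below is a rough sketch under stated assumptions: the function name `count_implementations` is hypothetical, the sample lines are made up for illustration (they are not taken from the attached logs), and it assumes the fields after the `exec` marker are engine, primitive kind, and implementation, in that order.

```python
from collections import Counter


def count_implementations(log_text, primitive="matmul"):
    """Count implementation names for one primitive kind in oneDNN verbose output."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.strip().split(",")
        # Verbose lines start with "onednn_verbose" (or "dnnl_verbose" in older releases).
        if not fields[0].endswith("_verbose"):
            continue
        if "exec" not in fields:
            continue
        i = fields.index("exec")
        # Assumed layout after "exec": engine, primitive kind, implementation.
        if len(fields) > i + 3 and fields[i + 2] == primitive:
            counts[fields[i + 3]] += 1
    return counts


# Illustrative lines only; shapes and timings are invented.
SAMPLE_LOG = """\
onednn_verbose,info,cpu,runtime:OpenMP,nthr:16
onednn_verbose,exec,cpu,matmul,gemm:acl,undef,src_f32 wei_f32 dst_f32,,,8x128x768:8x768x768,1.23
onednn_verbose,exec,cpu,matmul,gemm:jit,undef,src_f32 wei_f32 dst_f32,,,8x128x768:8x768x768,4.56
onednn_verbose,exec,cpu,matmul,gemm:jit,undef,src_f32 wei_f32 dst_f32,,,8x128x768:8x768x768,4.61
onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32 dst_f32,,,768x768,0.10
"""

matmul_counts = count_implementations(SAMPLE_LOG)
print(dict(matmul_counts))
```

Running the same function over the 23.05 and 23.09 logs would show at a glance whether any gemm:acl entries remain in the newer image.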

Thanks

nSircombe commented 1 year ago

Hi @abhishek-rn, thanks for the report. The transition from 23.05 to 23.06 marks the move from PyTorch 1.x to 2.x, so it looks like we may have lost some functionality at this stage. Would you be able to confirm whether the same behaviour is present if you use the pip-installed PyTorch packages for 1.13 and 2.0 on AArch64, and also on x86?

abhishek-rn commented 1 year ago

Hi @nSircombe, the Docker tag reads r23.05-torch-2.0.0-onednn-acl, so I assumed that meant torch 2.0.0. However, I ran the pip-installed PyTorch 2.0.0 and 1.13; please find the logs below: ARM_PyT_1.13_Bert_Verbose.txt, ARM_PyT_2.0.0_Bert_Verbose.txt

The results there show that PyTorch 1.13 has no ACL calls, but PyTorch 2.0.0 does.

x86_PyT_1.13_Bert_Verbose.txt, x86_PyT_2.0.0_Bert_Verbose.txt

Also, x86 PyTorch does not have oneDNN calls for matmuls, as seen in the x86 logs above.

nSircombe commented 1 year ago

Yes, you're right, the version is 2.0. The tag is correct: it matches the version in the Dockerfile. The mistake is in the README for the 23.05 increment here, which still has 1.3.