[Performance] Performance degradation with ZenDNN

aj-prime commented 5 months ago

Describe the issue

I followed the installation instructions described in section 4 of the README.

Processor Name: AMD EPYC 7V13 64-Core Processor (Azure Cloud)

Performance (QPS): ZenDNN: 34, CPU: 77

To reproduce

CPU: python -m onnxruntime.transformers.benchmark -m bert-large-uncased --model_class AutoModel -p fp32 -i 3 -t 10 -b 24 -s 16 -n 96 -v --provider cpu

ZenDNN: python -m onnxruntime.transformers.benchmark -m bert-large-uncased --model_class AutoModel -p fp32 -i 3 -t 10 -b 24 -s 16 -n 96 -v --provider zendnn

I also tried 64 threads, but that resulted in worse performance.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04.4 LTS

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-zendnn:1.17.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Unknown

ajeet1203singh commented 5 months ago

Hello @aj-prime, can you please let us know which environment variables you are using? Can you also confirm that you followed the "Optimal Memory Allocator Settings Specific to ONNXRT" section of the user guide?

For this case, our recommended settings are:

export GOMP_CPU_AFFINITY=0-63
export OMP_NUM_THREADS=64
export OMP_WAIT_POLICY=ACTIVE
export OMP_PROC_BIND=FALSE
export OMP_DYNAMIC=FALSE
export ZENDNN_MATMUL_ALGO=FP32:4
export LD_PRELOAD=$ZENDNN_PARENT_FOLDER/openmp-10.0.1.src/runtime/src/libomp.so

with the transparent huge pages (THP) setting set to "always".
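For anyone who would rather drive the benchmark from a script, the settings above could be applied roughly as follows. This is only a sketch under assumptions, not an official ZenDNN helper: the function names (zendnn_env, active_thp_mode, benchmark_command, run_benchmark) are made up for illustration, the affinity range is assumed to be 0 to num_threads - 1, the -n value is assumed to match OMP_NUM_THREADS, and ZENDNN_PARENT_FOLDER is assumed to be already exported as in the user guide. The THP mode is usually read from /sys/kernel/mm/transparent_hugepage/enabled on Linux.

```python
import os
import subprocess
import sys


def zendnn_env(num_threads=64, base_env=None):
    """Return a copy of the environment with the recommended settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        # Assumption: pin to cores 0..num_threads-1, as in the 0-63 example.
        "GOMP_CPU_AFFINITY": f"0-{num_threads - 1}",
        "OMP_NUM_THREADS": str(num_threads),
        "OMP_WAIT_POLICY": "ACTIVE",
        "OMP_PROC_BIND": "FALSE",
        "OMP_DYNAMIC": "FALSE",
        "ZENDNN_MATMUL_ALGO": "FP32:4",
    })
    # LD_PRELOAD is only set when ZENDNN_PARENT_FOLDER is defined.
    parent = env.get("ZENDNN_PARENT_FOLDER")
    if parent:
        env["LD_PRELOAD"] = f"{parent}/openmp-10.0.1.src/runtime/src/libomp.so"
    return env


def active_thp_mode(text):
    """Extract the bracketed mode from text such as 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def benchmark_command(provider, num_threads=64):
    """Build the benchmark command from the issue for the given provider."""
    return [
        sys.executable, "-m", "onnxruntime.transformers.benchmark",
        "-m", "bert-large-uncased", "--model_class", "AutoModel",
        "-p", "fp32", "-i", "3", "-t", "10", "-b", "24", "-s", "16",
        # Assumption: -n matches OMP_NUM_THREADS; the issue itself used 96.
        "-n", str(num_threads), "-v", "--provider", provider,
    ]


def run_benchmark(provider, num_threads=64):
    """Launch the benchmark in a child process that inherits the settings."""
    return subprocess.run(
        benchmark_command(provider, num_threads),
        env=zendnn_env(num_threads),
        check=True,
    )
```

Calling run_benchmark("zendnn") would start the benchmark with the variables in place. The environment is passed to a child process because LD_PRELOAD only takes effect when a process starts, so setting it inside an already running Python interpreter would be too late.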

aj-prime commented 5 months ago

Thanks @ajeet1203singh. Using the Optimal Memory Allocator Settings resolved the issue.

lauthu commented 3 months ago

Hello @ajeet1203singh and @aj-prime, can you please share some numbers on the expected throughput improvement of ZenDNN 4.2?

I'm also trying to run the transformer benchmark, and I got a similar result (ZenDNN is slower than the CPU Execution Provider).

amd / ZenDNN-onnxruntime