ROCm / rocWMMA

rocWMMA
https://rocm.docs.amd.com/projects/rocWMMA/
MIT License

MI100 performance #289

Closed · briansp2020 closed this issue 12 months ago

briansp2020 commented 12 months ago

What is the expected performance on MI100? I was expecting a much higher number, since its theoretical peak is more than 180 TF. I got higher numbers when testing a 7900XTX, even though it has a lower theoretical peak performance!

./perf_sgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 165.95, 736.587, 22.193
Finished!

./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 78.1733, 736.587, 47.1124
Finished!

cgmillette commented 12 months ago

Hi @briansp2020, Thanks for reaching out!

Which release of ROCm are you using? With that I can try to reproduce the performance you are seeing on MI-100.

NB: For this particular sample, you have to be careful with supported block sizes on 7900XTX, as RDNA cards only support blockM/N of 16. The benchmark will run, but it won't validate successfully in debug mode. The challenge is that 'high performing' GEMMs may have different parameters on different architectures. This issue has been reported, and will be addressed in a future release.
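
For reference, here is a minimal sketch of how the block size shows up in rocWMMA fragment types. It assumes the standard rocwmma fragment template from rocwmma.hpp; the layouts and BlockK value are illustrative, not necessarily what the perf samples use:

#include <rocwmma/rocwmma.hpp>

// BlockM/N = 16 is supported on both RDNA (e.g. 7900XTX) and CDNA (e.g. MI-100);
// BlockM/N = 32 is CDNA-only. BlockK = 16 is just an illustrative value here.
constexpr int BlkM = 16, BlkN = 16, BlkK = 16;

using FragA   = rocwmma::fragment<rocwmma::matrix_a, BlkM, BlkN, BlkK,
                                  rocwmma::float16_t, rocwmma::row_major>;
using FragB   = rocwmma::fragment<rocwmma::matrix_b, BlkM, BlkN, BlkK,
                                  rocwmma::float16_t, rocwmma::col_major>;
using FragAcc = rocwmma::fragment<rocwmma::accumulator, BlkM, BlkN, BlkK,
                                  rocwmma::float32_t>;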

cgmillette commented 12 months ago

For example, I just ran the sample on MI-100 with the ROCm 5.6 release and it achieved close to 90 TFlops, which appears typical for this release.

Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 41.825, 736.587, 88.0557
Finished!

briansp2020 commented 12 months ago

@cgmillette Did you have to specify any parameters to get 90 TF? I think I specified command line parameters when I ran the test on my 7900XTX before. Unfortunately, I forgot what the parameters were (I got them from an internet search and did not write them down. Doh!). Or are you supposed to get close to 90 TF just by running perf_sgemm? I'm using a second hand MI100 I got off eBay in my 7900XT PC and built rocWMMA from git today.

Since I have your attention, I'd like to ask some questions. I'm trying to figure out how fast 7900XTX will eventually become when the software matures. I'm trying to compare it to MI100 since I'm assuming MI100 software support is mature and its theoretical fp16 peak number is similar. So far I have run some micro-benchmarks and am getting conflicting results.

Using TensorFlow, MI100 is much faster with CNNs

root@rocm:/root/benchmarks/scripts/tf_cnn_benchmarks# python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

TensorFlow: 2.13
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 128 global
            128 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/gpu:0']
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server

Running warm up
Done warm up
Step    Img/sec                                         total_loss
1       images/sec: 1135.2 +/- 0.0 (jitter = 0.0)       7.788
10      images/sec: 1138.1 +/- 1.0 (jitter = 3.5)       7.743
20      images/sec: 1138.7 +/- 0.7 (jitter = 4.3)       7.823
30      images/sec: 1138.5 +/- 0.5 (jitter = 3.4)       7.963
40      images/sec: 1138.2 +/- 0.4 (jitter = 2.4)       7.889
50      images/sec: 1137.9 +/- 0.4 (jitter = 2.4)       7.787
60      images/sec: 1137.7 +/- 0.4 (jitter = 2.4)       8.015
70      images/sec: 1137.3 +/- 0.4 (jitter = 2.9)       7.876
80      images/sec: 1137.1 +/- 0.3 (jitter = 2.9)       7.931
90      images/sec: 1136.8 +/- 0.3 (jitter = 3.3)       7.734
100     images/sec: 1136.4 +/- 0.3 (jitter = 3.3)       7.987

total images/sec: 1136.18

compared to 7900XTX

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --use_fp16=True --model=resnet50

Running warm up
Done warm up
Step    Img/sec                                         total_loss
1       images/sec: 706.8 +/- 0.0 (jitter = 0.0)        7.444
10      images/sec: 703.8 +/- 1.9 (jitter = 1.8)        7.422
20      images/sec: 703.2 +/- 1.0 (jitter = 2.4)        7.468
30      images/sec: 703.1 +/- 0.8 (jitter = 3.2)        7.564
40      images/sec: 702.9 +/- 0.8 (jitter = 3.1)        7.518
50      images/sec: 703.0 +/- 0.6 (jitter = 3.1)        7.447
60      images/sec: 703.4 +/- 0.6 (jitter = 3.1)        7.603
70      images/sec: 703.1 +/- 0.5 (jitter = 3.6)        7.516
80      images/sec: 703.0 +/- 0.5 (jitter = 3.7)        7.560
90      images/sec: 703.0 +/- 0.4 (jitter = 3.6)        7.433
100     images/sec: 702.9 +/- 0.4 (jitter = 3.7)        7.601

total images/sec: 702.71

But the PyTorch micro benchmark gives the opposite result. MI100:

python micro_benchmarking_pytorch.py --network convnext_small --fp16 1
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP16
Mini batch size [img] : 64
Time per mini-batch : 0.33329823017120364
Throughput [img/sec] : 192.02022155090785

7900XTX

python micro_benchmarking_pytorch.py --network convnext_small --fp16 1
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP16
Mini batch size [img] : 64
Time per mini-batch : 0.2943673968315125
Throughput [img/sec] : 217.41538189649373

When running more real-world tasks (e.g. https://github.com/fastai/course22/issues/96), MI100 and 7900XTX seem to perform very similarly. Do you expect that the 7900XTX will eventually perform better than MI100, as it does in the PyTorch micro benchmark, or does MI100 still need more optimization? new-ai-benchmark shows the 7900XTX to be faster than MI100 (see this and this), even though MI100 has much higher theoretical fp16 performance.

Also, if you know of any document that shows a relative performance of different AMD hardware for ML tasks, I'd really like to see it.

Thank you!

cgmillette commented 12 months ago

Hi @briansp2020,

No, I built with cmake with no special parameters, just like in the README.md:

CC=hipcc CXX=hipcc cmake -B<build_dir> . -DAMDGPU_TARGETS=gfx908:xnack-
cd <build_dir>
make perf_hgemm

Just need to clarify:

The sample that I ran is perf_hgemm, which is the fp16 input datatype (hgemm = fp16). This is supported on both 7900XTX (blockM/N = 16), and MI-100 (BlockM/N = 16, 32).

I noticed that you previously ran perf_sgemm, which uses the fp32 input datatype (sgemm = fp32). This is not supported on 7900XTX, however it is supported on MI-100 (BlockM/N = 16, 32).

Please note that performance for these two datatypes is very different, and the supported block sizes differ as well.
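
To make the datatype difference concrete, here is a hedged sketch of a single-tile multiply-accumulate with fp16 inputs and fp32 accumulation; the accumulator type, layouts, and pointer/ld names are assumptions for illustration, not necessarily what the perf samples actually use:

#include <hip/hip_runtime.h>
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

// Minimal single-tile "hgemm-style" kernel sketch: fp16 inputs, fp32 accumulation.
// a, b, d and ld are hypothetical arguments describing 16x16 tiles.
__global__ void hgemm_tile(float16_t const* a, float16_t const* b, float32_t* d, int ld)
{
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float32_t>                  fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);            // start accumulator at zero
    rocwmma::load_matrix_sync(fragA, a, ld);           // load one A tile
    rocwmma::load_matrix_sync(fragB, b, ld);           // load one B tile
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc); // D = A * B + D on the matrix units
    rocwmma::store_matrix_sync(d, fragAcc, ld, rocwmma::mem_row_major);
}

// An sgemm-style variant would use float32_t for fragA/fragB, which is why
// perf_sgemm runs on MI-100 but not on the 7900XTX.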

Comparing 7900XTX with MI-100 is not quite an "apples-to-apples" exercise.

The first major difference is the architectures - the former being RDNA, and the latter being CDNA.

Both have matrix-multiply functionality; however, RDNA cards are consumer-grade gaming / graphics cards, while CDNA cards are data-center and HPC centric.

The two cards have vastly different properties in terms of CU count, clocks, memory type, capacity and bandwidth. This means that either of them might have an advantage depending on what kind of AI workload you are running. If your problem is compute-bound, you might see better performance with higher clocks and higher CU count. Alternatively, if your problem is memory-bound, you may see better performance with more memory and higher bandwidth.
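
One rough way to reason about this is to compare a kernel's arithmetic intensity against each card's machine balance (peak flops divided by memory bandwidth). A small sketch follows; the peak and bandwidth numbers are placeholders to be replaced with real spec-sheet values, and the byte count assumes a single ideal pass over the operands:

#include <cstdio>

int main()
{
    // GEMM dimensions matching the perf sample above.
    const double M = 7168, N = 7168, K = 7168;
    const double bytes_per_elem = 2.0;                       // fp16 inputs

    double flops     = 2.0 * M * N * K;                      // multiply-adds counted as 2 flops
    double bytes     = bytes_per_elem * (M * K + K * N + M * N); // one ideal pass over A, B, D
    double intensity = flops / bytes;                         // flops per byte of the problem

    double peak_tflops   = 180.0;                             // placeholder peak matrix throughput
    double bandwidth_tbs = 1.2;                               // placeholder memory bandwidth, TB/s
    double machine_balance = peak_tflops / bandwidth_tbs;     // flops per byte the card can sustain

    // If intensity >> machine balance, the problem is compute-bound on that card;
    // if it is below, the card's memory bandwidth becomes the limiter.
    std::printf("intensity %.1f flop/byte vs machine balance %.1f flop/byte\n",
                intensity, machine_balance);
    return 0;
}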

Because of the vastness of different AI problems, there unfortunately is no blanket solution to winning all benchmarks. We can only try to pick the best tool (card) for the particular job.

rocWMMA's job in the meantime is to enable users to leverage matrix-multiply hardware, so our focus is on MFMA / WMMA enablement and performance. For other tools such as PyTorch or TensorFlow, their teams could give better answers to questions about their particular benchmarks.

Cheers!

briansp2020 commented 12 months ago

@cgmillette Thank you for pointing out my mistake. I ran perf_hgemm on MI100 and the numbers are much more reasonable, though still much slower than they should be. Do you have any idea why I may be getting such a low score? I'm using a Ryzen 7900X and a cooler from eBay. It does not get too hot, so I don't think it's a cooling issue...

./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 78.143, 736.587, 47.1307

cgmillette commented 12 months ago

@briansp2020 Can you tell me which version of ROCm you are using?

briansp2020 commented 12 months ago

A 5.7.1 docker I built (based on this), running on Ubuntu 22.04.3 server with kernel 5.15 + ROCm 5.7.1 dkms. I'm now building a 5.6.1 docker to try. If I still get low performance with 5.6.1, I'll try downgrading the kernel module to 5.6 and see if that helps. But it's easier to just try the 5.6 userland stuff first. Thank you!

briansp2020 commented 12 months ago

I just ran it using the 5.6.1 docker I built and the result looks broken. Do I need to match the docker userland files with the kernel module?

./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 0.0072, 736.587, 511519
Finished!

briansp2020 commented 12 months ago

@cgmillette I tried rocm/pytorch:latest and am getting the following, which looks much better. I guess the issue is my docker container, even though I have no idea why the docker I built is having issues. I'll keep investigating. Thank you for your help.

./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 39.0735, 736.587, 94.2566
Finished!

cgmillette commented 12 months ago

Right on! Most welcome, and take care