Closed by briansp2020 12 months ago
Hi @briansp2020, Thanks for reaching out!
Which release of ROCm are you using? Then I can see if I can reproduce the performance you are seeing on MI-100.
NB: For this particular sample, you have to be careful with supported block sizes on the 7900XTX, as RDNA cards only support blockM/N of 16. The benchmark will run, but it won't validate successfully in debug mode. The challenge is that 'high-performing' GEMMs may have different parameters on different architectures. This issue has been reported and will be addressed in a future release.
For example, I just ran the sample on MI-100 around the ROCm 5.6 release and achieved close to 90 TFlops, which appears typical for this release.
```
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 41.825, 736.587, 88.0557
Finished!
```
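As a side note, the printed figures are internally consistent and can be sanity-checked: the "Problem Size" column is 2·M·N·K FLOPs, and the TFlops/s figure is consistent with the elapsed time covering several kernel launches. The launch count of 5 below is inferred from the numbers, not taken from the sample's source:

```python
# Sanity-check the perf_hgemm output above.
# Assumption: the elapsed time appears to cover several kernel launches;
# 5 launches per timing reproduces the printed TFlops/s, but that count
# is my inference, not documented by the sample.
M = N = K = 7168
elapsed_ms = 41.825   # elapsedMs from the MI-100 run above
launches = 5          # assumed

gflops = 2 * M * N * K / 1e9             # FLOPs of one GEMM, in GFlops
tflops = gflops * launches / elapsed_ms  # GFlops/ms == TFlops/s

print(f"Problem size: {gflops:.3f} GFlops")    # matches 736.587
print(f"Throughput:   {tflops:.3f} TFlops/s")  # matches ~88.056
```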
@cgmillette Did you have to specify any parameters to get 90TF? I think I specified command-line parameters when I ran the test on my 7900XTX before. Unfortunately, I forgot what the parameters were (I got them from an internet search and did not write them down. Doh!). Or are you supposed to get close to 90TF just by running perf_sgemm? I'm using a second-hand MI100 I got off eBay in my 7900XT PC, and I built rocWMMA from git today.
Since I have your attention, I'd like to ask some questions. I'm trying to figure out how fast the 7900XTX will eventually become once the software matures. I'm comparing it to the MI100 on the assumption that MI100 software support is mature and that their theoretical fp16 peak numbers are similar. So far I have run some micro-benchmarks and am getting conflicting results.
Using TensorFlow, the MI100 is much faster with CNNs:
```
root@rocm:/root/benchmarks/scripts/tf_cnn_benchmarks# python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

TensorFlow:  2.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  128 global, 128 per device
Num batches: 100
Num epochs:  0.01
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server

Running warm up
Done warm up
Step  Img/sec                                      total_loss
1     images/sec: 1135.2 +/- 0.0 (jitter = 0.0)    7.788
10    images/sec: 1138.1 +/- 1.0 (jitter = 3.5)    7.743
20    images/sec: 1138.7 +/- 0.7 (jitter = 4.3)    7.823
30    images/sec: 1138.5 +/- 0.5 (jitter = 3.4)    7.963
40    images/sec: 1138.2 +/- 0.4 (jitter = 2.4)    7.889
50    images/sec: 1137.9 +/- 0.4 (jitter = 2.4)    7.787
60    images/sec: 1137.7 +/- 0.4 (jitter = 2.4)    8.015
70    images/sec: 1137.3 +/- 0.4 (jitter = 2.9)    7.876
80    images/sec: 1137.1 +/- 0.3 (jitter = 2.9)    7.931
90    images/sec: 1136.8 +/- 0.3 (jitter = 3.3)    7.734
100   images/sec: 1136.4 +/- 0.3 (jitter = 3.3)    7.987

total images/sec: 1136.18
```
compared to 7900XTX
```
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --use_fp16=True --model=resnet50

Running warm up
Done warm up
Step  Img/sec                                     total_loss
1     images/sec: 706.8 +/- 0.0 (jitter = 0.0)    7.444
10    images/sec: 703.8 +/- 1.9 (jitter = 1.8)    7.422
20    images/sec: 703.2 +/- 1.0 (jitter = 2.4)    7.468
30    images/sec: 703.1 +/- 0.8 (jitter = 3.2)    7.564
40    images/sec: 702.9 +/- 0.8 (jitter = 3.1)    7.518
50    images/sec: 703.0 +/- 0.6 (jitter = 3.1)    7.447
60    images/sec: 703.4 +/- 0.6 (jitter = 3.1)    7.603
70    images/sec: 703.1 +/- 0.5 (jitter = 3.6)    7.516
80    images/sec: 703.0 +/- 0.5 (jitter = 3.7)    7.560
90    images/sec: 703.0 +/- 0.4 (jitter = 3.6)    7.433
100   images/sec: 702.9 +/- 0.4 (jitter = 3.7)    7.601

total images/sec: 702.71
```
But the PyTorch micro-benchmark gives the opposite result. MI100:
```
python micro_benchmarking_pytorch.py --network convnext_small --fp16 1
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP16
Mini batch size [img] : 64
Time per mini-batch : 0.33329823017120364
Throughput [img/sec] : 192.02022155090785
```
7900XTX
```
python micro_benchmarking_pytorch.py --network convnext_small --fp16 1
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : convnext_small
Num devices: 1
Dtype: FP16
Mini batch size [img] : 64
Time per mini-batch : 0.2943673968315125
Throughput [img/sec] : 217.41538189649373
```
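For what it's worth, the two summary throughput lines follow directly from the mini-batch size and the time per mini-batch, so the figures can be cross-checked:

```python
# Reproduce the Throughput [img/sec] lines from the two summaries above:
# throughput = mini-batch size / time per mini-batch.
batch = 64
runs = {
    "MI100":   0.33329823017120364,  # time per mini-batch [s]
    "7900XTX": 0.2943673968315125,
}
for card, t in runs.items():
    print(f"{card}: {batch / t:.2f} img/sec")  # 192.02 and 217.42
```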
When running more real-world tasks (e.g. https://github.com/fastai/course22/issues/96), the MI100 and 7900XTX seem to perform very similarly. Do you expect that the 7900XTX will eventually outperform the MI100, as it does in the PyTorch micro-benchmark, or does the MI100 still need more optimization? new-ai-benchmark also shows the 7900XTX to be faster than the MI100 (see this and this), even though the MI100 has much higher theoretical fp16 performance.
Also, if you know of any document that shows the relative performance of different AMD hardware on ML tasks, I'd really like to see it.
Thank you!
Hi @briansp2020,
No, I built from cmake with no special parameters, just as in the README.md:
```
CC=hipcc CXX=hipcc cmake -B<build_dir> . -DAMDGPU_TARGETS=gfx908:xnack-
cd <build_dir>
make perf_hgemm
```
Just need to clarify:
The sample that I ran is perf_hgemm, which uses the fp16 input datatype (hgemm = fp16). This is supported on both the 7900XTX (blockM/N = 16) and the MI-100 (blockM/N = 16, 32).
I noticed that you previously ran perf_sgemm, which uses the fp32 input datatype (sgemm = fp32). This is not supported on the 7900XTX; however, it is supported on the MI-100 (blockM/N = 16, 32).
Please note that the performance of these two datatypes is very different, and the supported block sizes differ as well.
Comparing 7900XTX with MI-100 is not quite an "apples-to-apples" exercise.
The first major difference is the architectures - the former being RDNA, and the latter being CDNA.
They both have matrix-multiply functionality; however, RDNA cards are consumer-grade gaming/graphics cards, while CDNA cards are data-center, HPC-centric parts.
Each of the two cards has vastly different properties in terms of CU count, clocks, memory type, capacity, and bandwidth. This means that either of them might have an advantage depending on what kind of AI workload you are running. If your problem is compute-bound, you might see better performance with higher clocks and a higher CU count. Alternatively, if your problem is memory-bound, you may see better performance with more memory and higher bandwidth.
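One back-of-the-envelope way to reason about compute-bound vs. memory-bound is a roofline-style "ridge point" comparison. The peak FP16 throughput and memory bandwidth figures below are approximate public spec numbers I'm assuming for illustration, not values from this thread:

```python
# Rough roofline-style sketch: the "ridge point" is the arithmetic
# intensity (FLOPs per byte moved) below which a kernel is limited by
# memory bandwidth rather than compute.
# Peak figures are approximate public spec numbers (assumptions).
cards = {
    "MI100":   {"fp16_tflops": 184.6, "bw_gbs": 1228.8},  # CDNA, HBM2
    "7900XTX": {"fp16_tflops": 122.8, "bw_gbs": 960.0},   # RDNA3, GDDR6
}

for name, c in cards.items():
    ridge = (c["fp16_tflops"] * 1e12) / (c["bw_gbs"] * 1e9)
    print(f"{name}: memory-bound below ~{ridge:.0f} FLOPs/byte")
```

Under these assumed numbers, both cards need well over 100 FLOPs of work per byte of memory traffic before compute becomes the limiter, which is why large GEMMs can approach peak while many real-network layers cannot.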
Because of the vastness of different AI problems, there unfortunately is no blanket solution to winning all benchmarks. We can only try to pick the best tool (card) for the particular job.
rocWMMA's job in the meantime is to enable users to leverage matrix-multiply hardware, so our focus is on MFMA/WMMA enablement and performance. Tools such as PyTorch or TensorFlow could give better answers to questions about their particular benchmarks.
Cheers!
@cgmillette Thank you for pointing out my mistake. I ran perf_hgemm on the MI100 and the numbers are much more reasonable, though still much slower than they should be. Do you have any idea why I may be getting such a low score? I'm using a Ryzen 7900X and a cooler from eBay. It does not get too hot, so I don't think it's a cooling issue...
```
./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 78.143, 736.587, 47.1307
```
@briansp2020 Can you tell me which version of ROCm you are using?
A ROCm 5.7.1 docker image I built (based on this), running on Ubuntu 22.04.3 server with kernel 5.15 + the ROCm 5.7.1 DKMS module. I'm now building a 5.6.1 docker image to try. If I still get low performance with 5.6.1, I'll try downgrading the kernel module to 5.6 and see if that helps. But it's easier to just try the 5.6 userland first. Thank you!
I just ran it using the 5.6.1 docker image I built, and the result looks broken. Do I need to match the docker userland files with the kernel module?
```
./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 0.0072, 736.587, 511519
Finished!
```
@cgmillette I tried rocm/pytorch:latest and am getting the following, which looks much better. I guess the issue is with my docker container, even though I have no idea why the image I built is having problems. I'll keep investigating. Thank you for your help.
```
./perf_hgemm
Initializing host data...
Initializing device data...
Launching GEMM kernel...
gridDim (56 56) blockdim (128 2)
TBlockX, TBlockY, BlocksX, BlocksY, BlkM, BlkN, BlkK, MatM, MatN, MatK, alpha, lda, ldb, beta, ldc, ldd, elapsedMs, Problem Size(GFlops), TFlops/s
128, 2, 2, 2, 32, 32, 16, 7168, 7168, 7168, 2, 7168, 7168, 2, 7168, 7168, 39.0735, 736.587, 94.2566
Finished!
```
Right on! Most welcome, and take care
What is the expected performance of the MI100? I was expecting a much higher number, since its theoretical performance is more than 180 TF. I was getting higher numbers when testing the 7900XTX, even though it has a lower theoretical peak performance!