baidu-research / DeepBench

Benchmarking Deep Learning operations on different hardware
Apache License 2.0

Problem with AMD benchmark #105

Open computingdolas opened 6 years ago

computingdolas commented 6 years ago

I am getting about 1/10 of the FLOPS on the AMD Vega architecture compared to the numbers in the results folder. Does anybody know why?

sunway513 commented 6 years ago

Hi @computingdolas, could you provide the versions in your ROCm software environment?

computingdolas commented 6 years ago

Thanks @sunway513 for your response. Here is what rocminfo says 👍

===================== HSA System Attributes

Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE
System Endianness:       LITTLE

sunway513 commented 6 years ago

Thanks @computingdolas, can you share more information? For example, the output of:

apt list --installed | grep rocm

Or the following on CentOS:

rpm -qa | grep rocm

computingdolas commented 6 years ago

Hey @sunway513

See this :)

rocm-clang-ocl-0.3.0_7997136-1.x86_64
rocm-device-libs-0.0.1-1.x86_64
rocm-dev-1.8.192-1.x86_64
rocminfo-1.0.0-1.x86_64
rocm-amdgpu-pro-icd-17.50-552542.el7.x86_64
rocm-opencl-1.2.0-2018071635.x86_64
rocm-libs-1.8.192-1.x86_64
rocm-smi-1.0.0_46_g81ef66f-1.x86_64
rocm-amdgpu-pro-17.50-552542.el7.x86_64
rocm-utils-1.8.192-1.x86_64
rocm-profiler-5.4.6878-g15f6673.x86_64
rocm-opencl-devel-1.2.0-2018071635.x86_64
rocm-dkms-1.8.192-1.x86_64
rocm-amdgpu-pro-opencl-17.50-552542.el7.x86_64
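When comparing environments, it can help to pull the version out of each package string instead of reading the list by eye. The following is a minimal sketch, assuming the usual `name-version-release.arch` layout of `rpm -qa` output; `parse_rpm_name` is a hypothetical helper written for illustration, not part of ROCm or rpm.

```python
def parse_rpm_name(package: str) -> tuple[str, str, str, str]:
    """Split an 'rpm -qa' entry into (name, version, release, arch).

    Assumes the conventional name-version-release.arch layout; names may
    themselves contain dashes, so the string is split from the right.
    """
    nvr, arch = package.rsplit(".", 1)           # strip the trailing architecture
    name, version, release = nvr.rsplit("-", 2)  # last two dashes delimit version and release
    return name, version, release, arch


# A few entries copied from the package list above.
packages = [
    "rocm-dev-1.8.192-1.x86_64",
    "rocm-libs-1.8.192-1.x86_64",
    "rocm-dkms-1.8.192-1.x86_64",
    "rocm-opencl-1.2.0-2018071635.x86_64",
]

versions = {name: version for name, version, _, _ in map(parse_rpm_name, packages)}
for name, version in sorted(versions.items()):
    print(f"{name}: {version}")
```

Splitting from the right keeps package names such as `rocm-amdgpu-pro-icd` intact even though they contain several dashes.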

sunway513 commented 6 years ago

Great, so you are on the latest ROCm, thanks :-) I'll try to reproduce your result and update here.

computingdolas commented 6 years ago

Here are my results for the gemm benchmark in the code/amd folder. The FLOP count of a matrix multiply is approximately 2*m*n*k, and TFLOPS = (flops / time) / 10^12. For the first case I am getting around 0.15 TFLOPS, but according to the results folder I should get 1.5 TFLOPS. Please find the data below 👍

m     n      k       a_t  b_t  time (usec)
1760  16     1760    n    n    644
1760  32     1760    n    n    657
1760  64     1760    n    n    690
1760  128    1760    n    n    692
1760  7000   1760    n    n    8626
2048  16     2048    n    n    752
2048  32     2048    n    n    774
2048  64     2048    n    n    881
2048  128    2048    n    n    933
2048  7000   2048    n    n    11562
2560  16     2560    n    n    939
2560  32     2560    n    n    963
2560  64     2560    n    n    1043
2560  128    2560    n    n    1110
2560  7000   2560    n    n    17520
4096  16     4096    n    n    1523
4096  32     4096    n    n    1556
4096  64     4096    n    n    1860
4096  128    4096    n    n    2314
4096  7000   4096    n    n    48839
1760  16     1760    t    n    1104
1760  32     1760    t    n    1114
1760  64     1760    t    n    1173
1760  128    1760    t    n    1307
1760  7000   1760    t    n    10391
2048  16     2048    t    n    1375
2048  32     2048    t    n    1404
2048  64     2048    t    n    1889
2048  128    2048    t    n    2131
2048  7000   2048    t    n    14931
2560  16     2560    t    n    1853
2560  32     2560    t    n    1889
2560  64     2560    t    n    2146
2560  128    2560    t    n    2324
2560  7000   2560    t    n    21081
4096  16     4096    t    n    3368
4096  32     4096    t    n    3459
4096  64     4096    t    n    3660
4096  128    4096    t    n    12966
4096  7000   4096    t    n    57209
1760  7133   1760    n    t    7234
2048  7133   2048    n    t    8275
2560  7133   2560    n    t    13501
4096  7133   4096    n    t    36544
5124  9124   1760    n    n    32020
35    8457   1760    n    n    985
5124  9124   2048    n    n    41212
35    8457   2048    n    n    2658
5124  9124   2560    n    n    48522
35    8457   2560    n    n    1729
5124  9124   4096    n    n    82356
35    8457   4096    n    n    4522
5124  9124   1760    t    n    37142
35    8457   1760    t    n    1438
5124  9124   2048    t    n    46961
35    8457   2048    t    n    2351
5124  9124   2560    t    n    54639
35    8457   2560    t    n    2441
5124  9124   4096    t    n    92358
35    8457   4096    t    n    4228
7680  16     2560    n    n    989
7680  32     2560    n    n    977
7680  64     2560    n    n    1162
7680  128    2560    n    n    1337
7680  16     2560    t    n    2262
7680  32     2560    t    n    2257
7680  64     2560    t    n    3044
7680  128    2560    t    n    3402
3072  16     1024    n    n    389
3072  32     1024    n    n    399
3072  64     1024    n    n    496
3072  128    1024    n    n    586
3072  16     1024    t    n    882
3072  32     1024    t    n    902
3072  64     1024    t    n    1034
3072  128    1024    t    n    1527
3072  7435   1024    n    t    7455
7680  5481   2560    n    t    34993
512   8      500000  n    n    176088
1024  8      500000  n    n    175514
512   16     500000  n    n    178471
1024  16     500000  n    n    177038
512   8      500000  t    n    336153
1024  8      500000  t    n    337628
512   16     500000  t    n    337572
1024  16     500000  t    n    336461
1024  700    512     n    n    304
1024  700    512     t    n    446
7680  24000  2560    n    n    187187
6144  24000  2048    n    n    124284
4608  24000  1536    n    n    68394
8448  24000  2816    n    n    225464
3072  24000  1024    n    n    31407
7680  48000  2560    n    n    380244
6144  48000  2048    n    n    249364
4608  48000  1536    n    n    137419
8448  48000  2816    n    n    454826
3072  48000  1024    n    n    65120
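The TFLOPS figure quoted in this comment can be checked directly from the timings. Below is a minimal sketch, assuming the 2*m*n*k approximation for the FLOP count and times in microseconds as reported; `gemm_tflops` is a hypothetical helper for illustration, not a DeepBench function.

```python
def gemm_tflops(m: int, n: int, k: int, time_usec: float) -> float:
    """Approximate TFLOPS of one GEMM call: 2*m*n*k FLOPs over the elapsed time."""
    flops = 2 * m * n * k        # one multiply and one add per inner-product term
    seconds = time_usec * 1e-6   # the benchmark reports microseconds
    return flops / seconds / 1e12


# A few (m, n, k, time in usec) rows copied from the timings above.
rows = [
    (1760, 16, 1760, 644),
    (1760, 7000, 1760, 8626),
    (4096, 7000, 4096, 48839),
]

for m, n, k, t in rows:
    print(f"m={m} n={n} k={k}: {gemm_tflops(m, n, k, t):.2f} TFLOPS")
```

The first row works out to roughly 0.15 TFLOPS, which matches the number reported above, so the arithmetic is consistent and the gap to the reference results comes from the measured times themselves.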

Thank you for your support @sunway513

sunway513 commented 6 years ago

Hi @computingdolas, my numbers are very different from (much faster than) yours; please find them here: rocm-deepbench.log

To reproduce my numbers, please use the following commands to run the Docker image I've prepared:

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video -v $HOME/dockerx:/dockerx'
drun rocm/deepbench:latest

computingdolas commented 6 years ago

Hey @sunway513, thank you for your response. Those are nice numbers for AMD GPUs. Any idea why I am getting this issue?

computingdolas commented 6 years ago

Just to confirm: are you still using an AMD Vega gfx900? I am using an AMD Pro SSG.

computingdolas commented 6 years ago

Is it a driver problem? I am really confused now. I am looking for a non-Docker solution, and I want to know why these numbers are so bad.

sunway513 commented 6 years ago

Hi @computingdolas, my test GPU is an MI25, which is GFX900-based. Could you first try with Docker? If you still get suboptimal performance, that means your driver stack was not properly configured. If Docker boosts your performance, then it's a userland software issue, and we can take a further look from there.
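The triage suggested here can be written down as a small comparison of host and Docker timings for the same GEMM case. This is only a sketch of that reasoning: `likely_culprit` and the `speedup_threshold` of 2.0 are hypothetical choices for illustration, and the Docker time in the usage lines is a placeholder, not a measurement.

```python
def likely_culprit(host_usec: float, docker_usec: float,
                   speedup_threshold: float = 2.0) -> str:
    """Apply the host-vs-Docker triage to one benchmark case.

    The container ships its own userland libraries but shares the host's
    kernel driver, so a large speedup inside Docker points at the host's
    userland software; no meaningful speedup points at the driver stack.
    """
    if host_usec <= 0 or docker_usec <= 0:
        raise ValueError("timings must be positive")
    speedup = host_usec / docker_usec
    if speedup >= speedup_threshold:   # threshold is an assumed cut-off
        return "userland software"
    return "driver stack"


# 644 usec is the host timing reported earlier in this thread;
# 70 usec is a placeholder for a Docker run, not real data.
print(likely_culprit(host_usec=644, docker_usec=70))
```

Comparing several cases rather than a single one would make the result less sensitive to run-to-run noise.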

dagamayank commented 6 years ago

@computingdolas

> I am getting the 1/10 flops/s on the AMD Vega architecture as compared to one mentioned in the results folder.

Please also clarify which GPU you are actually using. I saw a reference to the AMD Pro SSG, and that is NOT one of the supported deep learning AMD GPUs.

computingdolas commented 6 years ago

Hi @sunway513, OK, let's try the Docker solution; I will update you on that :)

@dagamayank Hey, I am using the AMD SSG-PRO, which has the Vega 10 XT architecture. Are you saying we have good ROCm support for that GPU? I saw the white paper and the data sheet, and many references mention its capabilities for deep learning. Can you tell me more about this?

Thanks :)

computingdolas commented 6 years ago

Correction, @dagamayank: are you saying we don't have good ROCm support for this GPU?

computingdolas commented 6 years ago

@sunway513 Is it possible to give me remote access to your AMD MI25 GPU?

sunway513 commented 6 years ago

@computingdolas, I'm not able to provide public access to the MI25 node. However, you can alternatively try a third-party cloud service that offers VegaFE: https://www.gpueater.com/

computingdolas commented 6 years ago

Hey @sunway513, but https://www.gpueater.com/ doesn't have MI25 GPUs, although it does have the Vega Frontier Edition.

sunway513 commented 6 years ago

Yes, VegaFE should run ROCm fine, with performance similar to what I've provided in my log for the MI25.