NIIAS3050 commented 5 years ago

It would be very useful to compare real training performance on AMD and NVIDIA cards. For NVIDIA cards we have a lot of graphs and tests, for example: https://github.com/u39kun/deep-learning-benchmark But for AMD cards there are no performance metrics. It would be great to make a direct comparison between AMD and NVIDIA with the latest cuDNN.

ghostplant commented 5 years ago

@Mandrewoid I just wonder whether ROCm for AMD is ready for production use. If not, CUDA on NVIDIA might still be the more stable choice. However, ROCm has the advantage over CUDA of being open source. I just hope ROCm will be production-ready soon.

SandboChang commented 5 years ago

Adding my results from reddit https://www.reddit.com/r/Amd/comments/asdyon/radeon_vii_tensorflow_deep_learning_results_huge/

amirjaber commented 5 years ago

I ran the benchmarks in a CentOS 7 Docker container (rocm/tensorflow:rocm2.1-tf1.13-centos-dev) with an MI25 card but got worse results than others. With FP16 enabled the results are consistently even worse.

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Step Img/sec total_loss
1 images/sec: 210.7 +/- 0.0 (jitter = 0.0) 8.221
10 images/sec: 209.5 +/- 0.6 (jitter = 0.9) 8.285
20 images/sec: 209.1 +/- 0.5 (jitter = 1.2) 8.061
30 images/sec: 208.9 +/- 0.4 (jitter = 1.3) 8.312
40 images/sec: 208.1 +/- 0.7 (jitter = 1.1) 8.178
50 images/sec: 208.3 +/- 0.6 (jitter = 1.1) 8.243
60 images/sec: 208.3 +/- 0.5 (jitter = 1.1) 8.186
70 images/sec: 208.2 +/- 0.5 (jitter = 1.2) 8.156
80 images/sec: 207.9 +/- 0.4 (jitter = 1.5) 8.143
90 images/sec: 207.4 +/- 0.4 (jitter = 2.0) 8.215
100 images/sec: 206.7 +/- 0.4 (jitter = 2.7) 8.155

total images/sec: 206.62

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Step Img/sec total_loss
1 images/sec: 193.8 +/- 0.0 (jitter = 0.0) 8.209
10 images/sec: 192.0 +/- 0.6 (jitter = 1.0) 8.183
20 images/sec: 191.8 +/- 0.5 (jitter = 0.8) 8.318
30 images/sec: 191.5 +/- 0.4 (jitter = 1.1) 8.197
40 images/sec: 191.4 +/- 0.4 (jitter = 0.9) 8.150
50 images/sec: 191.3 +/- 0.4 (jitter = 1.1) 8.391
60 images/sec: 191.0 +/- 0.4 (jitter = 1.3) 8.272
70 images/sec: 190.9 +/- 0.4 (jitter = 1.3) 8.148
80 images/sec: 190.7 +/- 0.4 (jitter = 1.2) 8.290
90 images/sec: 190.7 +/- 0.4 (jitter = 1.2) 8.330
100 images/sec: 190.8 +/- 0.3 (jitter = 1.0) 8.210

total images/sec: 190.71
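To put a number on the FP16 regression reported above, here is a minimal sketch that computes the relative change between the two totals. The function name `percent_change` is made up for illustration; the two throughput figures are copied from the logs above.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Totals (images/sec) from the MI25 runs above: resnet50, batch_size=128.
fp32_total = 206.62  # without --use_fp16
fp16_total = 190.71  # with --use_fp16

# A negative value means FP16 was slower than FP32 on this setup.
print(f"FP16 vs FP32: {percent_change(fp32_total, fp16_total):+.1f}%")
```

On these two totals the change comes out at roughly -7.7%, i.e. FP16 is slower rather than faster in this particular container and card combination.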

hyc3z commented 5 years ago

Here are the benchmark results I get with a Vega 56 on Ubuntu 19, using amdkfd, just for comparison:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50

Step   Img/sec total_loss
1   images/sec: 168.3 +/- 0.0 (jitter = 0.0)    8.166
10  images/sec: 165.3 +/- 1.4 (jitter = 0.7)    8.217
20  images/sec: 164.8 +/- 0.9 (jitter = 1.8)    8.361
30  images/sec: 163.9 +/- 0.7 (jitter = 5.6)    8.255
40  images/sec: 164.2 +/- 0.6 (jitter = 4.3)    8.125
50  images/sec: 165.0 +/- 0.5 (jitter = 2.4)    8.173
60  images/sec: 165.5 +/- 0.5 (jitter = 1.0)    8.326
70  images/sec: 165.1 +/- 0.7 (jitter = 0.7)    8.359
80  images/sec: 165.5 +/- 0.6 (jitter = 0.5)    8.038
90  images/sec: 164.9 +/- 0.7 (jitter = 0.5)    8.292
100 images/sec: 165.2 +/- 0.6 (jitter = 0.5)    8.381
----------------------------------------------------------------
total images/sec: 165.18

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=alexnet

Step   Img/sec total_loss
1   images/sec: 1141.2 +/- 0.0 (jitter = 0.0)   7.199
10  images/sec: 1147.5 +/- 1.4 (jitter = 4.2)   7.199
20  images/sec: 1152.0 +/- 1.6 (jitter = 7.1)   7.199
30  images/sec: 1152.8 +/- 1.3 (jitter = 8.0)   7.199
40  images/sec: 1153.0 +/- 1.0 (jitter = 6.2)   7.200
50  images/sec: 1153.7 +/- 1.0 (jitter = 6.9)   7.199
60  images/sec: 1153.9 +/- 0.9 (jitter = 7.5)   7.200
70  images/sec: 1152.5 +/- 1.0 (jitter = 7.8)   7.199
80  images/sec: 1152.4 +/- 0.9 (jitter = 7.8)   7.199
90  images/sec: 1152.9 +/- 0.9 (jitter = 6.6)   7.199
100 images/sec: 1153.2 +/- 0.8 (jitter = 6.5)   7.199
----------------------------------------------------------------
total images/sec: 1153.00
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=inception3

Step   Img/sec total_loss
1   images/sec: 79.6 +/- 0.0 (jitter = 0.0) 7.357
10  images/sec: 79.4 +/- 0.2 (jitter = 0.1) 7.438
20  images/sec: 79.3 +/- 0.2 (jitter = 0.2) 7.322
30  images/sec: 78.7 +/- 0.3 (jitter = 0.3) 7.480
40  images/sec: 78.9 +/- 0.2 (jitter = 0.2) 7.369
50  images/sec: 78.9 +/- 0.2 (jitter = 0.2) 7.347
60  images/sec: 78.9 +/- 0.2 (jitter = 0.2) 7.420
70  images/sec: 78.9 +/- 0.2 (jitter = 0.2) 7.306
80  images/sec: 78.9 +/- 0.1 (jitter = 0.2) 7.370
90  images/sec: 79.0 +/- 0.1 (jitter = 0.2) 7.506
100 images/sec: 78.9 +/- 0.1 (jitter = 0.2) 7.405
----------------------------------------------------------------
total images/sec: 78.92

The GPU I use is an XFX Vega 56 Nano, which has a TDP limit of 150 W. If I enable power overdrive, the VRM overheats. So the GPU is running at 1530 MHz, with HBM at 800 MHz.
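Since everyone in this thread is pasting the same log format, a small parser makes the numbers easier to compare. The following is only a sketch that assumes the line layout seen in the logs above (`<step> images/sec: <rate> +/- <unc> (jitter = <j>) <loss>` and `total images/sec: <total>`); the names `Step`, `parse_log` and `SAMPLE` are hypothetical and not part of tf_cnn_benchmarks.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Step(NamedTuple):
    step: int
    images_per_sec: float
    uncertainty: float
    jitter: float
    total_loss: float  # float("nan") if the log shows "nan"


# Patterns derived from the log lines pasted in this thread (an assumption,
# not an official output specification).
STEP_RE = re.compile(
    r"^\s*(\d+)\s+images/sec:\s+([\d.]+)\s+\+/-\s+([\d.]+)\s+"
    r"\(jitter = ([\d.]+)\)\s+(\S+)\s*$"
)
TOTAL_RE = re.compile(r"^\s*total images/sec:\s+([\d.]+)\s*$")


def parse_log(text: str) -> Tuple[List[Step], Optional[float]]:
    """Return the per-step rows and the final total (None if absent)."""
    steps: List[Step] = []
    total: Optional[float] = None
    for line in text.splitlines():
        m = STEP_RE.match(line)
        if m:
            steps.append(
                Step(int(m[1]), float(m[2]), float(m[3]), float(m[4]), float(m[5]))
            )
            continue
        m = TOTAL_RE.match(line)
        if m:
            total = float(m[1])
    return steps, total


# Three lines copied from the Vega 56 resnet50 log above.
SAMPLE = """Step   Img/sec total_loss
1   images/sec: 168.3 +/- 0.0 (jitter = 0.0)    8.166
100 images/sec: 165.2 +/- 0.6 (jitter = 0.5)    8.381
----------------------------------------------------------------
total images/sec: 165.18
"""

steps, total = parse_log(SAMPLE)
print(len(steps), "step rows, total =", total)
```

Header and separator lines simply fail both patterns and are skipped, so a whole terminal paste can be fed in unchanged.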

ghostplant commented 5 years ago

@sebpuetz I am also trying a Radeon VII. I use Ubuntu 16.04 with Linux kernel 5.0 and I also got similarly bad performance (190 images/sec for resnet50 with batch = 64 and dtype = float32). Was it solved by using a fresh Ubuntu 18.04? As far as I know, Ubuntu 18.04 ships with Linux kernel 4.15, which does not include the driver for Radeon VII, so did you further upgrade the kernel version on your Ubuntu 18.04 system?

sebpuetz commented 5 years ago

Ubuntu 18.04 ships with 4.18 iirc. Anyways, ROCm 2.2 is broken on 4.18 and is supposed to be fixed by 2.3. I was able to install it on 4.15. If you downgrade to 4.15 you should be getting the expected (fast) performance.

I'm currently on kernel 5.05 since the decrease in performance doesn't really matter to me as I can't properly use my GPU for my current project as per #325. Fwiw, all of the upstream kernels with the built in AMD driver suffer from the performance drop (starting with 4.20, I believe).
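A quick way to check which side of that kernel boundary a machine is on is to parse the release string. This is a rough sketch: the 4.20 threshold is only the "I believe" figure from the comment above, not a confirmed cutoff, and the function names are invented for illustration.

```python
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string such as '4.15.0-47-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def may_hit_reported_slowdown(release: str, threshold: Tuple[int, int] = (4, 20)) -> bool:
    """True if the kernel is at or above the threshold mentioned in this thread.

    The default threshold is an assumption taken from the comment above.
    """
    return kernel_version(release) >= threshold


for release in ("4.15.0-47-generic", "4.18.0", "5.0.5"):
    print(release, "->", may_hit_reported_slowdown(release))
```

On a live system the release string can be obtained with `platform.release()` from the Python standard library.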

sunway513 commented 5 years ago

@ghostplant, please refer to the following document on upstream kernel support: https://github.com/RadeonOpenCompute/ROCm#rocm-support-in-upstream-linux-kernels The upstream kernel is NOT tested by AMD to the same level as the rock-dkms package and doesn't include the most up-to-date firmware. Out-of-date firmware can slow down Radeon VII performance.

If you can choose, please go with Ubuntu16.04 or Ubuntu18.04.1 with 4.15 kernel and install the rock-dkms package from ROCm2.2 as your bare metal configuration.

ghostplant commented 5 years ago

@sunway513 I have tested Ubuntu 16.04 + kernel 4.15 + rock-dkms, and ROCm performance there is really bad, even worse than a Vega 64.

sunway513 commented 5 years ago

@ghostplant can you create a new issue? Let's review the environment.

ghostplant commented 5 years ago

@sunway513 Not needed any more. I kept Ubuntu 16.04 + Linux 4.15 + rock-dkms and didn't reinstall Ubuntu 18.04. The ResNet50 performance improved to 268 images/sec for float32 and batch_size = 64, just by running an Ubuntu 18.04-based Docker image with ROCm and TensorFlow installed.

ghostplant commented 5 years ago

@sebpuetz Are you using Linux Mint? I looked into the official Ubuntu 18.04 and it provides Linux kernel 4.15 by default.

sebpuetz commented 5 years ago

@ghostplant, I had been initially. But there were some problems, so I switched to Ubuntu 18.04.

ghostplant commented 5 years ago

@sebpuetz @Hycdog @jimdowling @sunway513

We have mostly paid attention to the ResNet50 model, which is really fast with ROCm. However, other typical CNN models like Inception3/AlexNet/.. are not that impressive, and many of them are even worse than on an NVIDIA 1080. What are your thoughts on optimizing these models?

hongshaojichi commented 5 years ago

Just updated to ROCm 2.3. Looks like the tf_cnn_benchmark tends to hang on "running warm up". Anyone else having the same issue? I have a vega 64, running ubuntu 18.04 LTS with 4.18 kernel.

sunway513 commented 5 years ago

@hongshaojichi can you create a new issue with more details? Let's track it there.

sunway513 commented 5 years ago

ROCm 2.3 is out, and performance has further improved for the majority of the TensorFlow CNN benchmarks. Here are the instructions to upgrade:

Bare metal setup

Uninstall the current ROCm packages:
sudo apt autoremove rocm-dkms rocm-dev rocm-utils rocm-smi rock-dkms
Update and install the new packages, then reboot to load the new kernel:
sudo apt update && sudo apt install -y rocm-dkms rocm-libs miopen-hip cxlactivitylogger
sudo reboot
Update the TensorFlow whl package:
pip3 install --user tensorflow-rocm --upgrade
Clean your MIOpen cache:
rm -rf ~/.cache && rm -rf ~/.config
Apply the performance database update for the Radeon VII 60CU card:
cd ~/ && mkdir -p .config/miopen && cd .config/miopen && wget https://www.dropbox.com/s/yd9v7jtc9aydnfy/gfx906_60.cd.updb.txt && cd ~

Docker setup

To use the Docker container option, rock-dkms needs to be updated first:
sudo apt update && sudo apt install rock-dkms && sudo reboot
Then pull the latest ROCm 2.3-based Docker container:
sudo docker pull rocm/tensorflow:rocm2.3-tf1.13-python3

For new deployments, please refer to our official doc: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-install-basic.md Please try out ROCm 2.3 and let us know your feedback :-)

sebpuetz commented 5 years ago

Following the instructions in your reply to this issue I upgraded to ROCm 2.3 and ran some of the benchmarks inside the rocm/tensorflow:rocm2.3-tf1.13-python3 container. TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=alexnet and TF_ROCM_FUSION_ENABLE=0 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=alexnet both result in NAN losses.

Step    Img/sec total_loss
1   images/sec: 2022.2 +/- 0.0 (jitter = 0.0)   nan
10  images/sec: 2018.2 +/- 7.0 (jitter = 8.5)   nan
20  images/sec: 2026.2 +/- 4.0 (jitter = 7.4)   nan
30  images/sec: 2025.9 +/- 2.8 (jitter = 8.1)   nan
40  images/sec: 2025.4 +/- 2.3 (jitter = 9.0)   nan
50  images/sec: 2025.4 +/- 1.9 (jitter = 8.8)   nan
60  images/sec: 2026.4 +/- 1.6 (jitter = 9.0)   nan
70  images/sec: 2026.4 +/- 1.4 (jitter = 9.5)   nan
80  images/sec: 2026.1 +/- 1.3 (jitter = 9.6)   nan
90  images/sec: 2025.7 +/- 1.2 (jitter = 9.1)   nan
100 images/sec: 2025.7 +/- 1.1 (jitter = 8.3)   nan
----------------------------------------------------------------
total images/sec: 2025.20
----------------------------------------------------------------

Other benchmarks perform slightly better than before:

Step    Img/sec total_loss
1   images/sec: 309.6 +/- 0.0 (jitter = 0.0)    7.972
10  images/sec: 307.0 +/- 1.1 (jitter = 2.1)    7.856
20  images/sec: 308.6 +/- 0.6 (jitter = 0.4)    7.914
30  images/sec: 309.0 +/- 0.4 (jitter = 0.3)    7.733
40  images/sec: 309.2 +/- 0.3 (jitter = 0.3)    7.969
50  images/sec: 309.3 +/- 0.3 (jitter = 0.4)    8.027
60  images/sec: 309.1 +/- 0.2 (jitter = 0.5)    7.890
70  images/sec: 309.1 +/- 0.2 (jitter = 0.6)    7.983
80  images/sec: 309.0 +/- 0.2 (jitter = 0.7)    7.814
90  images/sec: 309.0 +/- 0.2 (jitter = 0.8)    7.778
100 images/sec: 308.7 +/- 0.2 (jitter = 0.8)    7.806
----------------------------------------------------------------
total images/sec: 308.65
----------------------------------------------------------------

sunway513 commented 5 years ago

@sebpuetz , thank you for trying it out! The alexnet nan loss is a known issue due to tf_cnn_benchmark changes, you can change to cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks/tree/cnn_tf_v1.12_compatible branch in ~/benchmarks to work around it; we have seen the same behavior using the upstream tf1.13.1 docker image.

Could you list the complete command for the second log? I suppose that's resnet50 FP32, correct?

sebpuetz commented 5 years ago

> @sebpuetz , thank you for trying it out! The alexnet nan loss is a known issue due to tf_cnn_benchmark changes, you can change to cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks/tree/cnn_tf_v1.12_compatible branch in ~/benchmarks to work around it; we have seen the same behavior using the upstream tf1.13.1 docker image.

No more NANs on that branch:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=alexnet

Step    Img/sec total_loss
1   images/sec: 1949.8 +/- 0.0 (jitter = 0.0)   7.199
10  images/sec: 1948.5 +/- 1.7 (jitter = 4.9)   7.200
20  images/sec: 1953.4 +/- 2.4 (jitter = 11.0)  7.199
30  images/sec: 1955.9 +/- 2.0 (jitter = 8.9)   7.199
40  images/sec: 1956.4 +/- 1.7 (jitter = 8.1)   7.198
50  images/sec: 1956.4 +/- 1.4 (jitter = 7.9)   7.199
60  images/sec: 1957.0 +/- 1.2 (jitter = 7.8)   7.200
70  images/sec: 1958.1 +/- 1.2 (jitter = 8.9)   7.199
80  images/sec: 1959.4 +/- 1.1 (jitter = 9.5)   7.199
90  images/sec: 1959.7 +/- 1.0 (jitter = 9.3)   7.199
100 images/sec: 1960.9 +/- 1.0 (jitter = 9.4)   7.199
----------------------------------------------------------------
total images/sec: 1960.43
----------------------------------------------------------------

> Could you list the complete command for the second log? I suppose that's resnet50 FP32, correct?

Indeed, forgot to copy the terminal input:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

sunway513 commented 5 years ago

@sebpuetz thank you, and glad you can get 5% perf improvement on resnet50 FP32.

ghostplant commented 5 years ago

For alexnet, manually upgrading rocblas from 2.2.0 to 2.3.4 also gives around a 20% performance improvement. However, there is still a large gap compared with the same model on CUDA GPUs of similar TFLOPS.

WannaBeOCer commented 5 years ago

Radeon VII at stock using 18.04 w/ ROCm 2.3. Around a 28% improvement from 2.2

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16

Done warm up
Step    Img/sec total_loss
1       images/sec: 344.8 +/- 0.0 (jitter = 0.0)        8.123
10      images/sec: 345.6 +/- 0.3 (jitter = 0.7)        7.752
20      images/sec: 345.3 +/- 0.3 (jitter = 0.7)        7.913
30      images/sec: 345.3 +/- 0.3 (jitter = 0.5)        7.785
40      images/sec: 345.3 +/- 0.2 (jitter = 0.6)        7.917
50      images/sec: 345.4 +/- 0.2 (jitter = 0.6)        7.874
60      images/sec: 345.4 +/- 0.2 (jitter = 0.6)        7.720
70      images/sec: 345.4 +/- 0.1 (jitter = 0.6)        8.016
80      images/sec: 345.5 +/- 0.1 (jitter = 0.6)        7.773
90      images/sec: 345.6 +/- 0.1 (jitter = 0.6)        7.800
100     images/sec: 345.6 +/- 0.1 (jitter = 0.6)        8.027
----------------------------------------------------------------
total images/sec: 345.29
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16

Done warm up
Step    Img/sec total_loss
1       images/sec: 363.3 +/- 0.0 (jitter = 0.0)        8.117
10      images/sec: 363.8 +/- 0.3 (jitter = 0.7)        7.754
20      images/sec: 363.8 +/- 0.4 (jitter = 1.2)        7.906
30      images/sec: 363.7 +/- 0.3 (jitter = 0.8)        7.780
40      images/sec: 363.8 +/- 0.2 (jitter = 0.9)        7.919
50      images/sec: 363.8 +/- 0.2 (jitter = 0.9)        7.889
60      images/sec: 363.7 +/- 0.2 (jitter = 0.9)        7.726
70      images/sec: 363.7 +/- 0.2 (jitter = 0.8)        8.015
80      images/sec: 363.4 +/- 0.2 (jitter = 1.0)        7.772
90      images/sec: 363.3 +/- 0.2 (jitter = 1.1)        7.816
100     images/sec: 363.4 +/- 0.2 (jitter = 1.0)        8.028
----------------------------------------------------------------
total images/sec: 363.13
----------------------------------------------------------------
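To keep fusion on/off comparisons like the two runs above consistent, the command line and environment can be built programmatically. This is a sketch only: `benchmark_invocation` is a hypothetical helper, it uses just the flags that already appear in this thread, and it assumes that `TF_ROCM_FUSION_ENABLE=0` behaves the same as leaving the variable unset.

```python
import os
from typing import Dict, List, Tuple


def benchmark_invocation(
    model: str,
    batch_size: int,
    fusion: bool,
    use_fp16: bool = False,
    script: str = "tf_cnn_benchmarks.py",
) -> Tuple[List[str], Dict[str, str]]:
    """Build (argv, env) for one benchmark run without executing it."""
    argv = [
        "python3",
        script,
        "--num_gpus=1",
        f"--batch_size={batch_size}",
        f"--model={model}",
    ]
    if use_fp16:
        argv.append("--use_fp16")
    # Copy the current environment and set only the fusion switch.
    env = dict(os.environ, TF_ROCM_FUSION_ENABLE="1" if fusion else "0")
    return argv, env


for fusion in (False, True):
    argv, env = benchmark_invocation("resnet50", 64, fusion, use_fp16=True)
    print(f"TF_ROCM_FUSION_ENABLE={env['TF_ROCM_FUSION_ENABLE']}", " ".join(argv))
```

The returned pair can be passed to `subprocess.run(argv, env=env, check=True)` from the benchmark directory to actually launch a run.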

sunway513 commented 5 years ago

Hi @WannaBeOCer, thank you for posting the numbers. However, they are a bit lower than I'd expected. Could you run the benchmark again after applying the following commands? It would also be helpful if you could provide the performance numbers with batch size 128.

Clean your MIOpen cache:
rm -rf ~/.cache && rm -rf ~/.config
Apply the updated performance database for Radeon VII:
cd ~/ && mkdir -p .config/miopen && cd .config/miopen && wget https://www.dropbox.com/s/yd9v7jtc9aydnfy/gfx906_60.cd.updb.txt && cd ~

WannaBeOCer commented 5 years ago

@sunway513 Thanks for the update, I applied the changes and I do see a performance uplift.

Batch size of 64 without Fusion:

Done warm up
Step    Img/sec total_loss
1       images/sec: 364.2 +/- 0.0 (jitter = 0.0)        8.119
10      images/sec: 365.6 +/- 0.3 (jitter = 0.8)        7.747
20      images/sec: 365.6 +/- 0.2 (jitter = 0.9)        7.912
30      images/sec: 365.6 +/- 0.1 (jitter = 0.5)        7.791
40      images/sec: 364.8 +/- 0.5 (jitter = 0.7)        7.926
50      images/sec: 365.0 +/- 0.4 (jitter = 0.6)        7.891
60      images/sec: 364.9 +/- 0.4 (jitter = 0.7)        7.703
70      images/sec: 364.9 +/- 0.3 (jitter = 0.7)        7.995
80      images/sec: 364.9 +/- 0.3 (jitter = 0.8)        7.771
90      images/sec: 364.9 +/- 0.2 (jitter = 0.8)        7.819
100     images/sec: 365.0 +/- 0.2 (jitter = 0.8)        8.027
----------------------------------------------------------------
total images/sec: 364.68
----------------------------------------------------------------

Batch size of 128 without Fusion:

Before

Done warm up
Step Img/sec total_loss
1 images/sec: 382.1 +/- 0.0 (jitter = 0.0) 7.876
10 images/sec: 381.4 +/- 0.4 (jitter = 1.1) 7.951
20 images/sec: 381.6 +/- 0.3 (jitter = 0.7) 7.950
30 images/sec: 381.8 +/- 0.2 (jitter = 0.8) 7.942
40 images/sec: 381.7 +/- 0.2 (jitter = 0.7) 7.960
50 images/sec: 381.7 +/- 0.1 (jitter = 0.7) 7.709
60 images/sec: 381.7 +/- 0.1 (jitter = 0.7) 7.914
70 images/sec: 381.7 +/- 0.1 (jitter = 0.7) 7.834
80 images/sec: 381.8 +/- 0.1 (jitter = 0.7) 7.966
90 images/sec: 381.7 +/- 0.1 (jitter = 0.7) 7.803
100 images/sec: 381.6 +/- 0.1 (jitter = 0.9) 7.756
----------------------------------------------------------------
total images/sec: 381.48
----------------------------------------------------------------

After

Done warm up
Step    Img/sec total_loss
1       images/sec: 399.5 +/- 0.0 (jitter = 0.0)        7.875
10      images/sec: 399.9 +/- 0.1 (jitter = 0.5)        7.956
20      images/sec: 399.8 +/- 0.3 (jitter = 0.5)        7.954
30      images/sec: 399.9 +/- 0.2 (jitter = 0.5)        7.939
40      images/sec: 399.9 +/- 0.2 (jitter = 0.7)        7.950
50      images/sec: 399.8 +/- 0.1 (jitter = 0.6)        7.715
60      images/sec: 399.8 +/- 0.1 (jitter = 0.6)        7.920
70      images/sec: 399.8 +/- 0.1 (jitter = 0.7)        7.833
80      images/sec: 399.8 +/- 0.1 (jitter = 0.7)        7.992
90      images/sec: 399.7 +/- 0.1 (jitter = 0.6)        7.802
100     images/sec: 399.7 +/- 0.1 (jitter = 0.6)        7.784
----------------------------------------------------------------
total images/sec: 399.58
----------------------------------------------------------------

With Fusion:

Done warm up
Step    Img/sec total_loss
1       images/sec: 421.5 +/- 0.0 (jitter = 0.0)        7.878
10      images/sec: 422.0 +/- 0.2 (jitter = 0.6)        7.957
20      images/sec: 421.9 +/- 0.1 (jitter = 0.6)        7.952
30      images/sec: 421.8 +/- 0.1 (jitter = 0.6)        7.946
40      images/sec: 421.5 +/- 0.2 (jitter = 0.6)        7.966
50      images/sec: 421.5 +/- 0.1 (jitter = 0.6)        7.708
60      images/sec: 421.6 +/- 0.2 (jitter = 0.6)        7.910
70      images/sec: 421.7 +/- 0.1 (jitter = 0.7)        7.839
80      images/sec: 421.8 +/- 0.1 (jitter = 0.7)        7.960
90      images/sec: 421.7 +/- 0.1 (jitter = 0.7)        7.801
100     images/sec: 421.6 +/- 0.1 (jitter = 0.7)        7.769
----------------------------------------------------------------
total images/sec: 421.41
----------------------------------------------------------------
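For readers who think in step time rather than throughput, the totals above convert directly. A small sketch follows; the function names are invented, and the dataset size in the second call (1,281,167 images, the ImageNet-1k training set) is an assumption used purely for scale, since the thread does not say which input data these runs used.

```python
def seconds_per_step(batch_size: int, images_per_sec: float) -> float:
    """Wall-clock time of one training step at the given throughput."""
    return batch_size / images_per_sec


def seconds_for_images(n_images: int, images_per_sec: float) -> float:
    """Time to push n_images through the model at a steady throughput."""
    return n_images / images_per_sec


# Batch size 128 FP16 totals from the logs above (images/sec).
for label, total in (("no fusion", 399.58), ("fusion", 421.41)):
    print(f"{label}: {seconds_per_step(128, total) * 1000:.0f} ms per step")

# Assumed dataset size, for illustration only.
IMAGENET_1K_TRAIN_IMAGES = 1_281_167
epoch_s = seconds_for_images(IMAGENET_1K_TRAIN_IMAGES, 421.41)
print(f"one pass over the assumed dataset: {epoch_s / 60:.1f} min")
```

This ignores input-pipeline and evaluation overhead, so real epoch times would be longer.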

sunway513 commented 5 years ago

Thank you @WannaBeOCer, would you mind posting the FP32 numbers with fusion enabled?

WannaBeOCer commented 5 years ago

@sunway513 Here are the numbers on FP32 with Fusion enabled.

Before

Done warm up
Step    Img/sec total_loss
1       images/sec: 277.0 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 277.7 +/- 0.1 (jitter = 0.2)        7.856
20      images/sec: 277.8 +/- 0.1 (jitter = 0.4)        7.913
30      images/sec: 277.9 +/- 0.1 (jitter = 0.4)        7.731
40      images/sec: 278.0 +/- 0.1 (jitter = 0.5)        7.971
50      images/sec: 278.0 +/- 0.1 (jitter = 0.5)        8.027
60      images/sec: 278.0 +/- 0.0 (jitter = 0.4)        7.890
70      images/sec: 277.9 +/- 0.0 (jitter = 0.4)        7.983
80      images/sec: 277.9 +/- 0.0 (jitter = 0.5)        7.799
90      images/sec: 277.9 +/- 0.0 (jitter = 0.5)        7.792
100     images/sec: 277.9 +/- 0.0 (jitter = 0.5)        7.810
----------------------------------------------------------------
total images/sec: 277.79
----------------------------------------------------------------

After

Done warm up
Step    Img/sec total_loss
1       images/sec: 299.5 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 300.4 +/- 0.1 (jitter = 0.1)        7.856
20      images/sec: 300.5 +/- 0.1 (jitter = 0.2)        7.914
30      images/sec: 300.6 +/- 0.1 (jitter = 0.3)        7.732
40      images/sec: 300.6 +/- 0.1 (jitter = 0.3)        7.972
50      images/sec: 300.7 +/- 0.0 (jitter = 0.3)        8.026
60      images/sec: 300.6 +/- 0.1 (jitter = 0.3)        7.894
70      images/sec: 300.6 +/- 0.0 (jitter = 0.3)        7.992
80      images/sec: 300.6 +/- 0.1 (jitter = 0.3)        7.803
90      images/sec: 300.6 +/- 0.0 (jitter = 0.3)        7.786
100     images/sec: 300.6 +/- 0.0 (jitter = 0.3)        7.795
----------------------------------------------------------------
total images/sec: 300.51
----------------------------------------------------------------

ghostplant commented 5 years ago

@WannaBeOCer I cannot reproduce this on Ubuntu 18.04 + kernel 4.15 + rock-dkms. Mine is 245 without fusion and 270 with fusion enabled (batch_size = 64).

WannaBeOCer commented 5 years ago

@ghostplant That's similar to my results when I was using 2.2. Did you upgrade to 2.3 and follow sunway513's comment to update the performance database?

@sunway513 Here are 128 batch size results with the power target at 300w with fusion enabled.

FP16

Done warm up
Step    Img/sec total_loss
1       images/sec: 436.0 +/- 0.0 (jitter = 0.0)        7.875
10      images/sec: 435.7 +/- 0.2 (jitter = 0.6)        7.952
20      images/sec: 435.7 +/- 0.2 (jitter = 0.9)        7.956
30      images/sec: 435.3 +/- 0.2 (jitter = 0.9)        7.947
40      images/sec: 435.2 +/- 0.2 (jitter = 1.0)        7.958
50      images/sec: 435.2 +/- 0.2 (jitter = 0.8)        7.709
60      images/sec: 435.2 +/- 0.2 (jitter = 0.9)        7.898
70      images/sec: 435.2 +/- 0.1 (jitter = 0.8)        7.846
80      images/sec: 435.1 +/- 0.1 (jitter = 0.9)        7.977
90      images/sec: 435.2 +/- 0.1 (jitter = 0.8)        7.801
100     images/sec: 435.1 +/- 0.1 (jitter = 0.9)        7.782
----------------------------------------------------------------
total images/sec: 434.93
----------------------------------------------------------------

FP32

Done warm up
Step    Img/sec total_loss
1       images/sec: 310.4 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 311.1 +/- 0.2 (jitter = 0.5)        7.856
20      images/sec: 310.8 +/- 0.1 (jitter = 0.5)        7.914
30      images/sec: 310.7 +/- 0.1 (jitter = 0.5)        7.734
40      images/sec: 310.7 +/- 0.1 (jitter = 0.4)        7.970
50      images/sec: 310.7 +/- 0.1 (jitter = 0.4)        8.025
60      images/sec: 310.7 +/- 0.1 (jitter = 0.4)        7.896
70      images/sec: 310.6 +/- 0.1 (jitter = 0.4)        7.986
80      images/sec: 310.6 +/- 0.1 (jitter = 0.4)        7.803
90      images/sec: 310.6 +/- 0.1 (jitter = 0.4)        7.799
100     images/sec: 310.5 +/- 0.0 (jitter = 0.4)        7.823
----------------------------------------------------------------
total images/sec: 310.44
----------------------------------------------------------------

ghostplant commented 5 years ago

@WannaBeOCer After upgrading to rocm-2.3, rock-drivers, and also related miopen-db, resnet50 improved to 260 (no fusion) and 282 (with fusion) respectively, still 10% slower.

Can you paste the properties below from your environment?

# Kernel version and patch
$ uname -a
Linux testing 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

# Tensorflow properties:
2019-04-21 06:05:47.487805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1531] Found device 0 with properties:
name: Device 66af
AMDGPU ISA: gfx906
memoryClockRate (GHz) 1.802
pciBusID 0000:04:00.0
Total memory: 15.98GiB
Free memory: 15.73GiB

WannaBeOCer commented 5 years ago

@ghostplant That result seems the same as mine if you're using FP32 with a batch size of 64.

python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50

Done warm up
Step    Img/sec total_loss
1       images/sec: 261.3 +/- 0.0 (jitter = 0.0)        8.220
10      images/sec: 263.4 +/- 0.5 (jitter = 0.9)        7.880
20      images/sec: 263.4 +/- 0.3 (jitter = 0.7)        7.910
30      images/sec: 263.5 +/- 0.2 (jitter = 0.6)        7.820
40      images/sec: 263.2 +/- 0.3 (jitter = 0.4)        8.004
50      images/sec: 263.2 +/- 0.2 (jitter = 0.4)        7.769
60      images/sec: 263.2 +/- 0.2 (jitter = 0.4)        8.112
70      images/sec: 263.3 +/- 0.2 (jitter = 0.5)        7.816
80      images/sec: 263.6 +/- 0.2 (jitter = 0.6)        7.977
90      images/sec: 263.7 +/- 0.2 (jitter = 0.7)        8.097
100     images/sec: 263.6 +/- 0.2 (jitter = 0.6)        8.039
----------------------------------------------------------------
total images/sec: 263.42
----------------------------------------------------------------

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50

Done warm up
Step    Img/sec total_loss
1       images/sec: 283.3 +/- 0.0 (jitter = 0.0)        8.220
10      images/sec: 283.7 +/- 0.1 (jitter = 0.4)        7.880
20      images/sec: 283.7 +/- 0.1 (jitter = 0.3)        7.910
30      images/sec: 283.7 +/- 0.1 (jitter = 0.3)        7.820
40      images/sec: 283.7 +/- 0.1 (jitter = 0.3)        8.003
50      images/sec: 283.7 +/- 0.1 (jitter = 0.3)        7.768
60      images/sec: 283.7 +/- 0.1 (jitter = 0.4)        8.112
70      images/sec: 283.7 +/- 0.1 (jitter = 0.4)        7.814
80      images/sec: 283.2 +/- 0.2 (jitter = 0.5)        7.981
90      images/sec: 282.8 +/- 0.2 (jitter = 0.6)        8.093
100     images/sec: 282.4 +/- 0.2 (jitter = 0.9)        8.035
----------------------------------------------------------------
total images/sec: 282.36
----------------------------------------------------------------

ghostplant commented 5 years ago

> @sunway513 Here are the numbers on FP32 with Fusion enabled.
> Before: total images/sec: 277.79
> After: total images/sec: 300.51

Is this config for batch_size = 128 and fp32 and fusion_enabled? If not, I think mine is 10% slower.

WannaBeOCer commented 5 years ago

@ghostplant That's correct, it's the result of batch_size = 128 and fp32 with fusion_enabled.

ghostplant commented 5 years ago

@WannaBeOCer OK~

sunway513 commented 5 years ago

> @sunway513 Here are 128 batch size results with the power target at 300w with fusion enabled.

Hi @WannaBeOCer, could you clarify whether you changed anything about the power limit? The Typical Board Power for Radeon VII is 300W by default.

WannaBeOCer commented 5 years ago

@sunway513 For all the other results I left it at the default power target, which is 250W. For the last result I provided, I changed the power target to 300W.

sunway513 commented 5 years ago

@WannaBeOCer, if I understand it correctly, for the perf numbers shown in this comment you have overdriven the GPU package power limit from the default 250W to 300W. That's an interesting experiment, thanks :-)
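As a rough way to read that experiment, the sketch below compares throughput at the two power targets using the totals posted earlier in this thread (resnet50, batch_size=128, fusion enabled). Note the assumption: the wattage is the configured power target, not measured draw, so the per-watt figure is only an upper-level indication. All names are hypothetical.

```python
from typing import Dict, List

# Totals in images/sec, keyed by power target in watts (values from this thread).
RUNS: Dict[str, Dict[int, float]] = {
    "FP16": {250: 421.41, 300: 434.93},
    "FP32": {250: 300.51, 300: 310.44},
}


def images_per_sec_per_watt(throughput: float, power_target_w: float) -> float:
    """Throughput divided by the configured power target (a cap, not measured draw)."""
    return throughput / power_target_w


def summarize(runs: Dict[str, Dict[int, float]]) -> List[dict]:
    rows = []
    for dtype, by_power in runs.items():
        base, raised = by_power[250], by_power[300]
        rows.append(
            {
                "dtype": dtype,
                "gain_pct": (raised / base - 1.0) * 100.0,
                "per_watt_250": images_per_sec_per_watt(base, 250),
                "per_watt_300": images_per_sec_per_watt(raised, 300),
            }
        )
    return rows


for row in summarize(RUNS):
    print(
        f"{row['dtype']}: +{row['gain_pct']:.1f}% throughput, "
        f"{row['per_watt_250']:.2f} -> {row['per_watt_300']:.2f} "
        "images/sec per watt of power target"
    )
```

With these figures, raising the target by 20% buys a little over 3% more throughput in both precisions.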

WannaBeOCer commented 5 years ago

@sunway513 That's correct. I wasn't sure whether the test system you were comparing the results with used a Radeon VII or an MI50. With the performance database applied, are my results in line with what you expected, or still lower?

sunway513 commented 5 years ago

@WannaBeOCer, I was just curious about your description of the 300W power target :-) I believe your current software configuration is in good shape, thanks for posting!

robzor92 commented 5 years ago

ROCm 2.2 vs 2.3 on Radeon Vega VII:

FP16

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

ROCm 2.2 output:

Step    Img/sec total_loss
1   images/sec: 388.3 +/- 0.0 (jitter = 0.0)    8.235
10  images/sec: 386.5 +/- 0.6 (jitter = 1.3)    8.250
20  images/sec: 386.6 +/- 0.3 (jitter = 1.3)    8.262
30  images/sec: 386.6 +/- 0.3 (jitter = 1.3)    8.371
40  images/sec: 386.4 +/- 0.2 (jitter = 1.2)    8.233
50  images/sec: 386.3 +/- 0.2 (jitter = 1.3)    8.311
60  images/sec: 386.7 +/- 0.2 (jitter = 1.6)    8.203
70  images/sec: 386.8 +/- 0.3 (jitter = 2.2)    8.111
80  images/sec: 386.7 +/- 0.3 (jitter = 2.2)    8.235
90  images/sec: 386.4 +/- 0.2 (jitter = 1.9)    8.168
100 images/sec: 386.3 +/- 0.2 (jitter = 1.8)    8.212
----------------------------------------------------------------
total images/sec: 386.14
----------------------------------------------------------------

ROCm 2.3 output:

Step    Img/sec total_loss
1   images/sec: 410.9 +/- 0.0 (jitter = 0.0)    8.214
10  images/sec: 410.2 +/- 0.8 (jitter = 3.0)    8.175
20  images/sec: 409.4 +/- 0.5 (jitter = 2.0)    8.327
30  images/sec: 409.4 +/- 0.4 (jitter = 2.1)    8.181
40  images/sec: 409.8 +/- 0.4 (jitter = 2.4)    8.156
50  images/sec: 410.0 +/- 0.4 (jitter = 2.8)    8.397
60  images/sec: 409.9 +/- 0.3 (jitter = 3.0)    8.266
70  images/sec: 409.9 +/- 0.3 (jitter = 3.0)    8.156
80  images/sec: 410.1 +/- 0.3 (jitter = 3.1)    8.271
90  images/sec: 409.8 +/- 0.3 (jitter = 3.1)    8.321
100 images/sec: 409.9 +/- 0.3 (jitter = 3.1)    8.203
----------------------------------------------------------------
total images/sec: 409.76
----------------------------------------------------------------

FP32

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

ROCm 2.2 output:

Step    Img/sec total_loss
1   images/sec: 274.8 +/- 0.0 (jitter = 0.0)    8.324
10  images/sec: 274.0 +/- 0.4 (jitter = 0.7)    8.165
20  images/sec: 273.5 +/- 0.3 (jitter = 1.5)    8.253
30  images/sec: 273.3 +/- 0.2 (jitter = 1.7)    8.347
40  images/sec: 273.1 +/- 0.2 (jitter = 1.6)    8.412
50  images/sec: 272.8 +/- 0.2 (jitter = 1.4)    8.149
60  images/sec: 272.5 +/- 0.2 (jitter = 2.0)    8.326
70  images/sec: 272.5 +/- 0.2 (jitter = 1.8)    8.122
80  images/sec: 272.3 +/- 0.2 (jitter = 1.5)    8.412
90  images/sec: 272.3 +/- 0.1 (jitter = 1.5)    8.275
100 images/sec: 272.2 +/- 0.1 (jitter = 1.4)    8.329
----------------------------------------------------------------
total images/sec: 272.16
----------------------------------------------------------------

ROCm 2.3 output:

Step    Img/sec total_loss
1   images/sec: 293.9 +/- 0.0 (jitter = 0.0)    7.972
10  images/sec: 295.5 +/- 0.4 (jitter = 0.5)    7.856
20  images/sec: 295.7 +/- 0.3 (jitter = 1.0)    7.913
30  images/sec: 295.6 +/- 0.2 (jitter = 1.1)    7.734
40  images/sec: 295.6 +/- 0.2 (jitter = 0.9)    7.968
50  images/sec: 295.4 +/- 0.1 (jitter = 1.0)    8.027
60  images/sec: 295.3 +/- 0.1 (jitter = 1.1)    7.887
70  images/sec: 295.2 +/- 0.1 (jitter = 1.1)    7.978
80  images/sec: 295.2 +/- 0.1 (jitter = 1.1)    7.811
90  images/sec: 295.1 +/- 0.1 (jitter = 1.2)    7.786
100 images/sec: 295.0 +/- 0.1 (jitter = 1.3)    7.817
----------------------------------------------------------------
total images/sec: 294.93
----------------------------------------------------------------

That gives roughly a 6-8% performance increase (about 6.1% for FP16 and 8.4% for FP32), nice work!

It might be worth mentioning that, for compatibility reasons, I tested ROCm 2.2 with TensorFlow 1.11.0 and ROCm 2.3 with TensorFlow 1.13.1, so the gain may not come from ROCm alone.
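For anyone who wants to check the arithmetic, here is a minimal sketch that computes the relative gain from the "total images/sec" lines above. The dictionary layout and the `percent_gain` helper are just illustrative names, not part of tf_cnn_benchmarks.

```python
# Totals copied from the "total images/sec" lines above
# (ResNet-50, batch size 128, Radeon VII).
totals = {
    "FP16": {"ROCm 2.2": 386.14, "ROCm 2.3": 409.76},
    "FP32": {"ROCm 2.2": 272.16, "ROCm 2.3": 294.93},
}


def percent_gain(old, new):
    """Relative throughput change from old to new, in percent."""
    return (new - old) / old * 100.0


gains = {
    precision: percent_gain(runs["ROCm 2.2"], runs["ROCm 2.3"])
    for precision, runs in totals.items()
}

for precision, gain in gains.items():
    print(f"{precision}: {gain:.1f}% faster on ROCm 2.3")
```

Keep in mind that the two runs also used different TensorFlow versions, so this number covers the whole software stack rather than ROCm in isolation.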

kinred commented 5 years ago

Radeon VII

Update with ROCm 2.4 and Tensorflow 1.13.3:

FP32

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size=128

70  images/sec: 311.0 +/- 0.1 (jitter = 0.6)    8.290
80  images/sec: 310.9 +/- 0.1 (jitter = 0.7)    8.306
90  images/sec: 310.8 +/- 0.1 (jitter = 0.7)    8.136
100 images/sec: 310.8 +/- 0.1 (jitter = 0.7)    8.447
----------------------------------------------------------------
total images/sec: 310.74
----------------------------------------------------------------

FP16

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size=128 --use_fp16=true

70  images/sec: 443.9 +/- 0.1 (jitter = 0.8)    8.272
80  images/sec: 443.7 +/- 0.1 (jitter = 0.8)    8.189
90  images/sec: 443.6 +/- 0.1 (jitter = 1.0)    8.293
100 images/sec: 443.5 +/- 0.1 (jitter = 1.1)    8.289
----------------------------------------------------------------
total images/sec: 443.42
----------------------------------------------------------------

Also RNN performance made a jump. Great improvements!
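If you save the benchmark output to a file, the numbers can be pulled out with a small parser rather than copied by hand. This is only a sketch: the regular expressions are inferred from the log lines pasted in this thread, and `parse_log` is a hypothetical helper, not something shipped with tf_cnn_benchmarks.

```python
import re

# Patterns inferred from the log format shown in this thread (an assumption,
# the format may differ between tf_cnn_benchmarks versions).
STEP_RE = re.compile(
    r"^(\d+)\s+images/sec:\s+([\d.]+)\s+\+/-\s+([\d.]+)"
    r"\s+\(jitter = ([\d.]+)\)\s+([\d.]+)$"
)
TOTAL_RE = re.compile(r"^total images/sec:\s+([\d.]+)$")


def parse_log(text):
    """Return (list of per-step dicts, total images/sec or None)."""
    steps, total = [], None
    for raw in text.splitlines():
        line = raw.strip()
        match = STEP_RE.match(line)
        if match:
            steps.append({
                "step": int(match.group(1)),
                "images_per_sec": float(match.group(2)),
                "jitter": float(match.group(4)),
                "total_loss": float(match.group(5)),
            })
            continue
        match = TOTAL_RE.match(line)
        if match:
            total = float(match.group(1))
    return steps, total


# The FP32 excerpt from the comment above.
sample = """\
70  images/sec: 311.0 +/- 0.1 (jitter = 0.6)    8.290
80  images/sec: 310.9 +/- 0.1 (jitter = 0.7)    8.306
90  images/sec: 310.8 +/- 0.1 (jitter = 0.7)    8.136
100 images/sec: 310.8 +/- 0.1 (jitter = 0.7)    8.447
----------------------------------------------------------------
total images/sec: 310.74
----------------------------------------------------------------
"""

steps, total = parse_log(sample)
print(len(steps), "step lines, total images/sec:", total)
```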

sunway513 commented 5 years ago

Hi @kinred , could you help clarify if you have changed any ROCm default settings, e.g. power target?

kinred commented 5 years ago

@sunway513, I did no specific tuning.

Running Ubuntu 18.04.2 LTS (4.15.0-48-generic) bare metal with rocm-dkms 2.4.25 packages.

After running "rocm-smi -d 0 --resetprofile" I reproducibly get the same results. A log of "rocm-smi -a" is attached.

radeon_vii_rocm_smi.log

Any specific info I could look up for you?

sunway513 commented 5 years ago

Thanks @kinred , could you set the following option and re-collect your result? /opt/rocm/bin/rocm-smi --setperf auto

kinred commented 5 years ago

Hi @sunway513, I ran the above command and it reported:

========================ROCm System Management Interface========================
================================================================================
GPU[0]      : Successfully set current Performance Level to auto
================================================================================
==============================End of ROCm SMI Log ==============================

I re-ran the benchmark and got similar results:

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --model resnet50 --batch_size=128

70  images/sec: 311.3 +/- 0.1 (jitter = 0.4)    8.281
80  images/sec: 311.2 +/- 0.1 (jitter = 0.4)    8.307
90  images/sec: 311.2 +/- 0.1 (jitter = 0.4)    8.122
100 images/sec: 311.2 +/- 0.1 (jitter = 0.4)    8.447
----------------------------------------------------------------
total images/sec: 311.12
----------------------------------------------------------------

Are the results reproducible on your side?
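For others who want to repeat this, here is a rough sketch that wraps the three rocm-smi invocations quoted in this exchange so they can be run in a fixed order before each benchmark. The argument lists are copied verbatim from the comments above. The helper names are made up for illustration, and nothing beyond those quoted flags is assumed about the rocm-smi CLI.

```python
import shutil
import subprocess

# Argument lists exactly as quoted in this thread.
RESET_PROFILE = ["-d", "0", "--resetprofile"]
SET_PERF_AUTO = ["--setperf", "auto"]
DUMP_ALL = ["-a"]


def run_rocm_smi(args, binary="rocm-smi"):
    """Run rocm-smi with args; return its stdout, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, *args], capture_output=True, text=True, check=False
    )
    return result.stdout


def planned_commands(binary="rocm-smi"):
    """Command lines in the order used above: reset, set perf level, dump state."""
    return [
        " ".join([binary, *args])
        for args in (RESET_PROFILE, SET_PERF_AUTO, DUMP_ALL)
    ]


if __name__ == "__main__":
    # Dry run: only print what would be executed.
    for command in planned_commands():
        print(command)
```

Saving the output of the last command next to each benchmark log makes it easier to tell later whether two runs really used the same clocks and power settings.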

ffleader1 commented 5 years ago

@kazulittlefox, or anyone else with an RX 580: could you please help me by benchmarking VGG16 (TF 1.12 preferably, BS=32, FP32)? I desperately need this result for my personal work, but I have a Vega 56, so I can't measure it myself.

NcuLz commented 5 years ago

Hi @ffleader1, I have an RX 580; here is my result for you. Ubuntu 18.04, ROCm 2.5, tf-1.13.1. [Screenshot 2019-06-17 20-42-53] [Screenshot 2019-06-17 20-42-40]

Daniel451 commented 5 years ago

@kinred could you (or others) add more benchmarks for the Radeon VII? How's the stability so far? Any downsides?

It would be interesting to see the performance in really deep networks that use a lot of architectural features (residual connections/concats, many different kinds of convolutions, ...). For example, could you test InceptionV3 or V4 performance?

Slightly over 300 img/s in ResNet50 sounds really good, since even the RTX 2080 Ti only reaches 326 img/s (although I only saw batch size 64 tests; the 11GB of VRAM probably does not allow for 128. You can find one example here: Exxactcorp NVIDIA RTX 2080 Ti Benchmarks).

The 2080 Ti is only leading in FP16 (over 800 img/s with batch size 128).
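To put rough numbers on that, here is a quick sketch using only figures quoted in this thread. The Radeon VII totals are the ROCm 2.4 results posted above, and the 2080 Ti values are the ones I mentioned (326 img/s at batch size 64, and "over 800" treated as a lower bound of 800). The runs differ in batch size and software stack, so take the ratios as a ballpark only.

```python
# Throughput figures quoted in this thread (ResNet-50, images/sec).
radeon_vii = {"FP32": 310.74, "FP16": 443.42}   # ROCm 2.4, batch size 128
rtx_2080_ti = {"FP32": 326.0, "FP16": 800.0}    # FP16 is a lower bound ("over 800")

# Radeon VII throughput as a fraction of the quoted 2080 Ti throughput.
relative = {
    precision: radeon_vii[precision] / rtx_2080_ti[precision]
    for precision in radeon_vii
}

for precision, ratio in relative.items():
    print(f"{precision}: Radeon VII at about {ratio:.0%} of the quoted 2080 Ti figure")
```

Because the FP16 figure for the 2080 Ti is only a lower bound, the FP16 ratio here is an upper bound for the Radeon VII.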

sebpuetz commented 5 years ago

> @kinred could you (or others) add more benchmarks for the Radeon VII? How's the stability so far? Any downsides?

I can mostly comment on stability, I don't do image processing, so I can't give insights on architectures/models used for those tasks.

You might want to check out #325, I opened the bug report more than 4 months ago. The last time something happened was about 6 weeks ago, but there's no fix or even an explanation in sight.

#414 describes some other (probably temperature-related) issues; I'm unsure whether that was resolved.

With the most recent ROCm update, my system becomes unresponsive on an RNN that previously worked fine. I haven't dug deeper into it, but I can get the system to respond again by killing the process.

I haven't seen many other people complaining about issues with the VII, so I guess, depending on your use-case, your mileage may vary.

alexanderkjeldaas commented 5 years ago

Any resnet50 benchmarks for ROCm 2.5 and TF 2.0?

WannaBeOCer commented 5 years ago

Radeon VII with ROCm 2.6 and TF 2.0

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16

Step    Img/sec total_loss
1   images/sec: 400.8 +/- 0.0 (jitter = 0.0)    8.104
10  images/sec: 399.9 +/- 0.3 (jitter = 0.5)    7.757
20  images/sec: 400.0 +/- 0.2 (jitter = 0.5)    7.913
30  images/sec: 399.8 +/- 0.2 (jitter = 0.6)    7.771
40  images/sec: 399.7 +/- 0.1 (jitter = 0.5)    7.920
50  images/sec: 399.7 +/- 0.1 (jitter = 0.6)    7.886
60  images/sec: 399.7 +/- 0.1 (jitter = 0.5)    7.710
70  images/sec: 399.7 +/- 0.1 (jitter = 0.6)    8.007
80  images/sec: 399.9 +/- 0.2 (jitter = 0.6)    7.780
90  images/sec: 400.1 +/- 0.2 (jitter = 0.7)    7.798
100 images/sec: 400.1 +/- 0.2 (jitter = 0.8)    8.035
----------------------------------------------------------------
total images/sec: 399.77
----------------------------------------------------------------

ROCm / tensorflow-upstream

Performance comparsion: AMD with ROCm vs NVIDIA with cuDNN? #173
