ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Performance comparison: AMD with ROCm vs NVIDIA with cuDNN? #173

Open NIIAS3050 opened 5 years ago

NIIAS3050 commented 5 years ago

It would be very useful to compare real training performance on AMD and NVIDIA cards. For NVIDIA cards we have a lot of graphs and tests, for example: https://github.com/u39kun/deep-learning-benchmark But for AMD cards there are no performance metrics. It would be great to make a direct comparison between AMD and NVIDIA with the latest cuDNN.

pricebenjamin commented 5 years ago

If you happen to have access to some AMD GPUs that are supported by the ROCm stack, consider running some benchmarks from the TensorFlow benchmarks repository. The README in the benchmarks/scripts/tf_cnn_benchmarks directory provides some example usage.

Those scripts were used for the benchmarks shown on TensorFlow's website.
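
For anyone who wants to try this themselves, a minimal sketch (assuming a working tensorflow-rocm install; paths are examples only):

# clone the benchmark scripts and run a single model
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50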

I've run the following on a Vega FE (tensorflow-rocm==1.11.0 and rocm-dkms==1.9.211).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

This yields the following.

[...]
Done warm up
Step    Img/sec total_loss
1   images/sec: 182.2 +/- 0.0 (jitter = 0.0)    8.325
10  images/sec: 182.3 +/- 0.1 (jitter = 0.2)    8.170
20  images/sec: 182.3 +/- 0.1 (jitter = 0.3)    8.247
30  images/sec: 182.1 +/- 0.1 (jitter = 0.3)    8.369
40  images/sec: 182.0 +/- 0.1 (jitter = 0.4)    8.401
50  images/sec: 181.9 +/- 0.1 (jitter = 0.5)    8.147
60  images/sec: 181.8 +/- 0.1 (jitter = 0.6)    8.340
70  images/sec: 181.6 +/- 0.1 (jitter = 0.7)    8.120
80  images/sec: 181.3 +/- 0.2 (jitter = 0.9)    8.415
90  images/sec: 180.5 +/- 0.3 (jitter = 1.1)    8.278
100 images/sec: 179.5 +/- 0.4 (jitter = 1.4)    8.328
----------------------------------------------------------------
total images/sec: 179.44
----------------------------------------------------------------

For comparison, here is the same command run on a Tesla P100-PCIE-16GB (CUDA==9.2, cuDNN==7.1.4, and tf.__version__ == '1.11.0'):

[...]
Done warm up
Step    Img/sec total_loss
1   images/sec: 248.6 +/- 0.0 (jitter = 0.0)    8.325
10  images/sec: 248.6 +/- 0.2 (jitter = 0.6)    8.164
20  images/sec: 248.5 +/- 0.1 (jitter = 0.8)    8.251
30  images/sec: 248.4 +/- 0.1 (jitter = 0.7)    8.355
40  images/sec: 248.3 +/- 0.1 (jitter = 0.6)    8.417
50  images/sec: 248.2 +/- 0.1 (jitter = 0.6)    8.152
60  images/sec: 248.2 +/- 0.1 (jitter = 0.6)    8.353
70  images/sec: 248.1 +/- 0.1 (jitter = 0.7)    8.109
80  images/sec: 247.7 +/- 0.1 (jitter = 0.8)    8.405
90  images/sec: 247.5 +/- 0.1 (jitter = 0.9)    8.266
100 images/sec: 247.2 +/- 0.2 (jitter = 1.2)    8.344
----------------------------------------------------------------
total images/sec: 247.13
----------------------------------------------------------------

Bear in mind, I haven't done anything to try and optimize performance on the Vega FE. These are essentially "out-of-the-box" results.

Mandrewoid commented 5 years ago

@pricebenjamin when I try to run that same script (I cloned the repo) I get an import error:

ImportError: No module named 'tensorflow.python.data.experimental'

pricebenjamin commented 5 years ago

@Mandrewoid, if you haven't already, I'd recommend checking out the branch corresponding to your version of tensorflow, e.g.

cd /path/to/benchmarks
git checkout cnn_tf_v1.11_compatible
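
If you're unsure which branch matches your install, a rough sketch (assuming the cnn_tf_vX.Y_compatible branch naming holds for your version):

# derive the branch name from the installed TensorFlow version (hypothetical helper)
TF_MINOR=$(python -c "import tensorflow as tf; print('.'.join(tf.__version__.split('.')[:2]))")
cd /path/to/benchmarks
git checkout "cnn_tf_v${TF_MINOR}_compatible"
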
Mandrewoid commented 5 years ago

Nice, that seems to have done it. I did not realize mainline TF had already advanced to 1.12; rookie mistake.

kazulittlefox commented 5 years ago

I have tried running the benchmarks in my environment (kernel 4.15, ROCm 1.9.2, TF 1.12 with an RX 580).

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=(32|64)  \ 
--model=(alexnet|inceptionv3|vgg16|googlenet|resnet50)

Results are as follows:

Model        batch=32          batch=64
AlexNet      397.27 images/sec 518.03 images/sec
InceptionV3   47.78 images/sec  50.66 images/sec
GoogLeNet    239.28 images/sec 256.05 images/sec
ResNet50      86.81 images/sec  98.57 images/sec

In my environment, VGG16 has not been running well.
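
For anyone reproducing this sweep, a minimal shell sketch (model names as accepted by your tf_cnn_benchmarks checkout; e.g. inception3 rather than inceptionv3 in some versions):

for model in alexnet inception3 vgg16 googlenet resnet50; do
  for bs in 32 64; do
    python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=$bs --model=$model
  done
done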

fshi98 commented 5 years ago

I have tested with a Vega 64, Ubuntu 18.04, ROCm 1.9.2, TF 1.12. The NVIDIA GTX 1080 Ti was tested on another machine with CUDA 10 and Ubuntu 18.04.

  1. ResNet50: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
     1080 Ti: 212 images/sec (278 fp16)
     Vega 64: 191 images/sec (190 fp16)
  2. ResNet101: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet101
     1080 Ti: 121.14 images/sec (168 fp16)
     Vega 64: 101.15 images/sec (93 fp16); with fp16, --batch_size can be 64, while with fp32, 64 crashes
  3. Inception3: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=inception3
     1080 Ti: 140.08 images/sec (166 fp16)
     Vega 64: 99.02 images/sec (50 fp16)
  4. MobileNet: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=mobilenet
     1080 Ti: 2865 images/sec
     Vega 64: 462 images/sec

There are two results that don't add up:

  1. For MobileNet, the 1080 Ti result doesn't make sense.
  2. I also tested with --use_fp16, which gives a fair amount of speedup on the 1080 Ti. However, on the Vega 64 it ends up slower in all tests when using --use_fp16. This is especially true for inception3.

Considering that Vega 64 natively supports half precision, and fp16 should be a good selling point for AMD Vega, how can it be slower with fp16? I guess this is probably due to software support, especially ROCm. Could anyone else test with --use_fp16 and see whether they get similar results?
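
If it helps, a minimal way to capture just the summary numbers for the fp32/fp16 pair (a sketch; the grep pattern assumes the "total images/sec" summary line shown in the logs above):

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 2>&1 | grep "total images/sec"
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16 2>&1 | grep "total images/sec"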

@kazulittlefox my Vega runs VGG16 smoothly at 105 images/sec.

Mandrewoid commented 5 years ago

@fshi98 that might be because of https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/pull/143#issuecomment-415798399

fshi98 commented 5 years ago

@Mandrewoid Thanks. That may be the reason. However, my rocBLAS version is 0.14.3.0, and I ran //tensorflow/python/kernel_tests:batch_matmul_op_test, passing all 47 tests in 10.653s, as in #143. I also tested and passed https://github.com/ROCmSoftwarePlatform/rocBLAS/issues/340

This may not be the same bug as #143, but rather some performance issue.

pricebenjamin commented 5 years ago

@sebpuetz Would you be willing to post some numbers for the Radeon VII, including fp16 performance? I have yet to find any cloud providers with these cards. Trying to get some info for #288.

sebpuetz commented 5 years ago


Radeon VII
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip
no further tuning

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step    Img/sec total_loss
1   images/sec: 190.3 +/- 0.0 (jitter = 0.0)    8.217
10  images/sec: 195.7 +/- 0.9 (jitter = 3.1)    8.123
20  images/sec: 196.4 +/- 0.5 (jitter = 1.8)    8.231
30  images/sec: 196.8 +/- 0.4 (jitter = 1.1)    8.268
40  images/sec: 197.1 +/- 0.3 (jitter = 0.9)    8.355
50  images/sec: 197.2 +/- 0.2 (jitter = 0.8)    8.013
60  images/sec: 197.3 +/- 0.2 (jitter = 0.7)    8.263
70  images/sec: 196.8 +/- 0.3 (jitter = 1.1)    8.304
80  images/sec: 196.9 +/- 0.2 (jitter = 1.1)    8.228
90  images/sec: 196.9 +/- 0.2 (jitter = 0.9)    8.283
100 images/sec: 197.0 +/- 0.2 (jitter = 0.8)    8.271
----------------------------------------------------------------
total images/sec: 196.98
----------------------------------------------------------------

FP16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50  --use_fp16
Step    Img/sec total_loss
1   images/sec: 262.9 +/- 0.0 (jitter = 0.0)    8.162
10  images/sec: 261.9 +/- 0.6 (jitter = 0.7)    8.211
20  images/sec: 260.4 +/- 0.6 (jitter = 2.6)    8.375
30  images/sec: 260.6 +/- 0.5 (jitter = 2.6)    8.264
40  images/sec: 259.6 +/- 0.6 (jitter = 3.1)    8.116
50  images/sec: 259.6 +/- 0.5 (jitter = 3.1)    8.169
60  images/sec: 259.9 +/- 0.5 (jitter = 2.6)    8.325
70  images/sec: 259.3 +/- 0.5 (jitter = 3.5)    8.374
80  images/sec: 259.4 +/- 0.4 (jitter = 3.4)    8.041
90  images/sec: 259.3 +/- 0.4 (jitter = 3.6)    8.298
100 images/sec: 259.4 +/- 0.3 (jitter = 3.5)    8.376
----------------------------------------------------------------
total images/sec: 259.29
----------------------------------------------------------------

This one made the GPU sound like a jet engine:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step    Img/sec total_loss
1   images/sec: 216.3 +/- 0.0 (jitter = 0.0)    8.219
10  images/sec: 215.9 +/- 0.3 (jitter = 0.3)    8.289
20  images/sec: 216.0 +/- 0.2 (jitter = 0.3)    8.064
30  images/sec: 215.9 +/- 0.1 (jitter = 0.3)    8.310
40  images/sec: 215.9 +/- 0.1 (jitter = 0.3)    8.197
50  images/sec: 215.9 +/- 0.1 (jitter = 0.3)    8.277
60  images/sec: 215.7 +/- 0.1 (jitter = 0.4)    8.162
70  images/sec: 215.7 +/- 0.1 (jitter = 0.4)    8.159
80  images/sec: 215.7 +/- 0.1 (jitter = 0.4)    8.139
90  images/sec: 215.7 +/- 0.1 (jitter = 0.4)    8.196
100 images/sec: 215.7 +/- 0.1 (jitter = 0.4)    8.163
----------------------------------------------------------------
total images/sec: 215.72
----------------------------------------------------------------

FP16:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step    Img/sec total_loss
1   images/sec: 288.2 +/- 0.0 (jitter = 0.0)    8.209
10  images/sec: 283.8 +/- 1.1 (jitter = 2.7)    8.189
20  images/sec: 284.0 +/- 0.9 (jitter = 4.6)    8.316
30  images/sec: 284.9 +/- 0.7 (jitter = 4.5)    8.195
40  images/sec: 284.5 +/- 0.6 (jitter = 4.0)    8.180
50  images/sec: 284.3 +/- 0.5 (jitter = 3.7)    8.402
60  images/sec: 285.0 +/- 0.5 (jitter = 4.8)    8.271
70  images/sec: 285.4 +/- 0.4 (jitter = 3.7)    8.134
80  images/sec: 285.7 +/- 0.4 (jitter = 2.7)    8.299
90  images/sec: 286.0 +/- 0.4 (jitter = 1.5)    8.349
100 images/sec: 286.2 +/- 0.3 (jitter = 1.4)    8.213
----------------------------------------------------------------
total images/sec: 286.17
----------------------------------------------------------------
sunway513 commented 5 years ago

Hi @sebpuetz , maybe you can also try to enable fusion support :-) Doc is as following: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/rocm-port-overview.md#fusion-support

sebpuetz commented 5 years ago

Improvements across the board with TF_ROCM_FUSION_ENABLE=1. The reported temperature in rocm-smi went above 90°C on all tests; the rocm-smi output didn't include clocks, so I can't tell whether any thermal throttling was happening.
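
One way to check for throttling is to poll rocm-smi in a second terminal while the benchmark runs (a rough sketch; the columns in the default summary vary between ROCm versions):

# log the default rocm-smi summary (temperature, SCLK/MCLK, fan, load) once per second
while true; do
  /opt/rocm/bin/rocm-smi >> rocm_smi_log.txt
  sleep 1
done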

TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step    Img/sec total_loss
1   images/sec: 208.4 +/- 0.0 (jitter = 0.0)    8.217
10  images/sec: 207.6 +/- 0.5 (jitter = 0.5)    8.124
20  images/sec: 207.7 +/- 0.3 (jitter = 0.5)    8.235
30  images/sec: 207.3 +/- 0.4 (jitter = 0.4)    8.268
40  images/sec: 207.2 +/- 0.4 (jitter = 0.4)    8.357
50  images/sec: 207.2 +/- 0.4 (jitter = 0.4)    8.012
60  images/sec: 207.2 +/- 0.3 (jitter = 0.4)    8.248
70  images/sec: 207.1 +/- 0.3 (jitter = 0.4)    8.305
80  images/sec: 207.0 +/- 0.3 (jitter = 0.5)    8.223
90  images/sec: 205.7 +/- 0.9 (jitter = 0.5)    8.322
100 images/sec: 205.7 +/- 0.8 (jitter = 0.5)    8.268
----------------------------------------------------------------
total images/sec: 205.65
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16
Step    Img/sec total_loss
1   images/sec: 273.0 +/- 0.0 (jitter = 0.0)    8.171
10  images/sec: 272.6 +/- 0.9 (jitter = 1.0)    8.223
20  images/sec: 271.5 +/- 1.1 (jitter = 0.9)    8.375
30  images/sec: 272.0 +/- 0.8 (jitter = 0.9)    8.282
40  images/sec: 272.1 +/- 0.6 (jitter = 0.9)    8.122
50  images/sec: 272.1 +/- 0.6 (jitter = 0.8)    8.144
60  images/sec: 272.0 +/- 0.5 (jitter = 0.8)    8.333
70  images/sec: 271.5 +/- 0.5 (jitter = 1.0)    8.357
80  images/sec: 271.2 +/- 0.5 (jitter = 1.3)    8.034
90  images/sec: 271.2 +/- 0.4 (jitter = 1.3)    8.289
100 images/sec: 270.9 +/- 0.4 (jitter = 1.5)    8.361
----------------------------------------------------------------
total images/sec: 270.81
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step    Img/sec total_loss
1   images/sec: 227.7 +/- 0.0 (jitter = 0.0)    8.221
10  images/sec: 225.6 +/- 0.5 (jitter = 2.2)    8.289
20  images/sec: 225.5 +/- 0.4 (jitter = 1.9)    8.068
30  images/sec: 225.7 +/- 0.3 (jitter = 1.8)    8.304
40  images/sec: 225.4 +/- 0.5 (jitter = 1.2)    8.183
50  images/sec: 225.5 +/- 0.4 (jitter = 1.0)    8.261
60  images/sec: 225.6 +/- 0.4 (jitter = 1.1)    8.203
70  images/sec: 225.6 +/- 0.3 (jitter = 1.1)    8.165
80  images/sec: 225.6 +/- 0.3 (jitter = 1.0)    8.168
90  images/sec: 225.7 +/- 0.3 (jitter = 1.0)    8.196
100 images/sec: 225.6 +/- 0.2 (jitter = 1.1)    8.138
----------------------------------------------------------------
total images/sec: 225.62
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step    Img/sec total_loss
1   images/sec: 302.0 +/- 0.0 (jitter = 0.0)    8.213
10  images/sec: 300.2 +/- 0.5 (jitter = 1.5)    8.181
20  images/sec: 298.7 +/- 0.8 (jitter = 2.5)    8.324
30  images/sec: 297.7 +/- 0.8 (jitter = 2.2)    8.197
40  images/sec: 297.7 +/- 0.6 (jitter = 3.0)    8.173
50  images/sec: 297.9 +/- 0.6 (jitter = 3.0)    8.400
60  images/sec: 297.9 +/- 0.5 (jitter = 3.0)    8.267
70  images/sec: 298.4 +/- 0.5 (jitter = 2.8)    8.140
80  images/sec: 298.6 +/- 0.4 (jitter = 2.7)    8.283
90  images/sec: 298.6 +/- 0.4 (jitter = 2.8)    8.337
100 images/sec: 298.7 +/- 0.4 (jitter = 2.6)    8.208
----------------------------------------------------------------
total images/sec: 298.60
----------------------------------------------------------------
sunway513 commented 5 years ago

Hi @sebpuetz, thanks for the update! However, the performance numbers don't seem right. Can you provide the VBIOS version of your board? The following command will show it:

/opt/rocm/bin/rocm-smi -v

sebpuetz commented 5 years ago
/opt/rocm/bin/rocm-smi -v 
GPU[0]      : VBIOS version: 113-D3600200-105
WrightChen commented 5 years ago

Radeon RX Vega 64
memoryClockRate (GHz): 1.63
Total memory: 7.98GiB
Free memory: 7.73GiB
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip

For some models, the option TF_ROCM_FUSION_ENABLE=1 doesn't change much, so I'm not giving FUSION=1 results for those. Due to lack of memory, some models can't run with batch_size=128.

All numbers are images/sec.

                           ResNet50  AlexNet  Inception v3  VGG16   GoogLeNet  ResNet152
batch_size=512             /         1573.01  /             /       /          /
batch_size=256             /         1420.65  /             /       /          /
batch_size=128             /         1345.73  /             /       498.73     /
batch_size=64              190.58    1151.98  103.82        101.95  474.07     /
batch_size=32              171.70    971.85   98.50         91.80   424.32     68.71
batch_size=128; FUSION=1   /         /        /             /       /          /
batch_size=64; FUSION=1    208.78    /        109.66        /       /          /
batch_size=32; FUSION=1    187.76    /        105.20        /       /          75.81
sunway513 commented 5 years ago

Hi @sebpuetz, could you try to refresh your performance numbers using our official docker image? If you haven't set up Docker yet, the following should do it:

curl -sSL https://get.docker.com/ | sh

To run the benchmarks inside the docker image:

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx -v /data/imagenet/tf:/imagenet'
drun rocm/tensorflow:rocm2.1-tf1.12-python3
cd ~/benchmarks/scripts/tf_cnn_benchmarks
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

Thanks for your attention, and looking forward to your updates :-)

jimdowling commented 5 years ago

6-core Intel i7 8700 with 16GB RAM and a 400GB SSD
Radeon VII
rocm==2.1.96 installed through apt
tensorflow==1.12 installed through pip
no further tuning

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step    Img/sec total_loss
1   images/sec: 250.0 +/- 0.0 (jitter = 0.0)    8.348
10  images/sec: 248.0 +/- 1.4 (jitter = 0.7)    8.144
20  images/sec: 248.7 +/- 0.8 (jitter = 0.4)    8.440
30  images/sec: 248.8 +/- 0.6 (jitter = 0.4)    8.140
40  images/sec: 248.7 +/- 0.6 (jitter = 0.4)    8.474
50  images/sec: 248.5 +/- 0.5 (jitter = 0.4)    8.322
60  images/sec: 248.5 +/- 0.5 (jitter = 0.5)    8.317
70  images/sec: 248.5 +/- 0.4 (jitter = 0.6)    8.010
80  images/sec: 248.4 +/- 0.4 (jitter = 0.6)    8.272
90  images/sec: 248.5 +/- 0.4 (jitter = 0.6)    8.289
100 images/sec: 248.4 +/- 0.3 (jitter = 0.6)    8.108

total images/sec: 248.34

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step    Img/sec total_loss
1   images/sec: 265.1 +/- 0.0 (jitter = 0.0)    8.324
10  images/sec: 264.3 +/- 0.5 (jitter = 0.3)    8.168
20  images/sec: 264.5 +/- 0.3 (jitter = 0.2)    8.261
30  images/sec: 264.4 +/- 0.3 (jitter = 0.3)    8.377
40  images/sec: 264.2 +/- 0.2 (jitter = 0.4)    8.408
50  images/sec: 264.1 +/- 0.2 (jitter = 0.5)    8.160
60  images/sec: 263.9 +/- 0.2 (jitter = 0.6)    8.341
70  images/sec: 263.8 +/- 0.2 (jitter = 0.6)    8.107
80  images/sec: 263.8 +/- 0.2 (jitter = 0.8)    8.404
90  images/sec: 263.8 +/- 0.2 (jitter = 0.7)    8.296
100 images/sec: 263.7 +/- 0.2 (jitter = 0.6)    8.348

total images/sec: 263.65

With a batch size of 256, I get out-of-memory errors. Funnily enough, with a batch size of 155 it works, but it is slower.

TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=155 --model=resnet50

Step    Img/sec total_loss
1   images/sec: 195.3 +/- 0.0 (jitter = 0.0)    8.394
10  images/sec: 194.6 +/- 0.7 (jitter = 0.6)    8.313
20  images/sec: 194.5 +/- 0.5 (jitter = 0.6)    8.154
30  images/sec: 194.4 +/- 0.3 (jitter = 0.7)    8.249
40  images/sec: 194.5 +/- 0.3 (jitter = 0.8)    8.165
50  images/sec: 194.4 +/- 0.2 (jitter = 1.0)    8.292
60  images/sec: 194.3 +/- 0.2 (jitter = 1.0)    8.340
70  images/sec: 194.3 +/- 0.2 (jitter = 0.9)    8.268
80  images/sec: 194.2 +/- 0.2 (jitter = 0.8)    8.227
90  images/sec: 194.2 +/- 0.2 (jitter = 0.8)    8.257
100 images/sec: 194.1 +/- 0.2 (jitter = 0.9)    8.183

total images/sec: 194.04

jimdowling commented 5 years ago

Leaving out TC_ROCM_FUSION_ENABLE does not make any difference.

/opt/rocm/bin/rocm-smi -v
VBIOS version: 113-D3600200-105

jimdowling commented 5 years ago

According to this blog, https://www.pugetsystems.com/labs/hpc/NVIDIA-RTX-2080-Ti-vs-2080-vs-1080-Ti-vs-Titan-V-TensorFlow-Performance-with-CUDA-10-0-1247/, the 2080Ti gets 280 images/sec and the 1080Ti gets 207 images/sec for FP32 training.

jimdowling commented 5 years ago

One more:

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step    Img/sec total_loss
1   images/sec: 377.7 +/- 0.0 (jitter = 0.0)    8.246
10  images/sec: 375.9 +/- 2.2 (jitter = 0.7)    8.261
20  images/sec: 377.9 +/- 1.2 (jitter = 0.9)    8.279
30  images/sec: 378.3 +/- 0.9 (jitter = 0.9)    8.365
40  images/sec: 378.2 +/- 0.7 (jitter = 0.5)    8.237
50  images/sec: 378.3 +/- 0.6 (jitter = 0.4)    8.295
60  images/sec: 378.4 +/- 0.5 (jitter = 0.4)    8.203
70  images/sec: 378.4 +/- 0.5 (jitter = 0.5)    8.129
80  images/sec: 377.9 +/- 0.6 (jitter = 0.6)    8.264
90  images/sec: 378.0 +/- 0.5 (jitter = 0.8)    8.163
100 images/sec: 377.9 +/- 0.5 (jitter = 0.8)    8.239

total images/sec: 377.79

Sumenia commented 5 years ago

@jimdowling that's some impressive perf !

sebpuetz commented 5 years ago

@jimdowling these numbers seem substantially higher than the ones I got, what OS and kernel are you on?

sebpuetz commented 5 years ago

Hi, I executed the benchmarks in the docker container:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step    Img/sec total_loss
1   images/sec: 229.7 +/- 0.0 (jitter = 0.0)    8.221
10  images/sec: 225.4 +/- 0.8 (jitter = 2.7)    8.289
20  images/sec: 225.9 +/- 0.5 (jitter = 3.6)    8.054
30  images/sec: 226.6 +/- 0.4 (jitter = 2.1)    8.313
40  images/sec: 226.9 +/- 0.3 (jitter = 0.8)    8.187
50  images/sec: 227.2 +/- 0.3 (jitter = 0.7)    8.240
60  images/sec: 227.3 +/- 0.2 (jitter = 0.5)    8.192
70  images/sec: 227.4 +/- 0.2 (jitter = 0.5)    8.143
80  images/sec: 227.6 +/- 0.2 (jitter = 0.5)    8.150
90  images/sec: 227.6 +/- 0.2 (jitter = 0.5)    8.217
100 images/sec: 227.7 +/- 0.2 (jitter = 0.5)    8.163
----------------------------------------------------------------
total images/sec: 227.66
----------------------------------------------------------------

and

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step    Img/sec total_loss
1   images/sec: 300.8 +/- 0.0 (jitter = 0.0)    8.205
10  images/sec: 300.3 +/- 0.4 (jitter = 0.2)    8.170
20  images/sec: 300.3 +/- 0.3 (jitter = 0.5)    8.317
30  images/sec: 300.5 +/- 0.2 (jitter = 0.6)    8.201
40  images/sec: 300.6 +/- 0.2 (jitter = 0.5)    8.176
50  images/sec: 300.5 +/- 0.2 (jitter = 0.5)    8.398
60  images/sec: 300.3 +/- 0.2 (jitter = 0.5)    8.268
70  images/sec: 300.3 +/- 0.2 (jitter = 0.6)    8.140
80  images/sec: 300.4 +/- 0.2 (jitter = 0.6)    8.279
90  images/sec: 300.4 +/- 0.2 (jitter = 0.6)    8.328
100 images/sec: 300.3 +/- 0.2 (jitter = 0.6)    8.214
----------------------------------------------------------------
total images/sec: 300.29
----------------------------------------------------------------

@sunway513 these numbers are still pretty far away from what @jimdowling got, do you see a reason for this to happen?

jimdowling commented 5 years ago

Ubuntu 18.04. Python 2.7. Kernel is 4.15. I was not running Docker - bare metal.

sunway513 commented 5 years ago

Hi @jimdowling, thanks for your posting! However, it seems there's a typo in your script, so TF fusion is not actually enabled there. Could you try the following commands again?

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16

If fusion is enabled, you should see the following message at run time:

2019-02-21 13:41:32.304325: I tensorflow/core/graph/gpu_fusion_pass.cc:454] ROCm Fusion is enabled.
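
A quick way to confirm the flag is taking effect is to grep the startup log for that message (sketch; the exact wording may differ between versions):

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 2>&1 | grep "ROCm Fusion is enabled"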

sunway513 commented 5 years ago

Hi @sebpuetz, thanks for your updated numbers with docker! In a parallel issue you mentioned your system is Linux Mint 19.1; is that the same OS you ran the benchmark on? May I know the kernel and driver version of your configuration? The following commands would help:

uname -a
apt --installed list | grep rock-dkms

I believe your user-space components are properly configured, since you got similar perf numbers using our official docker image, and the VBIOS version is good as well. We need to look into the kernel and firmware.

sebpuetz commented 5 years ago

Hi @sunway513 , I ran all benchmarks on Linux Mint 19.1

uname -a
Linux seb-desktop 4.20.7-042007-generic #201902061234 SMP Wed Feb 6 17:36:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
apt list --installed | grep rock-dkms
rock-dkms/Ubuntu 16.04,now 2.1-96 all [installed]

Linux Mint 19.1 is based on Ubuntu 18.04, so that looks like a mismatch here?

ghostplant commented 5 years ago

@sunway513

I am also using an RX Vega 64, but I get this warning:

2019-02-21 14:26:23.732074: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:27.702436: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:29.084753: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:33.818470: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:33.839322: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter

And the performance is about 10% lower than others' benchmarks:

Step    Img/sec total_loss
1       images/sec: 182.8 +/- 0.0 (jitter = 0.0)        8.217
10      images/sec: 187.2 +/- 0.9 (jitter = 0.7)        8.122
20      images/sec: 187.3 +/- 0.5 (jitter = 0.7)        8.229
30      images/sec: 187.1 +/- 0.4 (jitter = 0.9)        8.264
40      images/sec: 187.0 +/- 0.4 (jitter = 0.9)        8.347
50      images/sec: 187.0 +/- 0.3 (jitter = 1.1)        8.014
60      images/sec: 187.0 +/- 0.3 (jitter = 1.0)        8.264
70      images/sec: 186.8 +/- 0.3 (jitter = 1.1)        8.316
80      images/sec: 186.7 +/- 0.3 (jitter = 1.1)        8.231
90      images/sec: 186.7 +/- 0.2 (jitter = 1.2)        8.305

But it should be around 207 images/sec. Is it affected by the warning above, and how can I fix the performance?

sunway513 commented 5 years ago

Hi @sebpuetz, yes, there is a mismatch here. Theoretically, you don't need to install the rock-dkms driver if your system is on a 4.20 kernel; you can refer to the following doc: https://github.com/RadeonOpenCompute/ROCm#rocm-support-in-upstream-linux-kernels The behavior can be "undefined" if you install the rock-dkms driver on a 4.20 kernel, as that's not in our test coverage. Can you open another GitHub issue so we can track it there?

ghostplant commented 5 years ago

@WrightChen @sebpuetz Can you share your Linux kernel and rocm-dkms versions? I am using rocm-dkms==2.0.89 and Linux kernel 4.15.

sunway513 commented 5 years ago

Hi @ghostplant, the warning is normal in ROCm 2.1; it comes from the upstream LLVM compiler and shouldn't affect performance. To triage your perf issue, could you run your benchmarks inside the docker container and provide us the kernel/rock-dkms versions of your bare-metal system?

ghostplant commented 5 years ago

@sunway513 The docker image I am using is rocm/tensorflow:rocm2.1-tf1.12-python3, but the rocm-dkms version outside docker is 2.0.89.

sunway513 commented 5 years ago

@ghostplant, missed your last message ;-) Can you post the output of the following command:

cat /sys/module/amdgpu/parameters/noretry

ghostplant commented 5 years ago

@sunway513 Yeah, it is:

root@amdgpu:~# cat /sys/module/amdgpu/parameters/noretry
0
sunway513 commented 5 years ago

@ghostplant, thanks, that's the issue. In ROCm 2.0 the kernel driver disables the noretry bit by default, which can impact performance on gfx900/gfx906; the issue is fixed in ROCm 2.1. To enable the noretry bit, you can try the following commands:

sudo -s
echo 1 > /sys/module/amdgpu/parameters/noretry
exit

I'd suggest you upgrade to ROCm 2.1 to permanently resolve it. For ROCm 2.0, the following commands will enable noretry as a default kernel option:

sudo sh -c 'echo "options amdgpu noretry=1"        >> /etc/modprobe.d/amdgpu.conf'
sudo update-initramfs -u
sudo reboot
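
After the reboot, the setting can be verified with the same sysfs read as earlier in this thread; it should now print 1:

cat /sys/module/amdgpu/parameters/noretry
# expected output once the option is applied: 1
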
ghostplant commented 5 years ago

@sunway513 Yeah, it does get a slight boost of around 5% in the fusion-enabled case. With fusion enabled, ResNet50 performance rises from 187.0 to 202.0 images/sec:

40      images/sec: 202.0 +/- 0.3 (jitter = 0.9)        8.379
50      images/sec: 202.0 +/- 0.3 (jitter = 1.0)        8.007
60      images/sec: 201.9 +/- 0.3 (jitter = 1.0)        8.285

If I disable fusion, the performance of resnet50 is still bad:

40      images/sec: 181.7 +/- 0.3 (jitter = 0.8)        8.359
50      images/sec: 181.7 +/- 0.3 (jitter = 0.9)        7.994
60      images/sec: 181.6 +/- 0.3 (jitter = 0.9)        8.268
70      images/sec: 181.5 +/- 0.2 (jitter = 0.8)        8.291

I wonder how to reach the benchmark of 208 images/sec, or even 226.6 images/sec. Any other possibilities? I have updated rocm-dkms to 2.1, fully rebooted the system, and set noretry=1.

Thanks!

ghostplant commented 5 years ago

@sunway513 Do you think I should upgrade LLVM and compile a native TensorFlow build from source?

ghostplant commented 5 years ago

I noticed that kernel fusion makes the NASNet model very slow:

TF_ROCM_FUSION_ENABLE=1 python3 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=nasnet --forward_only

while disabling fusion is far better:

unset TF_ROCM_FUSION_ENABLE
python3 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=nasnet --forward_only
sunway513 commented 5 years ago

Hi @ghostplant, great to know the perf improved a bit with noretry turned on :-) The following factors can affect performance:

  1. System cooling: TF workloads can push the GPU temperature to a pretty high level. If the GPU temperature reaches a critical level, DPM will downgrade the GPU core or device memory clocks and therefore impact performance.
  2. CPU and system memory: depending on your workload, CPU and system memory clocks can impact performance. TensorFlow manages its own threadpool, which maps to all visible resources by default (see the sketch after this list).
  3. Graphics rendering: if the GPU is used for graphics rendering in parallel with deep-learning workloads, performance can be affected as well, depending on the situation.
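
As an illustration of point 2, the benchmark script exposes thread-count flags that can be pinned explicitly instead of letting TF span every visible core (a sketch; flag names assume a recent benchmarks checkout and the values are examples only):

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 \
  --num_intra_threads=6 --num_inter_threads=2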

Rebuilding TensorFlow from source could help as well; the following commit may be related: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/commit/cd33a983227f33cdf30cff8036227afa641738a2
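
For reference, a rough outline of a from-source ROCm build (a sketch only; the authoritative steps are in this repo's rocm_docs and may differ by release):

git clone https://github.com/ROCmSoftwarePlatform/tensorflow-upstream.git
cd tensorflow-upstream
./configure    # enable ROCm support when prompted
bazel build --config=rocm //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl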

sunway513 commented 5 years ago

@ghostplant, yes, NASNet is known to be slower with fusion enabled; we've been looking into optimizations for the NASNet model.

ghostplant commented 5 years ago

@sunway513 Thanks! I tried killing the Xorg graphics service, which didn't make any difference, but I'll try the last suggestion, which seems more promising.

jlgreathouse commented 5 years ago

@sebpuetz if you are running kernel 4.20, then I suspect you are not running our rock-dkms driver. You can have the package installed (which you've checked), but that does not mean the driver is successfully built and loaded. As far as I know, the rock-dkms kernel module does not compile on 4.20.

If you run dkms status, I suspect you will see that amdgpu is not listed as "installed" but rather as something like "added". If that's the case, you're using the 4.20 upstream driver rather than the ROCm 2.1 DKMS driver. This could be the source of your performance differences.
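
A quick check along those lines (output format varies by distro):

dkms status | grep amdgpu   # "installed" means the DKMS module built; "added" means it did not
uname -r                    # the running kernel the module would have to match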

sebpuetz commented 5 years ago

@jlgreathouse thanks for the explanation; the rock-dkms driver indeed fails to build on 4.20 when installing through apt. I am going to switch from Linux Mint to Ubuntu 18.04, which should be supported without having to use an upstream kernel. I'll check back with new benchmark scores once I've got everything up and running.

sebpuetz commented 5 years ago

@sunway513 I reran the benchmarks in the docker container on Ubuntu 18.04 with kernel 4.18, numbers are much better now:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Step    Img/sec total_loss
1   images/sec: 294.4 +/- 0.0 (jitter = 0.0)    8.220
10  images/sec: 294.3 +/- 0.4 (jitter = 0.5)    8.282
20  images/sec: 294.3 +/- 0.2 (jitter = 0.5)    8.041
30  images/sec: 294.4 +/- 0.2 (jitter = 0.4)    8.321
40  images/sec: 294.3 +/- 0.2 (jitter = 0.5)    8.180
50  images/sec: 294.3 +/- 0.1 (jitter = 0.4)    8.237
60  images/sec: 294.3 +/- 0.1 (jitter = 0.4)    8.183
70  images/sec: 294.2 +/- 0.1 (jitter = 0.5)    8.155
80  images/sec: 294.1 +/- 0.1 (jitter = 0.4)    8.149
90  images/sec: 294.1 +/- 0.1 (jitter = 0.4)    8.209
100 images/sec: 294.1 +/- 0.1 (jitter = 0.4)    8.161
----------------------------------------------------------------
total images/sec: 294.02
----------------------------------------------------------------

and FP16:

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step    Img/sec total_loss
1   images/sec: 414.6 +/- 0.0 (jitter = 0.0)    8.208
10  images/sec: 414.3 +/- 1.4 (jitter = 1.9)    8.187
20  images/sec: 415.3 +/- 0.8 (jitter = 1.2)    8.316
30  images/sec: 413.6 +/- 0.9 (jitter = 3.4)    8.176
40  images/sec: 413.3 +/- 0.7 (jitter = 3.6)    8.179
50  images/sec: 413.8 +/- 0.6 (jitter = 3.1)    8.379
60  images/sec: 414.3 +/- 0.6 (jitter = 2.0)    8.283
70  images/sec: 414.6 +/- 0.5 (jitter = 1.4)    8.128
80  images/sec: 414.8 +/- 0.5 (jitter = 1.1)    8.286
90  images/sec: 415.0 +/- 0.4 (jitter = 1.0)    8.345
100 images/sec: 415.1 +/- 0.4 (jitter = 0.9)    8.210
----------------------------------------------------------------
total images/sec: 415.03
----------------------------------------------------------------
WrightChen commented 5 years ago

@ghostplant linux kernel = Linux version 4.15.0-45-generic (buildd@lgw01-amd64-031) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019

rocm-dkms version = 2.1.96

geekboood commented 5 years ago

Hi! Could you guys try some RNNs or reinforcement learning to see how well ROCm performs on those workloads? Here is my attempt. The commands are:

pip install -q -U tensor2tensor
t2t-datagen --problem=squad --data_dir=~/t2t_data --tmp_dir=~/t2t_data/tmp
t2t-trainer --data_dir=~/t2t_data --problem=squad --model=universal_transformer --hparams_set=universal_transformer_base --output_dir=~/output

And my results are:

GTX 1060: about 4.1 global steps/sec
RX 480: about 2.2 global steps/sec
K80 (Colab): about 3.1 global steps/sec

The K80 was used in Colab, so that number may be inaccurate.

ghostplant commented 5 years ago

@geekboood Is there a benchmark script that uses synthetic data as input for RNNs?

geekboood commented 5 years ago

@ghostplant I think TensorFlow must provide such a benchmark script, but my attempt is a real scenario you might encounter when trying something new.

ulyssesrr commented 5 years ago

@geekboood RX580 8GB inside Docker (Linux 4.20.12-arch1-1-ARCH) (Upstream rock driver)

First 5 global_step/sec reported: 2.51917, 2.72199, 2.68408, 2.68637, 2.66961
Mean: 2.656244; mean discarding first: 2.6905125

Running with TF_ROCM_FUSION_ENABLE=1:
First 5 global_step/sec reported: 2.53731, 2.73576, 2.76201, 2.70581, 2.71649
Mean: 2.691476; mean discarding first: 2.7300175

MenSanYan commented 5 years ago

If I want to build a computer for running TensorFlow, what kind of card would be the most economical choice (Radeon VII or NVIDIA)?