If you happen to have access to some AMD GPUs that are supported by the ROCm stack, consider running some benchmarks from the TensorFlow benchmarks repository. The README in the benchmarks/scripts/tf_cnn_benchmarks directory provides some example usage.
Those scripts were used for the benchmarks shown on TensorFlow's website.
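For anyone starting from scratch, getting the scripts in place looks roughly like this (a sketch assuming the upstream tensorflow/benchmarks repository):
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
# List the available flags and models; the README covers these in detail
python tf_cnn_benchmarks.py --help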
I've run the following on a Vega FE (tensorflow-rocm==1.11.0 and rocm-dkms==1.9.211).
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
This yields the following.
[...]
Done warm up
Step Img/sec total_loss
1 images/sec: 182.2 +/- 0.0 (jitter = 0.0) 8.325
10 images/sec: 182.3 +/- 0.1 (jitter = 0.2) 8.170
20 images/sec: 182.3 +/- 0.1 (jitter = 0.3) 8.247
30 images/sec: 182.1 +/- 0.1 (jitter = 0.3) 8.369
40 images/sec: 182.0 +/- 0.1 (jitter = 0.4) 8.401
50 images/sec: 181.9 +/- 0.1 (jitter = 0.5) 8.147
60 images/sec: 181.8 +/- 0.1 (jitter = 0.6) 8.340
70 images/sec: 181.6 +/- 0.1 (jitter = 0.7) 8.120
80 images/sec: 181.3 +/- 0.2 (jitter = 0.9) 8.415
90 images/sec: 180.5 +/- 0.3 (jitter = 1.1) 8.278
100 images/sec: 179.5 +/- 0.4 (jitter = 1.4) 8.328
----------------------------------------------------------------
total images/sec: 179.44
----------------------------------------------------------------
For comparison, here's the same command run on a Tesla P100-PCIE-16GB (CUDA==9.2, cuDNN==7.1.4, and tf.__version__ == '1.11.0'):
[...]
Done warm up
Step Img/sec total_loss
1 images/sec: 248.6 +/- 0.0 (jitter = 0.0) 8.325
10 images/sec: 248.6 +/- 0.2 (jitter = 0.6) 8.164
20 images/sec: 248.5 +/- 0.1 (jitter = 0.8) 8.251
30 images/sec: 248.4 +/- 0.1 (jitter = 0.7) 8.355
40 images/sec: 248.3 +/- 0.1 (jitter = 0.6) 8.417
50 images/sec: 248.2 +/- 0.1 (jitter = 0.6) 8.152
60 images/sec: 248.2 +/- 0.1 (jitter = 0.6) 8.353
70 images/sec: 248.1 +/- 0.1 (jitter = 0.7) 8.109
80 images/sec: 247.7 +/- 0.1 (jitter = 0.8) 8.405
90 images/sec: 247.5 +/- 0.1 (jitter = 0.9) 8.266
100 images/sec: 247.2 +/- 0.2 (jitter = 1.2) 8.344
----------------------------------------------------------------
total images/sec: 247.13
----------------------------------------------------------------
Bear in mind, I haven't done anything to try and optimize performance on the Vega FE. These are essentially "out-of-the-box" results.
@pricebenjamin when I try to run that same script (I cloned the repo), I get an import error:
ImportError: No module named 'tensorflow.python.data.experimental'
@Mandrewoid, if you haven't already, I'd recommend checking out the branch corresponding to your version of tensorflow, e.g.
cd /path/to/benchmarks
git checkout cnn_tf_v1.11_compatible
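If you're not sure which compatibility branches exist, you can list them first (a quick sketch using standard git commands; branch names follow the cnn_tf_vX.Y_compatible pattern):
cd /path/to/benchmarks
# List remote branches and keep only the TF-version-compatible ones
git branch -r | grep cnn_tf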
Nice, that seems to have done it. I did not realize mainline TF had already advanced to 1.12; rookie mistake.
I have tried running benchmarks in my environment (kernel 4.15, ROCm 1.9.2, TF 1.12 with an RX 580).
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=(32|64) \
--model=(alexnet|inceptionv3|vgg16|googlenet|resnet50)
The results are as follows (the shorthand above is expanded into the loop sketched at the end of this comment):
Model | batch_size=32 (images/sec) | batch_size=64 (images/sec)
---|---|---
AlexNet | 397.27 | 518.03
InceptionV3 | 47.78 | 50.66
GoogLeNet | 239.28 | 256.05
ResNet50 | 86.81 | 98.57
In my environment, VGG16 did not run well.
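The (32|64) and (alexnet|...|resnet50) notation above is shell-style shorthand for separate runs, not a literal command; a minimal bash loop to reproduce them might look like this (the model names are an assumption and should match what your tf_cnn_benchmarks version accepts):
# One benchmark run per (model, batch size) pair, logging each separately
for model in alexnet inceptionv3 vgg16 googlenet resnet50; do
  for bs in 32 64; do
    python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=$bs \
      --model=$model | tee "${model}_bs${bs}.log"
  done
done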
I have tested with a Vega 64, Ubuntu 18.04, ROCm 1.9.2, TF 1.12:
1. resnet50: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
1080ti: 212 images/sec (278 fp16)
vega64: 191 images/sec (190 fp16)
2. resnet101: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet101
1080ti: 121.14 images/sec (168 fp16)
vega64: 101.15 images/sec (93 fp16); with fp16, --batch_size can be 64, while with fp32, 64 crashes
3. mobilenet: python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=mobilenet
1080ti: 2865 images/sec
vega64: 462 images/sec
The NVIDIA GTX 1080 Ti was tested on another machine with CUDA 10, Ubuntu 18.04.
Two of these values don't add up:
Considering the Vega 64 natively supports half precision, and fp16 should be a good selling point for AMD Vega, how is it slower when using fp16? I guess this is probably due to software support, especially ROCm. Can anyone please test with --use_fp16 and see if you get similar results?
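For anyone wanting to reproduce this, the half-precision run is the same invocation as the fp32 one plus the --use_fp16 flag:
# fp32 baseline, then fp16 on the same model and batch size
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16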
@kazulittlefox my Vega runs vgg16 smoothly at ~105 images/sec
@fshi98 that might be because of https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/pull/143#issuecomment-415798399
@Mandrewoid Thanks. That may be the reason. However, my rocblas version is 0.14.3.0, and I tested //tensorflow/python/kernel_tests:batch_matmul_op_test and passed all 47 tests in 10.653s, as in #143. I also tested and passed https://github.com/ROCmSoftwarePlatform/rocBLAS/issues/340
This may not be the same bug as #143, but there may be some performance issue.
@sebpuetz Would you be willing to post some numbers for the Radeon VII, including fp16 performance? I have yet to find any cloud providers with these cards. Trying to get some info for #288.
Radeon VII
rocm==2.1.96 (installed through apt)
tensorflow==1.12 (installed through pip)
no further tuning
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step Img/sec total_loss
1 images/sec: 190.3 +/- 0.0 (jitter = 0.0) 8.217
10 images/sec: 195.7 +/- 0.9 (jitter = 3.1) 8.123
20 images/sec: 196.4 +/- 0.5 (jitter = 1.8) 8.231
30 images/sec: 196.8 +/- 0.4 (jitter = 1.1) 8.268
40 images/sec: 197.1 +/- 0.3 (jitter = 0.9) 8.355
50 images/sec: 197.2 +/- 0.2 (jitter = 0.8) 8.013
60 images/sec: 197.3 +/- 0.2 (jitter = 0.7) 8.263
70 images/sec: 196.8 +/- 0.3 (jitter = 1.1) 8.304
80 images/sec: 196.9 +/- 0.2 (jitter = 1.1) 8.228
90 images/sec: 196.9 +/- 0.2 (jitter = 0.9) 8.283
100 images/sec: 197.0 +/- 0.2 (jitter = 0.8) 8.271
----------------------------------------------------------------
total images/sec: 196.98
----------------------------------------------------------------
FP16:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 262.9 +/- 0.0 (jitter = 0.0) 8.162
10 images/sec: 261.9 +/- 0.6 (jitter = 0.7) 8.211
20 images/sec: 260.4 +/- 0.6 (jitter = 2.6) 8.375
30 images/sec: 260.6 +/- 0.5 (jitter = 2.6) 8.264
40 images/sec: 259.6 +/- 0.6 (jitter = 3.1) 8.116
50 images/sec: 259.6 +/- 0.5 (jitter = 3.1) 8.169
60 images/sec: 259.9 +/- 0.5 (jitter = 2.6) 8.325
70 images/sec: 259.3 +/- 0.5 (jitter = 3.5) 8.374
80 images/sec: 259.4 +/- 0.4 (jitter = 3.4) 8.041
90 images/sec: 259.3 +/- 0.4 (jitter = 3.6) 8.298
100 images/sec: 259.4 +/- 0.3 (jitter = 3.5) 8.376
----------------------------------------------------------------
total images/sec: 259.29
----------------------------------------------------------------
This one made the GPU sound like a jet engine:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step Img/sec total_loss
1 images/sec: 216.3 +/- 0.0 (jitter = 0.0) 8.219
10 images/sec: 215.9 +/- 0.3 (jitter = 0.3) 8.289
20 images/sec: 216.0 +/- 0.2 (jitter = 0.3) 8.064
30 images/sec: 215.9 +/- 0.1 (jitter = 0.3) 8.310
40 images/sec: 215.9 +/- 0.1 (jitter = 0.3) 8.197
50 images/sec: 215.9 +/- 0.1 (jitter = 0.3) 8.277
60 images/sec: 215.7 +/- 0.1 (jitter = 0.4) 8.162
70 images/sec: 215.7 +/- 0.1 (jitter = 0.4) 8.159
80 images/sec: 215.7 +/- 0.1 (jitter = 0.4) 8.139
90 images/sec: 215.7 +/- 0.1 (jitter = 0.4) 8.196
100 images/sec: 215.7 +/- 0.1 (jitter = 0.4) 8.163
----------------------------------------------------------------
total images/sec: 215.72
----------------------------------------------------------------
FP16:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 288.2 +/- 0.0 (jitter = 0.0) 8.209
10 images/sec: 283.8 +/- 1.1 (jitter = 2.7) 8.189
20 images/sec: 284.0 +/- 0.9 (jitter = 4.6) 8.316
30 images/sec: 284.9 +/- 0.7 (jitter = 4.5) 8.195
40 images/sec: 284.5 +/- 0.6 (jitter = 4.0) 8.180
50 images/sec: 284.3 +/- 0.5 (jitter = 3.7) 8.402
60 images/sec: 285.0 +/- 0.5 (jitter = 4.8) 8.271
70 images/sec: 285.4 +/- 0.4 (jitter = 3.7) 8.134
80 images/sec: 285.7 +/- 0.4 (jitter = 2.7) 8.299
90 images/sec: 286.0 +/- 0.4 (jitter = 1.5) 8.349
100 images/sec: 286.2 +/- 0.3 (jitter = 1.4) 8.213
----------------------------------------------------------------
total images/sec: 286.17
----------------------------------------------------------------
Hi @sebpuetz, maybe you can also try enabling fusion support :-) The doc is here: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/rocm-port-overview.md#fusion-support
Improvements across the board with TF_ROCM_FUSION_ENABLE=1. The temperature displayed in rocm-smi went above 90°C on all tests, but the rocm-smi output didn't include clocks, so I can't tell whether any thermal throttling was happening.
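If someone wants to check for throttling, one option is to poll rocm-smi from a second terminal during a run (a sketch; which clock fields are reported varies by rocm-smi version, so check rocm-smi -h):
# Refresh the rocm-smi status summary every second while benchmarking
watch -n 1 /opt/rocm/bin/rocm-smi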
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step Img/sec total_loss
1 images/sec: 208.4 +/- 0.0 (jitter = 0.0) 8.217
10 images/sec: 207.6 +/- 0.5 (jitter = 0.5) 8.124
20 images/sec: 207.7 +/- 0.3 (jitter = 0.5) 8.235
30 images/sec: 207.3 +/- 0.4 (jitter = 0.4) 8.268
40 images/sec: 207.2 +/- 0.4 (jitter = 0.4) 8.357
50 images/sec: 207.2 +/- 0.4 (jitter = 0.4) 8.012
60 images/sec: 207.2 +/- 0.3 (jitter = 0.4) 8.248
70 images/sec: 207.1 +/- 0.3 (jitter = 0.4) 8.305
80 images/sec: 207.0 +/- 0.3 (jitter = 0.5) 8.223
90 images/sec: 205.7 +/- 0.9 (jitter = 0.5) 8.322
100 images/sec: 205.7 +/- 0.8 (jitter = 0.5) 8.268
----------------------------------------------------------------
total images/sec: 205.65
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 273.0 +/- 0.0 (jitter = 0.0) 8.171
10 images/sec: 272.6 +/- 0.9 (jitter = 1.0) 8.223
20 images/sec: 271.5 +/- 1.1 (jitter = 0.9) 8.375
30 images/sec: 272.0 +/- 0.8 (jitter = 0.9) 8.282
40 images/sec: 272.1 +/- 0.6 (jitter = 0.9) 8.122
50 images/sec: 272.1 +/- 0.6 (jitter = 0.8) 8.144
60 images/sec: 272.0 +/- 0.5 (jitter = 0.8) 8.333
70 images/sec: 271.5 +/- 0.5 (jitter = 1.0) 8.357
80 images/sec: 271.2 +/- 0.5 (jitter = 1.3) 8.034
90 images/sec: 271.2 +/- 0.4 (jitter = 1.3) 8.289
100 images/sec: 270.9 +/- 0.4 (jitter = 1.5) 8.361
----------------------------------------------------------------
total images/sec: 270.81
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step Img/sec total_loss
1 images/sec: 227.7 +/- 0.0 (jitter = 0.0) 8.221
10 images/sec: 225.6 +/- 0.5 (jitter = 2.2) 8.289
20 images/sec: 225.5 +/- 0.4 (jitter = 1.9) 8.068
30 images/sec: 225.7 +/- 0.3 (jitter = 1.8) 8.304
40 images/sec: 225.4 +/- 0.5 (jitter = 1.2) 8.183
50 images/sec: 225.5 +/- 0.4 (jitter = 1.0) 8.261
60 images/sec: 225.6 +/- 0.4 (jitter = 1.1) 8.203
70 images/sec: 225.6 +/- 0.3 (jitter = 1.1) 8.165
80 images/sec: 225.6 +/- 0.3 (jitter = 1.0) 8.168
90 images/sec: 225.7 +/- 0.3 (jitter = 1.0) 8.196
100 images/sec: 225.6 +/- 0.2 (jitter = 1.1) 8.138
----------------------------------------------------------------
total images/sec: 225.62
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 302.0 +/- 0.0 (jitter = 0.0) 8.213
10 images/sec: 300.2 +/- 0.5 (jitter = 1.5) 8.181
20 images/sec: 298.7 +/- 0.8 (jitter = 2.5) 8.324
30 images/sec: 297.7 +/- 0.8 (jitter = 2.2) 8.197
40 images/sec: 297.7 +/- 0.6 (jitter = 3.0) 8.173
50 images/sec: 297.9 +/- 0.6 (jitter = 3.0) 8.400
60 images/sec: 297.9 +/- 0.5 (jitter = 3.0) 8.267
70 images/sec: 298.4 +/- 0.5 (jitter = 2.8) 8.140
80 images/sec: 298.6 +/- 0.4 (jitter = 2.7) 8.283
90 images/sec: 298.6 +/- 0.4 (jitter = 2.8) 8.337
100 images/sec: 298.7 +/- 0.4 (jitter = 2.6) 8.208
----------------------------------------------------------------
total images/sec: 298.60
----------------------------------------------------------------
Hi @sebpuetz, thanks for the update!
However, the performance numbers don't seem right.
Can you give me the VBIOS version of your board? The following command will do:
/opt/rocm/bin/rocm-smi -v
/opt/rocm/bin/rocm-smi -v
GPU[0] : VBIOS version: 113-D3600200-105
Radeon RX Vega 64
memoryClockRate (GHz) 1.63
Total memory: 7.98GiB
Free memory: 7.73GiB
rocm==2.1.96
installed through apt
tensorflow==1.12
installed through pip
For some models, setting TF_ROCM_FUSION_ENABLE=1 doesn't change much, so I'm not giving the FUSION=1 results for those. Due to lack of memory, some models can't run at batch_size=128.
images/sec | ResNet50 | AlexNet | Inception v3 | VGG16 | GoogLeNet | ResNet152
---|---|---|---|---|---|---
batch_size=512 | / | 1573.01 | / | / | / | /
batch_size=256 | / | 1420.65 | / | / | / | /
batch_size=128 | / | 1345.73 | / | / | 498.73 | /
batch_size=64 | 190.58 | 1151.98 | 103.82 | 101.95 | 474.07 | /
batch_size=32 | 171.70 | 971.85 | 98.50 | 91.80 | 424.32 | 68.71
batch_size=128; FUSION=1 | / | / | / | / | / | /
batch_size=64; FUSION=1 | 208.78 | / | 109.66 | / | / | /
batch_size=32; FUSION=1 | 187.76 | / | 105.20 | / | / | 75.81
Hi @sebpuetz, could you try refreshing your performance numbers using our official docker image?
If you haven't set up Docker yet, the following script should do:
curl -sSL https://get.docker.com/ | sh
To run the benchmarks inside the docker image:
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx -v /data/imagenet/tf:/imagenet'
drun rocm/tensorflow:rocm2.1-tf1.12-python3
cd ~/benchmarks/scripts/tf_cnn_benchmarks
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
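Before benchmarking, it may also be worth a quick sanity check that the GPU is visible to TensorFlow inside the container (a sketch using the TF 1.x device_lib API):
# Inside the container: this should list a GPU device alongside the CPU
python3 -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'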
Thanks for your attention, and looking forward to your updates :-)
6-core Intel i7 8700 with 16GB RAM and a 400GB SSD.
Radeon VII
rocm==2.1.96 (installed through apt)
tensorflow==1.12 (installed through pip)
no further tuning
TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
Step Img/sec total_loss
1 images/sec: 250.0 +/- 0.0 (jitter = 0.0) 8.348
10 images/sec: 248.0 +/- 1.4 (jitter = 0.7) 8.144
20 images/sec: 248.7 +/- 0.8 (jitter = 0.4) 8.440
30 images/sec: 248.8 +/- 0.6 (jitter = 0.4) 8.140
40 images/sec: 248.7 +/- 0.6 (jitter = 0.4) 8.474
50 images/sec: 248.5 +/- 0.5 (jitter = 0.4) 8.322
60 images/sec: 248.5 +/- 0.5 (jitter = 0.5) 8.317
70 images/sec: 248.5 +/- 0.4 (jitter = 0.6) 8.010
80 images/sec: 248.4 +/- 0.4 (jitter = 0.6) 8.272
90 images/sec: 248.5 +/- 0.4 (jitter = 0.6) 8.289
100 images/sec: 248.4 +/- 0.3 (jitter = 0.6) 8.108
total images/sec: 248.34
TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step Img/sec total_loss
1 images/sec: 265.1 +/- 0.0 (jitter = 0.0) 8.324
10 images/sec: 264.3 +/- 0.5 (jitter = 0.3) 8.168
20 images/sec: 264.5 +/- 0.3 (jitter = 0.2) 8.261
30 images/sec: 264.4 +/- 0.3 (jitter = 0.3) 8.377
40 images/sec: 264.2 +/- 0.2 (jitter = 0.4) 8.408
50 images/sec: 264.1 +/- 0.2 (jitter = 0.5) 8.160
60 images/sec: 263.9 +/- 0.2 (jitter = 0.6) 8.341
70 images/sec: 263.8 +/- 0.2 (jitter = 0.6) 8.107
80 images/sec: 263.8 +/- 0.2 (jitter = 0.8) 8.404
90 images/sec: 263.8 +/- 0.2 (jitter = 0.7) 8.296
100 images/sec: 263.7 +/- 0.2 (jitter = 0.6) 8.348
total images/sec: 263.65
With a batch size of 256, I get out-of-memory errors. Funnily enough, a batch size of 155 works, but it is slower.
TC_ROCM_FUSION_ENABLE=1 python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=155 --model=resnet50
Step Img/sec total_loss
1 images/sec: 195.3 +/- 0.0 (jitter = 0.0) 8.394
10 images/sec: 194.6 +/- 0.7 (jitter = 0.6) 8.313
20 images/sec: 194.5 +/- 0.5 (jitter = 0.6) 8.154
30 images/sec: 194.4 +/- 0.3 (jitter = 0.7) 8.249
40 images/sec: 194.5 +/- 0.3 (jitter = 0.8) 8.165
50 images/sec: 194.4 +/- 0.2 (jitter = 1.0) 8.292
60 images/sec: 194.3 +/- 0.2 (jitter = 1.0) 8.340
70 images/sec: 194.3 +/- 0.2 (jitter = 0.9) 8.268
80 images/sec: 194.2 +/- 0.2 (jitter = 0.8) 8.227
90 images/sec: 194.2 +/- 0.2 (jitter = 0.8) 8.257
100 images/sec: 194.1 +/- 0.2 (jitter = 0.9) 8.183
total images/sec: 194.04
Leaving out TC_ROCM_FUSION_ENABLE does not make any difference.
/opt/rocm/bin/rocm-smi -v
VBIOS version: 113-D3600200-105
According to this blog, https://www.pugetsystems.com/labs/hpc/NVIDIA-RTX-2080-Ti-vs-2080-vs-1080-Ti-vs-Titan-V-TensorFlow-Performance-with-CUDA-10-0-1247/, the 2080Ti gets 280 images/sec and the 1080Ti gets 207 images/sec for FP32 training.
One more:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 377.7 +/- 0.0 (jitter = 0.0) 8.246
10 images/sec: 375.9 +/- 2.2 (jitter = 0.7) 8.261
20 images/sec: 377.9 +/- 1.2 (jitter = 0.9) 8.279
30 images/sec: 378.3 +/- 0.9 (jitter = 0.9) 8.365
40 images/sec: 378.2 +/- 0.7 (jitter = 0.5) 8.237
50 images/sec: 378.3 +/- 0.6 (jitter = 0.4) 8.295
60 images/sec: 378.4 +/- 0.5 (jitter = 0.4) 8.203
70 images/sec: 378.4 +/- 0.5 (jitter = 0.5) 8.129
80 images/sec: 377.9 +/- 0.6 (jitter = 0.6) 8.264
90 images/sec: 378.0 +/- 0.5 (jitter = 0.8) 8.163
100 images/sec: 377.9 +/- 0.5 (jitter = 0.8) 8.239
total images/sec: 377.79
@jimdowling that's some impressive perf!
@jimdowling these numbers seem substantially higher than the ones I got, what OS and kernel are you on?
Hi, I executed the benchmarks in the docker container:
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step Img/sec total_loss
1 images/sec: 229.7 +/- 0.0 (jitter = 0.0) 8.221
10 images/sec: 225.4 +/- 0.8 (jitter = 2.7) 8.289
20 images/sec: 225.9 +/- 0.5 (jitter = 3.6) 8.054
30 images/sec: 226.6 +/- 0.4 (jitter = 2.1) 8.313
40 images/sec: 226.9 +/- 0.3 (jitter = 0.8) 8.187
50 images/sec: 227.2 +/- 0.3 (jitter = 0.7) 8.240
60 images/sec: 227.3 +/- 0.2 (jitter = 0.5) 8.192
70 images/sec: 227.4 +/- 0.2 (jitter = 0.5) 8.143
80 images/sec: 227.6 +/- 0.2 (jitter = 0.5) 8.150
90 images/sec: 227.6 +/- 0.2 (jitter = 0.5) 8.217
100 images/sec: 227.7 +/- 0.2 (jitter = 0.5) 8.163
----------------------------------------------------------------
total images/sec: 227.66
----------------------------------------------------------------
and
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 300.8 +/- 0.0 (jitter = 0.0) 8.205
10 images/sec: 300.3 +/- 0.4 (jitter = 0.2) 8.170
20 images/sec: 300.3 +/- 0.3 (jitter = 0.5) 8.317
30 images/sec: 300.5 +/- 0.2 (jitter = 0.6) 8.201
40 images/sec: 300.6 +/- 0.2 (jitter = 0.5) 8.176
50 images/sec: 300.5 +/- 0.2 (jitter = 0.5) 8.398
60 images/sec: 300.3 +/- 0.2 (jitter = 0.5) 8.268
70 images/sec: 300.3 +/- 0.2 (jitter = 0.6) 8.140
80 images/sec: 300.4 +/- 0.2 (jitter = 0.6) 8.279
90 images/sec: 300.4 +/- 0.2 (jitter = 0.6) 8.328
100 images/sec: 300.3 +/- 0.2 (jitter = 0.6) 8.214
----------------------------------------------------------------
total images/sec: 300.29
----------------------------------------------------------------
@sunway513 these numbers are still pretty far away from what @jimdowling got; do you see a reason why this might happen?
Ubuntu 18.04. Python 2.7. Kernel is 4.15. I was not running Docker - bare metal.
Hi @jimdowling, thanks for posting! However, it seems there's a typo in your command (TC_ROCM_FUSION_ENABLE instead of TF_ROCM_FUSION_ENABLE), so TF fusion is not actually enabled there. Could you try the following commands again?
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
If fusion is enabled, you should see the following message at runtime:
2019-02-21 13:41:32.304325: I tensorflow/core/graph/gpu_fusion_pass.cc:454] ROCm Fusion is enabled.
Hi @sebpuetz, thanks for your updated numbers with docker!
In a parallel issue, you mentioned your system is Linux Mint 19.1; is that the same OS you ran the benchmarks on? May I know the kernel and driver versions of your configuration? The following commands will help:
uname -a
apt list --installed | grep rock-dkms
I believe your user-space components were properly configured, as you got similar perf numbers using our official docker image. The VBIOS version is good as well. We need to look into kernels and firmware.
Hi @sunway513, I ran all benchmarks on Linux Mint 19.1.
uname -a
Linux seb-desktop 4.20.7-042007-generic #201902061234 SMP Wed Feb 6 17:36:40 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
apt list --installed | grep rock-dkms
rock-dkms/Ubuntu 16.04,now 2.1-96 all [installed]
Linux Mint 19.1 is based on Ubuntu 18.04, so it looks like there's a mismatch here?
@sunway513 I am also using an RX Vega 64, but I get the following warning:
2019-02-21 14:26:23.732074: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:27.702436: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:29.084753: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
warning: <unknown>:0:0: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
2019-02-21 14:26:33.818470: I tensorflow/core/kernels/conv_grad_input_ops.cc:1023] running auto-tune for Backward-Data
2019-02-21 14:26:33.839322: I tensorflow/core/kernels/conv_grad_filter_ops.cc:975] running auto-tune for Backward-Filter
And performance is ~10% lower than others' benchmarks:
Step Img/sec total_loss
1 images/sec: 182.8 +/- 0.0 (jitter = 0.0) 8.217
10 images/sec: 187.2 +/- 0.9 (jitter = 0.7) 8.122
20 images/sec: 187.3 +/- 0.5 (jitter = 0.7) 8.229
30 images/sec: 187.1 +/- 0.4 (jitter = 0.9) 8.264
40 images/sec: 187.0 +/- 0.4 (jitter = 0.9) 8.347
50 images/sec: 187.0 +/- 0.3 (jitter = 1.1) 8.014
60 images/sec: 187.0 +/- 0.3 (jitter = 1.0) 8.264
70 images/sec: 186.8 +/- 0.3 (jitter = 1.1) 8.316
80 images/sec: 186.7 +/- 0.3 (jitter = 1.1) 8.231
90 images/sec: 186.7 +/- 0.2 (jitter = 1.2) 8.305
But it should be around 207 images/sec. Is this caused by the warning above, and how can I fix the performance?
Hi @sebpuetz, yes, there should be a mismatch here. Theoretically, you don't need to install the rock-dkms driver if your system is on a 4.20 kernel; you can refer to the following doc: https://github.com/RadeonOpenCompute/ROCm#rocm-support-in-upstream-linux-kernels The behavior can be "undefined" if you install the rock-dkms driver on a 4.20 kernel, as that's not in our test coverage. Can you open another GitHub issue so we can track it there?
@WrightChen @sebpuetz Can you share the linux kernel and rocm-dkms version? I am using rocm-dkms==2.0.89 and linux-kernel=4.15.
Hi @ghostplant, the warning is normal in ROCm 2.1 and comes from the upstream LLVM compiler. Performance shouldn't be affected by it. To triage your perf issue, could you run your benchmarks inside the docker container and provide us the kernel/rock-dkms versions on your bare-metal system?
@sunway513 the docker image I am using is rocm/tensorflow:rocm2.1-tf1.12-python3, but the rocm-dkms outside docker is 2.0.89.
@ghostplant, missed your last message ;-)
Can you post the output of the following command:
cat /sys/module/amdgpu/parameters/noretry
@sunway513 Yeah, it is:
root@amdgpu:~# cat /sys/module/amdgpu/parameters/noretry
0
@ghostplant, thanks, that's the issue. In ROCm 2.0 the kernel driver disabled the noretry bit by default, which can impact performance on gfx900/gfx906; the issue is fixed in ROCm 2.1.
To enable the noretry bit, you can try the following commands:
sudo -s
echo 1 > /sys/module/amdgpu/parameters/noretry
exit
I'd suggest you upgrade to ROCm 2.1 to permanently resolve it.
For ROCm 2.0, the following commands enable noretry as a default kernel option:
sudo sh -c 'echo "options amdgpu noretry=1" >> /etc/modprobe.d/amdgpu.conf'
sudo update-initramfs -u
sudo reboot
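After the reboot, you can confirm the setting took effect with the same check as above; it should now print 1:
cat /sys/module/amdgpu/parameters/noretry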
@sunway513 Yeah, it does get a slight boost (around 5%) in the fusion-enabled case. With fusion enabled, the performance of resnet50 rises from 187.0 to 202.0 images/sec:
40 images/sec: 202.0 +/- 0.3 (jitter = 0.9) 8.379
50 images/sec: 202.0 +/- 0.3 (jitter = 1.0) 8.007
60 images/sec: 201.9 +/- 0.3 (jitter = 1.0) 8.285
If I disable fusion, the performance of resnet50 is still bad:
40 images/sec: 181.7 +/- 0.3 (jitter = 0.8) 8.359
50 images/sec: 181.7 +/- 0.3 (jitter = 0.9) 7.994
60 images/sec: 181.6 +/- 0.3 (jitter = 0.9) 8.268
70 images/sec: 181.5 +/- 0.2 (jitter = 0.8) 8.291
I wonder how to reach the benchmarks of 208 images/sec or even 226.6 images/sec?
Any other possibilities? I have updated rocm-dkms to 2.1 and fully rebooted the system, and noretry=1 is also set.
Thanks!
@sunway513 Do you think I should upgrade llvm and compile a native tensorflow version?
I saw that kernel fusion makes the NASNet model very slow:
TF_ROCM_FUSION_ENABLE=1 python3 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=nasnet --forward_only
while disabling fusion is far better:
unset TF_ROCM_FUSION_ENABLE
python3 benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=nasnet --forward_only
Hi @ghostplant, great to know the perf improved a bit with noretry turned on :-)
The following factors can affect the performance:
Rebuilding tensorflow from source could help as well; the following discussion may be related: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/commit/cd33a983227f33cdf30cff8036227afa641738a2
@ghostplant, yes, nasnet is known to be slower with fusion enabled; we've been looking into optimizations for the nasnet model.
@sunway513 Thanks! I tried the second one by killing the Xorg graphics service, which didn't make any difference, but I'll try the last one, which seems like it should work.
@sebpuetz if you are running kernel 4.20, then I suspect that you are not running our rock-dkms driver. You can have the package installed (which you've checked), but that does not mean the driver is actually running. As far as I know, the rock-dkms kernel module does not compile on 4.20.
If you run dkms status, I suspect you will see that amdgpu is not listed as "installed", but rather something like "added". If that's the case, then you're using the 4.20 upstream driver rather than the ROCm 2.1 DKMS driver. This could be the source of your performance differences.
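For example (a sketch; the version and kernel strings here are only illustrative, and the status field at the end is what matters):
dkms status
# Healthy case: the module is built and installed for your kernel
#   amdgpu, 2.1-96, 4.18.0-15-generic, x86_64: installed
# Problem case: the package is present but the module never built
#   amdgpu, 2.1-96: added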
@jlgreathouse thanks for the explanation; the rock-dkms driver indeed fails to build on 4.20 when installing through apt. I am going to switch from Linux Mint to Ubuntu 18.04, which should be supported without having to use an upstream kernel. I'll check back with new benchmark scores once I have everything up and running.
@sunway513 I reran the benchmarks in the docker container on Ubuntu 18.04 with kernel 4.18; the numbers are much better now:
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Step Img/sec total_loss
1 images/sec: 294.4 +/- 0.0 (jitter = 0.0) 8.220
10 images/sec: 294.3 +/- 0.4 (jitter = 0.5) 8.282
20 images/sec: 294.3 +/- 0.2 (jitter = 0.5) 8.041
30 images/sec: 294.4 +/- 0.2 (jitter = 0.4) 8.321
40 images/sec: 294.3 +/- 0.2 (jitter = 0.5) 8.180
50 images/sec: 294.3 +/- 0.1 (jitter = 0.4) 8.237
60 images/sec: 294.3 +/- 0.1 (jitter = 0.4) 8.183
70 images/sec: 294.2 +/- 0.1 (jitter = 0.5) 8.155
80 images/sec: 294.1 +/- 0.1 (jitter = 0.4) 8.149
90 images/sec: 294.1 +/- 0.1 (jitter = 0.4) 8.209
100 images/sec: 294.1 +/- 0.1 (jitter = 0.4) 8.161
----------------------------------------------------------------
total images/sec: 294.02
----------------------------------------------------------------
and FP16:
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 414.6 +/- 0.0 (jitter = 0.0) 8.208
10 images/sec: 414.3 +/- 1.4 (jitter = 1.9) 8.187
20 images/sec: 415.3 +/- 0.8 (jitter = 1.2) 8.316
30 images/sec: 413.6 +/- 0.9 (jitter = 3.4) 8.176
40 images/sec: 413.3 +/- 0.7 (jitter = 3.6) 8.179
50 images/sec: 413.8 +/- 0.6 (jitter = 3.1) 8.379
60 images/sec: 414.3 +/- 0.6 (jitter = 2.0) 8.283
70 images/sec: 414.6 +/- 0.5 (jitter = 1.4) 8.128
80 images/sec: 414.8 +/- 0.5 (jitter = 1.1) 8.286
90 images/sec: 415.0 +/- 0.4 (jitter = 1.0) 8.345
100 images/sec: 415.1 +/- 0.4 (jitter = 0.9) 8.210
----------------------------------------------------------------
total images/sec: 415.03
----------------------------------------------------------------
@ghostplant linux kernel = Linux version 4.15.0-45-generic (buildd@lgw01-amd64-031) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019
rocm-dkms version = 2.1.96
Hi! Could you guys try some RNNs or reinforcement learning to see how well ROCm performs on those workloads? Here is my attempt.
The commands are:
pip install -q -U tensor2tensor
t2t-datagen --problem=squad --data_dir=~/t2t_data --tmp_dir=~/t2t_data/tmp
t2t-trainer --data_dir=~/t2t_data --problem=squad --model=universal_transformer --hparams_set=universal_transformer_base --output_dir=~/output
And my results are:
Under GTX1060 about 4.1 global steps / sec
Under RX480 about 2.2 global steps / sec
Under K80 in Colab about 3.1 global steps / sec
The K80 was running in Colab, so that number may be inaccurate.
@geekboood Is there a benchmark script that uses synthetic data as input for RNNs?
@ghostplant I think TensorFlow must provide such a benchmark script, but my attempt is a real scenario you might encounter when you want to try something new.
@geekboood RX580 8GB inside Docker (Linux 4.20.12-arch1-1-ARCH) (Upstream rock driver)
First 5 global_step/sec reported:
2.51917, 2.72199, 2.68408, 2.68637, 2.66961
Mean: 2.656244 - Mean discarding first: 2.6905125
Running with TF_ROCM_FUSION_ENABLE=1
First 5 global_step/sec reported:
2.53731, 2.73576, 2.76201, 2.70581, 2.71649
Mean: 2.691476 - Mean discarding first: 2.7300175
If I want to build a computer for running TensorFlow, what kind of card would be the most economical choice (Radeon VII or NVIDIA)?
It would be very useful to compare real training performance on AMD and NVIDIA cards. For NVIDIA cards we have a lot of graphs and tests, for example: https://github.com/u39kun/deep-learning-benchmark But for AMD cards there are no such performance metrics. It would be great to make a direct comparison between AMD and NVIDIA with the latest cuDNN.