ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Dramatic difference in perf between 1080ti and VEGA FE #330

Closed PhilipDeegan closed 5 years ago

PhilipDeegan commented 5 years ago

Hey there,

I'm trialing some code to benchmark my Vega FE against a colleague's 1080 Ti.

I've noticed some very peculiar differences in fit time per epoch; I'm guessing I'm messing something up.

For one epoch on the AMD 2990WX (CPU only): ~400 seconds

For one epoch on the 1080 Ti: < 100 seconds

For one epoch on the Vega FE: > 40 minutes

sunway513 commented 5 years ago

Hi @Dekken , could you help fill the issue template so we can understand the issue better? https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/.github/ISSUE_TEMPLATE/00-bug-performance-issue.md

PhilipDeegan commented 5 years ago

Have I written custom code: yes
OS Platform and Distribution: Linux Ubuntu 18.10
TensorFlow installed from: source
TensorFlow version: ROCm dev
Python version: 3.7
Bazel version: 0.21
GCC/Compiler version: 8.2
ROCm/MIOpen version: latest from the xenial repo
GPU model and memory: Vega FE 16GB

== cat /etc/issue ===============================================
Linux ws156 4.18.0-15-generic #16-Ubuntu SMP Thu Feb 7 10:56:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
VERSION="18.10 (Cosmic Cuttlefish)"
VERSION_ID="18.10"
VERSION_CODENAME=cosmic

== are we in docker =============================================
No

== compiler =====================================================
c++ (Ubuntu 8.2.0-7ubuntu1) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

== uname -a =====================================================
Linux ws156 4.18.0-15-generic #16-Ubuntu SMP Thu Feb 7 10:56:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

== check pips ===================================================
numpy                    1.16.1
numpydoc                 0.8.0
protobuf                 3.6.1
tensorflow               1.12.0
tensorflow-estimator     1.13.0rc0

== check for virtualenv =========================================
False

== tensorflow import ============================================
Limited tf.compat.v2.summary API due to missing TensorBoard installation
tf.VERSION = 1.12.0
tf.GIT_VERSION = merge-190218-2-gd5532be4
tf.COMPILER_VERSION = merge-190218-2-gd5532be4
Sanity check: array([1], dtype=int32)

== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset

== nvidia-smi ===================================================
./sys.sh: line 105: nvidia-smi: command not found

== cuda libs  ===================================================
sunway513 commented 5 years ago

Thanks @Dekken , first I'd suggest trying our docker image; the goal is to isolate the issue to either user-space configuration or kernel drivers/firmware. Can you refer to the following comment and get us the resnet50 perf results? https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173#issuecomment-465871004

PhilipDeegan commented 5 years ago

I'd rather not install docker, if that's OK.

TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50

Done warm up
Step    Img/sec total_loss
1       images/sec: 207.5 +/- 0.0 (jitter = 0.0)        7.972
10      images/sec: 207.0 +/- 0.2 (jitter = 1.1)        7.856
20      images/sec: 206.8 +/- 0.2 (jitter = 0.5)        7.914
30      images/sec: 206.8 +/- 0.2 (jitter = 0.5)        7.734
40      images/sec: 206.8 +/- 0.1 (jitter = 0.6)        7.969
50      images/sec: 206.7 +/- 0.1 (jitter = 0.5)        8.025
60      images/sec: 206.7 +/- 0.1 (jitter = 0.6)        7.896
70      images/sec: 206.8 +/- 0.1 (jitter = 0.6)        7.991
80      images/sec: 206.8 +/- 0.1 (jitter = 0.6)        7.816
90      images/sec: 206.8 +/- 0.1 (jitter = 0.5)        7.799
100     images/sec: 206.7 +/- 0.1 (jitter = 0.6)        7.821
----------------------------------------------------------------
total images/sec: 206.71
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step    Img/sec total_loss
1       images/sec: 198.6 +/- 0.0 (jitter = 0.0)        7.878
10      images/sec: 198.8 +/- 0.3 (jitter = 1.1)        7.958
20      images/sec: 199.3 +/- 0.2 (jitter = 0.6)        7.949
30      images/sec: 199.3 +/- 0.2 (jitter = 0.6)        7.948
40      images/sec: 199.4 +/- 0.1 (jitter = 0.7)        7.967
50      images/sec: 199.4 +/- 0.1 (jitter = 0.7)        7.706
60      images/sec: 199.4 +/- 0.1 (jitter = 0.7)        7.923
70      images/sec: 199.4 +/- 0.1 (jitter = 0.6)        7.849
80      images/sec: 199.4 +/- 0.1 (jitter = 0.5)        7.969
90      images/sec: 199.3 +/- 0.1 (jitter = 0.5)        7.810
100     images/sec: 199.3 +/- 0.1 (jitter = 0.6)        7.785
----------------------------------------------------------------
total images/sec: 199.25
----------------------------------------------------------------
sunway513 commented 5 years ago

Hmm, the perf numbers look normal for a Vega FE workstation card. Can you share scripts/steps to repro your perf issue?

PhilipDeegan commented 5 years ago

My test case is about 30 GB; I'll see if I can minimize it. It's not my code.

PhilipDeegan commented 5 years ago

For the sake of completeness, there's a 6 GB archive available here: https://hephaistos.lpp.polytechnique.fr/data/jeandet/dekken.tgz

Extract and run test.py.

CPU-only runs hit segfaults, I've found; none are reported with regular TF.

CPU only with 2990WX

sunway513 commented 5 years ago

Thanks @Dekken , let me try it out.

PhilipDeegan commented 5 years ago

Mine, with an i7-6950X:

CUDA_VISIBLE_DEVICES="" python3 test.py
Epoch 1/10
/home/philix/.local/lib/python3.7/site-packages/keras/engine/training_utils.py:481: UserWarning: Found both `sample_weight` and `class_weight`: `class_weight` argument will be ignored.
  warnings.warn('Found both `sample_weight` and `class_weight`: '
2019-02-21 22:28:37.008764: E tensorflow/stream_executor/rocm/rocm_driver.cc:1020] could not retrieve ROCM device count: HIP_ERROR_NoDevice
10229/10229 [==============================] - 205s 20ms/step - loss: 0.0491 - val_loss: 0.0161
Epoch 2/10
10229/10229 [==============================] - 204s 20ms/step - loss: 0.0383 - val_loss: 0.0141
Epoch 3/10
10229/10229 [==============================] - 204s 20ms/step - loss: 0.0348 - val_loss: 0.0137
Epoch 4/10
10229/10229 [==============================] - 203s 20ms/step - loss: 0.0330 - val_loss: 0.0123
Epoch 5/10
10229/10229 [==============================] - 203s 20ms/step - loss: 0.0317 - val_loss: 0.0118
Epoch 6/10
 7636/10229 [=====================>........] - ETA: 49s - loss: 0.0307Segmentation fault (core dumped)
sunway513 commented 5 years ago

@Dekken , can you post the log without CUDA_VISIBLE_DEVICES=""? That environment variable works similarly to HIP_VISIBLE_DEVICES; with it set to an empty string, tensorflow-rocm won't be able to see or use any GPUs.
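
(A minimal sketch, not from the original thread: to double-check which devices TensorFlow actually sees, one can list them from Python. This assumes the tensorflow-rocm 1.12 build shown in the system report above.)

# List the devices TensorFlow can see, to confirm whether the GPUs are
# masked by CUDA_VISIBLE_DEVICES / HIP_VISIBLE_DEVICES.
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type, dev.physical_device_desc)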

PhilipDeegan commented 5 years ago

The normal run is the one that takes 40+ minutes per epoch, so I'd rather not let it run to completion:

2019-02-21 23:01:58.296943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-21 23:01:58.296970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 23:01:58.296978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090]      0 1
2019-02-21 23:01:58.296986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0:   N N
2019-02-21 23:01:58.296993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1:   N N
2019-02-21 23:01:58.297052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 10 XTX [Radeon Vega Frontier Edition], pci bus id: 0000:03:00.0)
2019-02-21 23:01:58.314591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15306 MB memory) -> physical GPU (device: 1, name: Vega 10 XTX [Radeon Vega Frontier Edition], pci bus id: 0000:0c:00.0)
2019-02-21 23:02:03.002830: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-02-21 23:02:03.078605: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
2019-02-21 23:02:03.164291: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-02-21 23:02:03.262944: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
2019-02-21 23:02:03.310904: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
  222/10229 [..............................] - ETA: 40:30 - loss: 0.0836
sunway513 commented 5 years ago

Okay, I can repro your issue now. It looks like the GPUs were under very light load during training. The next step is to profile the workload on ROCm and the other targets and look into the device placement of each op; please stay tuned.
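
(A minimal sketch, not part of the original comment: device placement in TF 1.x can be logged by creating the session with log_device_placement=True. The keras import assumes the standalone keras package seen in the traceback above; test.py's actual session setup may differ.)

import tensorflow as tf
from keras import backend as K

# Log every op's device assignment (GPU:0 / GPU:1 / CPU:0) to stderr,
# then build and fit the model as test.py normally would.
config = tf.ConfigProto(log_device_placement=True)
K.set_session(tf.Session(config=config))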

PhilipDeegan commented 5 years ago

Did you confirm the CPU segfault, by any chance?

sunway513 commented 5 years ago

Hi @Dekken , the CPU segfault was not observed on my local system. While we look into the issue, can you share a bit of background on the case? E.g. is it based on a public project? Or are there similar public models we could try as well?

PhilipDeegan commented 5 years ago

It's not my project so I'm going to see if I can get a colleague in.

PhilipDeegan commented 5 years ago

pinging @gautiernguyen @nicolasaunai

PhilipDeegan commented 5 years ago

I think my segfault was due to OOM.

PhilipDeegan commented 5 years ago

My colleagues are currently on vacation, but this is part of a PhD in the plasma physics department on "Earth's magnetopause detection from in situ satellite measurements with ML".

not that I know what this means

parallelo commented 5 years ago

Made a bit of progress this week. It turns out we were using non-optimal reduction code due to some stale ROCm-related ifdefs. We have local changes that show significant improvements.

Here's where the (initial) bottleneck occurs:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/tensorflow/core/kernels/reduction_gpu_kernels.cu.h#L713-L736

Btw - thanks for reporting this issue.
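
(A minimal sketch, not from the thread: a bare reduction can be timed in isolation to check whether the reduction kernels linked above are the bottleneck on a given card. The tensor shape and iteration count are arbitrary.)

import time
import numpy as np
import tensorflow as tf

# Reduce a [128, 1024, 1024] float32 tensor over its last two axes,
# which is handled by TF's GPU reduction kernels.
x = tf.placeholder(tf.float32, shape=[128, 1024, 1024])
y = tf.reduce_sum(x, axis=[1, 2])

data = np.random.rand(128, 1024, 1024).astype(np.float32)
with tf.Session() as sess:
    sess.run(y, feed_dict={x: data})  # warm-up / autotune pass
    start = time.time()
    for _ in range(50):
        sess.run(y, feed_dict={x: data})
    print("mean reduce_sum time: %.2f ms" % ((time.time() - start) / 50 * 1e3))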

whchung commented 5 years ago

@parallelo can't wait for your PR :)

parallelo commented 5 years ago

@whchung

Currently wrapping up some mods. It is running now, and it appears to be converging (as best I can tell).

There's always room for more improvement, but the initial reduction kernel bottleneck appears to be fixed :-)

PhilipDeegan commented 5 years ago

With #344 merged, should we be good?

If I compile "develop-upstream" myself, I should be able to confirm?

parallelo commented 5 years ago

Hi @Dekken - Worth another try now. Please check and report back here.

PhilipDeegan commented 5 years ago

There's certainly a difference.

We're at 300 seconds per epoch now.

PhilipDeegan commented 5 years ago

Should we, in theory, be on par with (or better than) a 1080 Ti with a Vega FE?

sunway513 commented 5 years ago

Hi @Dekken , the TF-ROCm 1.14 release includes PR #344. Can you try it there and let us know if it's okay to close this issue?

PhilipDeegan commented 5 years ago

Is there somewhere to continue assessing the performance vs a 1080 Ti?

sunway513 commented 5 years ago

@Dekken , one path is to profile your workload using RPT: https://scchan.github.io/hcc/md__home_scchan_code_hcc_doc_markdown_hcc_profile.html and then cross-compare with the nvprof results from the 1080 Ti. With that, we might be able to figure out whether there's any other avenue or low-hanging fruit to improve the performance.
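
(A minimal sketch, not from the thread: short of a full profiler run, a hypothetical Keras callback like EpochTimer below records wall-clock time per epoch, which makes the Vega FE vs 1080 Ti comparison easy to tabulate.)

import time
from keras.callbacks import Callback

class EpochTimer(Callback):
    """Print wall-clock seconds spent in each epoch."""
    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        print("epoch %d took %.1f s" % (epoch, time.time() - self._start))

# e.g. model.fit(x, y, epochs=10, callbacks=[EpochTimer()])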