Hi @Dekken, could you help fill in the issue template so we can understand the issue better? https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/.github/ISSUE_TEMPLATE/00-bug-performance-issue.md
Have I written custom code: yes
OS Platform and Distribution: Linux Ubuntu 18
TensorFlow installed from: source
TensorFlow version: ROCm dev
Python version: 3.7
Bazel version: 0.21
GCC/Compiler version: 8.2
ROCm/MIOpen version: latest from xenial repo
GPU model and memory: Vega FE 16GB
== cat /etc/issue ===============================================
Linux ws156 4.18.0-15-generic #16-Ubuntu SMP Thu Feb 7 10:56:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
VERSION="18.10 (Cosmic Cuttlefish)"
VERSION_ID="18.10"
VERSION_CODENAME=cosmic
== are we in docker =============================================
No
== compiler =====================================================
c++ (Ubuntu 8.2.0-7ubuntu1) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
== uname -a =====================================================
Linux ws156 4.18.0-15-generic #16-Ubuntu SMP Thu Feb 7 10:56:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
== check pips ===================================================
numpy 1.16.1
numpydoc 0.8.0
protobuf 3.6.1
tensorflow 1.12.0
tensorflow-estimator 1.13.0rc0
== check for virtualenv =========================================
False
== tensorflow import ============================================
Limited tf.compat.v2.summary API due to missing TensorBoard installation
tf.VERSION = 1.12.0
tf.GIT_VERSION = merge-190218-2-gd5532be4
tf.COMPILER_VERSION = merge-190218-2-gd5532be4
Sanity check: array([1], dtype=int32)
== env ==========================================================
LD_LIBRARY_PATH is unset
DYLD_LIBRARY_PATH is unset
== nvidia-smi ===================================================
./sys.sh: line 105: nvidia-smi: command not found
== cuda libs ===================================================
Thanks @Dekken. First, I'd suggest trying our docker image; the goal is to isolate the issue to either user-space configuration or kernel drivers/firmware. Can you refer to the following comment and get us the perf results of resnet50? https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173#issuecomment-465871004
I'd rather not install docker, if that's OK.
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
Done warm up
Step Img/sec total_loss
1 images/sec: 207.5 +/- 0.0 (jitter = 0.0) 7.972
10 images/sec: 207.0 +/- 0.2 (jitter = 1.1) 7.856
20 images/sec: 206.8 +/- 0.2 (jitter = 0.5) 7.914
30 images/sec: 206.8 +/- 0.2 (jitter = 0.5) 7.734
40 images/sec: 206.8 +/- 0.1 (jitter = 0.6) 7.969
50 images/sec: 206.7 +/- 0.1 (jitter = 0.5) 8.025
60 images/sec: 206.7 +/- 0.1 (jitter = 0.6) 7.896
70 images/sec: 206.8 +/- 0.1 (jitter = 0.6) 7.991
80 images/sec: 206.8 +/- 0.1 (jitter = 0.6) 7.816
90 images/sec: 206.8 +/- 0.1 (jitter = 0.5) 7.799
100 images/sec: 206.7 +/- 0.1 (jitter = 0.6) 7.821
----------------------------------------------------------------
total images/sec: 206.71
----------------------------------------------------------------
TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50 --use_fp16
Step Img/sec total_loss
1 images/sec: 198.6 +/- 0.0 (jitter = 0.0) 7.878
10 images/sec: 198.8 +/- 0.3 (jitter = 1.1) 7.958
20 images/sec: 199.3 +/- 0.2 (jitter = 0.6) 7.949
30 images/sec: 199.3 +/- 0.2 (jitter = 0.6) 7.948
40 images/sec: 199.4 +/- 0.1 (jitter = 0.7) 7.967
50 images/sec: 199.4 +/- 0.1 (jitter = 0.7) 7.706
60 images/sec: 199.4 +/- 0.1 (jitter = 0.7) 7.923
70 images/sec: 199.4 +/- 0.1 (jitter = 0.6) 7.849
80 images/sec: 199.4 +/- 0.1 (jitter = 0.5) 7.969
90 images/sec: 199.3 +/- 0.1 (jitter = 0.5) 7.810
100 images/sec: 199.3 +/- 0.1 (jitter = 0.6) 7.785
----------------------------------------------------------------
total images/sec: 199.25
----------------------------------------------------------------
Hmm, the perf numbers look normal for a Vega FE workstation card. Can we have scripts/steps to reproduce your perf issue?
My test case is about 30GB; I'll see if I can minimize it. It's not mine originally.
For the sake of completeness, there's a 6GB archive available here: https://hephaistos.lpp.polytechnique.fr/data/jeandet/dekken.tgz
Extract it and run test.py.
CPU-only runs hit segfaults, I have found; none reported with regular TF.
Thanks @Dekken , let me try it out.
My CPU-only run below was on an i7-6950X:
CUDA_VISIBLE_DEVICES="" python3 test.py
Epoch 1/10
/home/philix/.local/lib/python3.7/site-packages/keras/engine/training_utils.py:481: UserWarning: Found both `sample_weight` and `class_weight`: `class_weight` argument will be ignored.
warnings.warn('Found both `sample_weight` and `class_weight`: '
2019-02-21 22:28:37.008764: E tensorflow/stream_executor/rocm/rocm_driver.cc:1020] could not retrieve ROCM device count: HIP_ERROR_NoDevice
10229/10229 [==============================] - 205s 20ms/step - loss: 0.0491 - val_loss: 0.0161
Epoch 2/10
10229/10229 [==============================] - 204s 20ms/step - loss: 0.0383 - val_loss: 0.0141
Epoch 3/10
10229/10229 [==============================] - 204s 20ms/step - loss: 0.0348 - val_loss: 0.0137
Epoch 4/10
10229/10229 [==============================] - 203s 20ms/step - loss: 0.0330 - val_loss: 0.0123
Epoch 5/10
10229/10229 [==============================] - 203s 20ms/step - loss: 0.0317 - val_loss: 0.0118
Epoch 6/10
7636/10229 [=====================>........] - ETA: 49s - loss: 0.0307Segmentation fault (core dumped)
@Dekken, can you post the log without CUDA_VISIBLE_DEVICES=""? That environment variable works similarly to HIP_VISIBLE_DEVICES; with it set, tensorflow-rocm won't be able to see or use any GPUs.
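For reference, here's a minimal sketch (TF 1.x) of how to check which devices the runtime actually sees; with the variable set to "" before TF is imported, only CPU devices should be listed:

# Minimal sketch: list the devices TensorFlow can see (TF 1.x API).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # or HIP_VISIBLE_DEVICES on ROCm
from tensorflow.python.client import device_lib
for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type)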
The normal run is the one that takes 40+ minutes per epoch, so I'd rather not let it run to completion. Here's the start of the log:
2019-02-21 23:01:58.296943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1673] Adding visible gpu devices: 0, 1
2019-02-21 23:01:58.296970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 23:01:58.296978: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1090] 0 1
2019-02-21 23:01:58.296986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 0: N N
2019-02-21 23:01:58.296993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] 1: N N
2019-02-21 23:01:58.297052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15306 MB memory) -> physical GPU (device: 0, name: Vega 10 XTX [Radeon Vega Frontier Edition], pci bus id: 0000:03:00.0)
2019-02-21 23:01:58.314591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1220] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 15306 MB memory) -> physical GPU (device: 1, name: Vega 10 XTX [Radeon Vega Frontier Edition], pci bus id: 0000:0c:00.0)
2019-02-21 23:02:03.002830: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-02-21 23:02:03.078605: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
2019-02-21 23:02:03.164291: I tensorflow/core/kernels/conv_grad_input_ops.cc:997] running auto-tune for Backward-Data
2019-02-21 23:02:03.262944: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
2019-02-21 23:02:03.310904: I tensorflow/core/kernels/conv_grad_filter_ops.cc:886] running auto-tune for Backward-Filter
222/10229 [..............................] - ETA: 40:30 - loss: 0.0836
Okay, I can repro your issue now. It looks like the GPUs were under very light load during training. The next step is to profile the workload on ROCm and the other targets and look into the device placement of each op; please stay tuned.
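For anyone following along, placement can be inspected with TF's placement logging; a rough sketch for the TF 1.x + standalone Keras stack seen in the traceback above (how test.py builds its model is an assumption here):

# Sketch: log where each op is placed (CPU vs GPU:0/GPU:1) under TF 1.x/Keras.
import tensorflow as tf
import keras.backend as K
config = tf.ConfigProto(log_device_placement=True)
K.set_session(tf.Session(config=config))
# ...then build and fit the model as usual; each op's placement is printed
# as the graph executes.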
Did you confirm CPU segfault by any chance?
Hi @Dekken, the CPU segfault was not observed on my local system. While we look into the issue, can you share a bit of background on the case? E.g. is it based on a public project? Or are there any similar public models we can try as well?
It's not my project so I'm going to see if I can get a colleague in.
pinging @gautiernguyen @nicolasaunai
I think my segfault was due to OOM.
My colleagues are currently on vacation, but this is part of a PhD in the plasma physics department on "Earth's magnetopause detection from in situ satellite measurements with ML"
not that I know what this means
Made a bit of progress this week. Turns out we were using some non-optimal reduction code due to some stale ROCm-related ifdefs. We have some local changes that show significant improvements.
Here's where the (initial) bottleneck occurs:
https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/tensorflow/core/kernels/reduction_gpu_kernels.cu.h#L713-L736
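For anyone wanting to probe that path in isolation, here's a rough, hypothetical micro-benchmark of a column reduction at the Python level; the shape and iteration count are arbitrary and not taken from the test case:

# Rough sanity probe of a GPU reduction (tf.reduce_sum along axis 0).
import time
import numpy as np
import tensorflow as tf

x = tf.Variable(np.random.rand(4096, 4096).astype(np.float32))
reduced = tf.reduce_sum(x, axis=0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(reduced)                      # warm-up / autotune
    iters = 100
    start = time.time()
    for _ in range(iters):
        sess.run(reduced)
    print("avg reduce_sum: %.3f ms/iter" % ((time.time() - start) * 1000.0 / iters))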
Btw - thanks for reporting this issue.
@parallelo can't wait for your PR :)
@whchung
Currently wrapping up some mods. It is running now, and it appears to be converging (as best I can tell).
There's always room for more improvement, but the initial reduction kernel bottleneck appears to be fixed :-)
with #344 merged, should we be good?
If I compile "develop-upstream" myself, I should be able to confirm?
Hi @Dekken - Worth another try now. Please check and report back here.
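To make sure the rebuilt wheel is the one being imported, a quick check (the exact version/git strings will of course differ per build):

# Confirm which TensorFlow installation and build is actually imported.
import tensorflow as tf
print(tf.VERSION, tf.GIT_VERSION, tf.COMPILER_VERSION)
print(tf.__file__)   # path shows which installation is on the import path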
There's certainly a difference.
We're at ~300 seconds per epoch now.
Should a Vega FE, in theory, be on par with (or better than) a 1080 Ti?
Hi @Dekken , the TF-ROCm 1.14 release includes the PR #344. Can you check it there and let us know if it's okay to close this issue?
Is there somewhere to continue assessing the performance vs a 1080 Ti?
@Dekken, one path is to profile your workload using RPT: https://scchan.github.io/hcc/md__home_scchan_code_hcc_doc_markdown_hcc_profile.html and then cross-compare with the nvprof results from the 1080 Ti. With that, we might be able to figure out whether there's any other avenue or low-hanging fruit for improving performance.
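As a complementary, framework-level option that should work on both the ROCm and CUDA builds, a Chrome-trace timeline of a single step can be captured and compared side by side; a sketch with a stand-in matmul (substitute the model's real training step):

# Sketch: capture a Chrome trace of one session step (TF 1.x).
import tensorflow as tf
from tensorflow.python.client import timeline

a = tf.random_uniform([2048, 2048])
op = tf.matmul(a, a)                      # stand-in for the real train step

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
with tf.Session() as sess:
    sess.run(op, options=run_options, run_metadata=run_metadata)

trace = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())
# Open timeline.json in chrome://tracing to compare per-op timings.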
Hey there,
I'm trialing some code to benchmark my Vega against a colleague's 1080 Ti.
I've noticed some very peculiar differences in time to fit per epoch; I'm guessing I'm messing up in some way.
For one epoch on an AMD 2990WX: ~400 seconds
For one epoch on a 1080 Ti: < 100 s
For one epoch on a Vega FE: > 40 minutes