Open hanhanyimo opened 4 years ago
Hi @hanhanyimo, can you provide the tf_cnn_benchmarks performance numbers, comparing CPU and GPU on your local machine? You can find the instructions here: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-quickstart.md#tensorflows-tf_cnn_benchmarks
Also, if you have not done so already, please try our pre-built docker containers below to make sure your user-space environment is in good shape: https://hub.docker.com/r/rocm/tensorflow
It is also slow for me, and the docker container (which used to work) now returns:
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2020-04-08 20:47:00.458792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libhip_hcc.so
2020-04-08 20:47:00.473386: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice
2020-04-08 20:47:00.473411: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice
Num GPUs Available: 0
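When a session prints many log lines, it can help to pull out only the error-level entries. The following is a minimal sketch that assumes the log format seen above (timestamp, a one-letter severity, `file:line]`, then the message); `find_errors` and `SAMPLE` are hypothetical names introduced for illustration, not part of TensorFlow.

```python
import re

# Matches log lines shaped like the ones quoted above, e.g.
# "2020-04-08 20:47:00.473386: E tensorflow/.../rocm_driver.cc:975] message"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<severity>[IWEF]) "
    r"(?P<source>\S+:\d+)\] "
    r"(?P<message>.*)$"
)


def find_errors(log_text):
    """Return de-duplicated (source, message) pairs with severity E or F."""
    errors = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("severity") in ("E", "F"):
            entry = (match.group("source"), match.group("message"))
            if entry not in errors:  # the same error is often logged twice
                errors.append(entry)
    return errors


# Sample copied from the session quoted in this thread.
SAMPLE = """\
2020-04-08 20:47:00.458792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libhip_hcc.so
2020-04-08 20:47:00.473386: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice
2020-04-08 20:47:00.473411: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice
Num GPUs Available: 0
"""

for source, message in find_errors(SAMPLE):
    print(f"{source}: {message}")
```

Lines that do not follow this shape (such as the `Num GPUs Available` line) are simply ignored.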
@atomobianco, the log implies the ROCm driver stack was not properly configured on your system.
Are you able to execute /opt/rocm/bin/rocminfo?
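Alongside rocminfo, one quick thing to rule out is whether the GPU device nodes are visible to the process at all, for example inside a container. The sketch below assumes that `/dev/kfd` and `/dev/dri` are the relevant nodes for the ROCm stack; that choice and the helper name `check_device_nodes` are assumptions for illustration, not something confirmed in this thread.

```python
import os

# Device nodes the ROCm stack normally uses. Treating these two as the ones
# to check is an assumption of this sketch.
ROCM_DEVICE_PATHS = ("/dev/kfd", "/dev/dri")


def check_device_nodes(paths=ROCM_DEVICE_PATHS):
    """Map each path to 'missing', 'no access', or 'ok' for the current user."""
    status = {}
    for path in paths:
        if not os.path.exists(path):
            status[path] = "missing"
            continue
        # Directories only need to be listable; device files need read/write.
        if os.path.isdir(path):
            mode = os.R_OK | os.X_OK
        else:
            mode = os.R_OK | os.W_OK
        status[path] = "ok" if os.access(path, mode) else "no access"
    return status


for path, state in check_device_nodes().items():
    print(f"{path}: {state}")
```

A "missing" or "no access" result would point at the container or group configuration rather than at TensorFlow itself, but an "ok" result does not prove the driver stack is healthy.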
rocminfo works as expected and lists the card as the second agent:
*******
Agent 2
*******
Name: gfx803
Marketing Name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
...
I used to work through the docker container, but with recent versions I have this problem, so I am now working directly with tensorflow-rocm, which has the poor performance discussed above.
Please make sure that this is an issue related to performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:performance_template
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
== check os platform ===============================================
os: Linux
os kernel version: #1 SMP Wed Aug 7 18:08:02 UTC 2019
os release version: 3.10.0-1062.el7.x86_64
os platform: Linux-3.10.0-1062.el7.x86_64-x86_64-with-centos-7.7.1908-Core
linux distribution: ('CentOS Linux', '7.7.1908', 'Core')
linux os distribution: ('centos', '7.7.1908', 'Core')
mac version: ('', ('', '', ''), '')
uname: uname_result(system='Linux', node='bogon', release='3.10.0-1062.el7.x86_64', version='#1 SMP Wed Aug 7 18:08:02 UTC 2019', machine='x86_64', processor='x86_64')
architecture: ('64bit', '')
machine: x86_64
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below):
== tensorflow import ============================================
tf.version.VERSION = 1.15.2
tf.version.GIT_VERSION = v1.15.0-16-g29b7532
tf.version.COMPILER_VERSION = 5.4.0 20160609
== check pips ===================================================
numpy 1.18.1
protobuf 3.11.3
tensorflow-estimator 1.15.1
tensorflow-rocm 1.15.2
Python version:
Bazel version (if compiling from source):
== check python ===================================================
python version: 3.6.9
python branch:
python build version: ('default', 'Jul 30 2019 19:07:31')
python compiler version: GCC 7.3.0
python implementation: CPython
GCC/Compiler version (if compiling from source):
== compiler =====================================================
c++ (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
CUDA/cuDNN version:
GPU model and memory:
GPU 8G rocm-3.1.0
rocminfo
ROCk module is loaded
root is member of video group
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
Agent 1
Name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Marketing Name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3400
BDFID: 0
Internal Node ID: 0
Compute Unit: 4
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 7907180(0x78a76c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 7907180(0x78a76c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
ISA Info:
N/A
Agent 2
Name: gfx803
Marketing Name: Polaris 20 XL [Radeon RX 580 2048SP]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 28639(0x6fdf)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1306
BDFID: 256
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8388608(0x800000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx803
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done
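To confirm at a glance which agents in a rocminfo dump are GPUs, the output can be reduced to a small table of agents. This is a rough sketch that assumes one `Key: Value` field per line and an `Agent N` header before each agent, as rocminfo normally prints; `parse_agents`, `gpu_agents`, and the trimmed `SAMPLE` are hypothetical names and data for illustration.

```python
import re

AGENT_HEADER = re.compile(r"^Agent (\d+)$")


def parse_agents(rocminfo_text):
    """Return {agent_number: {field: value}} from rocminfo-style text.

    Only the first occurrence of each field is kept per agent, so the
    agent-level "Name" wins over the later "Name" inside "ISA Info".
    """
    agents = {}
    current = None
    for raw_line in rocminfo_text.splitlines():
        line = raw_line.strip()
        header = AGENT_HEADER.match(line)
        if header:
            current = agents.setdefault(int(header.group(1)), {})
            continue
        if current is None or ":" not in line:
            continue  # separators, banners, and lines before the first agent
        key, _, value = line.partition(":")
        current.setdefault(key.strip(), value.strip())
    return agents


def gpu_agents(rocminfo_text):
    """Return the names of all agents whose Device Type is GPU."""
    agents = parse_agents(rocminfo_text)
    return [
        fields.get("Name", "")
        for _, fields in sorted(agents.items())
        if fields.get("Device Type") == "GPU"
    ]


# Trimmed sample based on the dump above.
SAMPLE = """\
*******
Agent 1
*******
  Name:                    Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
  Device Type:             CPU
*******
Agent 2
*******
  Name:                    gfx803
  Device Type:             GPU
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx803
"""

print(gpu_agents(SAMPLE))
```

On a real system the text would come from running `/opt/rocm/bin/rocminfo` and capturing its standard output.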
You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
When running a TextCNN model on an AMD RX580 8G, training and evaluation proceed as follows:
Epoch: 1
Iter: 0, Train Loss: 6.2, Train Acc: 35.74%, Val Loss: 6.1, Val Acc: 34.82%, Time: 0:00:23
Iter: 100, Train Loss: 1.5, Train Acc: 95.90%, Val Loss: 1.2, Val Acc: 95.86%, Time: 0:00:54
This speed is the same as when training on the CPU, and the radeontop command shows the GPU fully utilized.
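To make the CPU-versus-GPU comparison concrete, the progress lines can be turned into a seconds-per-iteration figure. The sketch below assumes the `Time` column is elapsed wall-clock time since the start of the run (so it also includes any validation passes); `seconds_per_iteration` is a hypothetical helper, not part of the training script.

```python
import re

# Captures the iteration number and the H:MM:SS elapsed time of each entry.
ITER_LINE = re.compile(r"Iter:\s*(\d+),.*?Time:\s*(\d+):(\d{2}):(\d{2})")


def seconds_per_iteration(log_text):
    """Average seconds per iteration between the first and last log entry."""
    points = [
        (int(step), int(hours) * 3600 + int(minutes) * 60 + int(seconds))
        for step, hours, minutes, seconds in ITER_LINE.findall(log_text)
    ]
    if len(points) < 2:
        raise ValueError("need at least two 'Iter: ... Time: ...' entries")
    (first_step, first_time), (last_step, last_time) = points[0], points[-1]
    if last_step == first_step:
        raise ValueError("first and last entries have the same iteration")
    return (last_time - first_time) / (last_step - first_step)


# The two progress entries reported in this issue.
SAMPLE = (
    "Iter: 0, Train Loss: 6.2, Train Acc: 35.74%, Val Loss: 6.1, "
    "Val Acc: 34.82%, Time: 0:00:23\n"
    "Iter: 100, Train Loss: 1.5, Train Acc: 95.90%, Val Loss: 1.2, "
    "Val Acc: 95.86%, Time: 0:00:54\n"
)

print(f"{seconds_per_iteration(SAMPLE):.2f} s/iter")
```

Running the same helper on a CPU-only log of the same model would give a like-for-like number to report alongside the tf_cnn_benchmarks results.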
Describe the expected behavior
We expect a model trained on the GPU to train faster than one trained on the CPU.
Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
log_13083345 - 副本.log