ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

TextCNN with rocm-tensorflow has the same performance with tensorflow-cpu #895

Open hanhanyimo opened 4 years ago

hanhanyimo commented 4 years ago

Please make sure that this is an issue related to performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:performance_template

System information

==========
HSA Agents
==========


Agent 1


Name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Marketing Name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3400
BDFID: 0
Internal Node ID: 0
Compute Unit: 4
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 7907180(0x78a76c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 7907180(0x78a76c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
ISA Info:
N/A


Agent 2


Name: gfx803
Marketing Name: Polaris 20 XL [Radeon RX 580 2048SP]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 28639(0x6fdf)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1306
BDFID: 256
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8388608(0x800000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx803
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
Running the TextCNN model on an AMD RX 580 8G, training and evaluation proceed as follows:
Epoch: 1
Iter: 0, Train Loss: 6.2, Train Acc: 35.74%, Val Loss: 6.1, Val Acc: 34.82%, Time: 0:00:23
Iter: 100, Train Loss: 1.5, Train Acc: 95.90%, Val Loss: 1.2, Val Acc: 95.86%, Time: 0:00:54
This speed is the same as training on the CPU, even though the radeontop command shows the GPU running at full load.
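For reference, GPU visibility and op placement can be checked with something like the following (a minimal sketch assuming the TF 1.x session API, not the actual TextCNN code; on TF 2.x, tf.debugging.set_log_device_placement(True) plays the same role):

# Hypothetical quick checks, not the TextCNN run itself
# 1) confirm tensorflow-rocm sees the gfx803 GPU at all
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"
# 2) log device placement for one matmul, so /device:GPU:0 vs /device:CPU:0 shows up in the log
python -c "import tensorflow as tf; s = tf.Session(config=tf.ConfigProto(log_device_placement=True)); print(s.run(tf.matmul(tf.ones([2, 2]), tf.ones([2, 2]))))"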

Describe the expected behavior
We expect the model to train faster on the GPU than on the CPU.

Standalone code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
log_13083345 - 副本.log

sunway513 commented 4 years ago

Hi @hanhanyimo, can you provide the performance numbers on tf_cnn_benchmarks, comparing CPU vs GPU on your local machine? You can find the instructions here: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-quickstart.md#tensorflows-tf_cnn_benchmarks
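For reference, the comparison in the quickstart amounts to roughly the following (a sketch from memory; the exact flags and the compatible benchmarks branch are in the linked doc, so treat these as assumptions):

# Rough sketch of a CPU-vs-GPU tf_cnn_benchmarks comparison; see the quickstart for the exact steps
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
# GPU run
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50
# CPU run for comparison
python3 tf_cnn_benchmarks.py --device=cpu --data_format=NHWC --batch_size=64 --model=resnet50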

Besides, if you have not already, please try our pre-built docker containers below, to make sure your user-bits environment is in good shape: https://hub.docker.com/r/rocm/tensorflow
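The container is usually launched with something like the following (an approximate invocation, not the exact documented command; the ROCm docs have the authoritative flags):

# Approximate docker invocation for the rocm/tensorflow image
docker pull rocm/tensorflow:latest
docker run -it --network=host --device=/dev/kfd --device=/dev/dri \
    --group-add video --ipc=host --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined rocm/tensorflow:latest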

atomobianco commented 4 years ago

Also behaving slowly here, and the docker container (which once worked) now returns:

>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2020-04-08 20:47:00.458792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libhip_hcc.so
2020-04-08 20:47:00.473386: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice
2020-04-08 20:47:00.473411: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice
Num GPUs Available:  0

sunway513 commented 4 years ago

@atomobianco, the log implies the ROCm driver stack is not properly configured on your system. Are you able to execute /opt/rocm/bin/rocminfo?

atomobianco commented 4 years ago

rocminfo is working as expected, listing the card as the second agent:

*******                  
Agent 2                  
*******                  
  Name:                    gfx803                             
  Marketing Name:          Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
...

I was previously working through the docker container, but since recent versions I have had this problem, so I am now working directly with tensorflow-rocm, which shows the poor performance discussed above.