ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

TextCNN with rocm-tensorflow has the same performance as tensorflow-cpu #895

Open hanhanyimo opened 4 years ago

hanhanyimo commented 4 years ago

Please make sure that this is an issue related to performance of TensorFlow. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:performance_template

System information

==========
HSA Agents
==========


Agent 1


Name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Marketing Name: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3400
BDFID: 0
Internal Node ID: 0
Compute Unit: 4
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 7907180(0x78a76c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 7907180(0x78a76c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
ISA Info:
N/A


Agent 2


Name: gfx803
Marketing Name: Polaris 20 XL [Radeon RX 580 2048SP]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 28639(0x6fdf)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1306
BDFID: 256
Internal Node ID: 1
Compute Unit: 32
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8388608(0x800000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx803
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension: x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension: x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
Done

You can collect some of this information using our environment capture script. You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior
Running the TextCNN model on an AMD RX 580 8G, training and evaluation proceed as follows:
Epoch: 1
Iter: 0, Train Loss: 6.2, Train Acc: 35.74%, Val Loss: 6.1, Val Acc: 34.82%, Time: 0:00:23
Iter: 100, Train Loss: 1.5, Train Acc: 95.90%, Val Loss: 1.2, Val Acc: 95.86%, Time: 0:00:54
This speed is the same as training on the CPU, even though the radeontop command shows the GPU fully loaded.

Describe the expected behavior
We expect the model trained on the GPU to train faster than on the CPU.
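
One way to check whether TensorFlow is actually placing the compute ops on the GPU is to enable device-placement logging. A minimal sketch, assuming TF 1.x graph mode (this check is not part of the original report):

import tensorflow as tf

# Log where every op is placed. If MatMul/Conv ops show up on
# /device:CPU:0 instead of /device:GPU:0, the GPU is not being used
# for compute even though the runtime can see it.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))

On TF 2.x the equivalent check is tf.debugging.set_log_device_placement(True) before running the model.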

Standalone code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
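
The issue as filed does not attach a reproducer; for discussion purposes, a minimal TextCNN-style model in Keras might look like the sketch below (hypothetical code, not the reporter's actual script; vocabulary size, sequence length and filter settings are placeholders):

import tensorflow as tf
from tensorflow.keras import layers, models

# Placeholder hyperparameters, not taken from the reporter's setup.
VOCAB_SIZE = 5000
SEQ_LEN = 600
EMBED_DIM = 64
NUM_CLASSES = 10

def build_textcnn():
    inputs = layers.Input(shape=(SEQ_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)
    # Parallel 1-D convolutions over the embedded sequence, max-pooled
    # and concatenated (the standard TextCNN structure).
    convs = []
    for kernel_size in (3, 4, 5):
        c = layers.Conv1D(128, kernel_size, activation="relu")(x)
        convs.append(layers.GlobalMaxPooling1D()(c))
    x = layers.concatenate(convs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

if __name__ == "__main__":
    build_textcnn().summary()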

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
log_13083345 - 副本.log

sunway513 commented 4 years ago

Hi @hanhanyimo , can you provide the performance numbers on tf_cnn_benchmarks, comparing CPU vs GPU on your local machine? You can find the instructions here: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/blob/develop-upstream/rocm_docs/tensorflow-quickstart.md#tensorflows-tf_cnn_benchmarks

Besides, if you have not already, please try our pre-built docker containers below to make sure your user-space environment is in good shape: https://hub.docker.com/r/rocm/tensorflow
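
In addition to tf_cnn_benchmarks, a rough CPU-vs-GPU sanity check is to time a large matmul on each device. A minimal sketch, assuming TF 1.x graph mode (only a proxy, not a substitute for the benchmark numbers requested above):

import time
import tensorflow as tf

def time_matmul(device, n=4096, iters=10):
    # Build a small graph pinned to the requested device.
    g = tf.Graph()
    with g.as_default(), tf.device(device):
        a = tf.random_normal([n, n])
        b = tf.random_normal([n, n])
        c = tf.reduce_sum(tf.matmul(a, b))
    # allow_soft_placement falls back to CPU for ops without a GPU kernel.
    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(graph=g, config=config) as sess:
        sess.run(c)  # warm-up run, not timed
        start = time.time()
        for _ in range(iters):
            sess.run(c)
        return time.time() - start

print("CPU:", time_matmul("/cpu:0"))
print("GPU:", time_matmul("/device:GPU:0"))

A healthy GPU setup should finish the GPU run many times faster than the CPU run; roughly equal times would suggest the ops are not actually executing on the GPU.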

atomobianco commented 4 years ago

Also seeing slow behavior here, and the docker container (which once worked) now returns:

>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2020-04-08 20:47:00.458792: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libhip_hcc.so
2020-04-08 20:47:00.473386: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice
2020-04-08 20:47:00.473411: E tensorflow/stream_executor/rocm/rocm_driver.cc:975] could not retrieve ROCM device count: HIP_ERROR_NoDevice
Num GPUs Available:  0

sunway513 commented 4 years ago

@atomobianco , the log implies ROCm driver stack was not properly configured on your system. Are you able to execute /opt/rocm/bin/rocminfo?

atomobianco commented 4 years ago

rocminfo is working as expected, listing the card as the second agent:

*******                  
Agent 2                  
*******                  
  Name:                    gfx803                             
  Marketing Name:          Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]
...

I was previously working through the docker container, but since recent versions I have had this problem, so I am now working directly with tensorflow-rocm, which shows the poor performance discussed above.