ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Tensorflow detects my GPU, but doesn't use it because "supported AMDGPU versions" is mistyped. #2488

Closed berinaniesh closed 2 months ago

berinaniesh commented 2 months ago

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

2.14

Custom code

Yes

OS platform and distribution

Docker latest rocm

Mobile device

No response

Python version

3.9.18

Bazel version

No response

GCC/compiler version

9.4.0

CUDA/cuDNN version

ROCm 6.0, runtime version 1.1

GPU model and memory

Radeon RX 6800M (gfx1031, overridden to gfx1030 with HSA_OVERRIDE_GFX_VERSION; has worked in previous versions)

Current behavior?

I'm using TensorFlow from the official ROCm TensorFlow Docker image (latest, tf.__version__=2.14). I have a Radeon RX 6800M (Asus G513QY laptop). The GPU is gfx1031, which is unsupported, but setting the environment variable HSA_OVERRIDE_GFX_VERSION=10.3.0 makes it report as gfx1030. This has worked well in the past. The newer version of the Docker image detects my GPU as gfx1030 but ignores it, claiming that gfx1030 is not in the list of supported GPUs.

However, gfx1030 is in that list. The separator between gfx1030 and gfx1100 is missing, so the two entries are merged into gfx1030gfx1100, and the check fails to recognize gfx1030 as a supported GPU. The output is below.

2024-04-08 08:21:28.484960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2266] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6800M, pci bus id: 0000:03:00.0) with AMDGPU version : gfx1030. The supported AMDGPU versions are gfx1030gfx1100, gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942.
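
One plausible way to end up with a merged entry like this is a missing comma between two adjacent string literals in the supported-versions list, since the compiler or interpreter silently concatenates them. The sketch below shows that mechanism in Python. It is a hypothetical reconstruction, not the actual TensorFlow source: the names `SUPPORTED_BROKEN`, `SUPPORTED_FIXED`, `is_supported`, and `format_supported` are made up for illustration, and the real cause should be confirmed against the fix commit.

```python
# Hypothetical reconstruction of the failure mode, NOT the TensorFlow source.
# Adjacent string literals are concatenated, so a missing comma silently
# merges two list entries into one.
SUPPORTED_BROKEN = [
    "gfx1030"   # <- missing comma: this literal is joined with the next one
    "gfx1100",
    "gfx900", "gfx906", "gfx908", "gfx90a",
    "gfx940", "gfx941", "gfx942",
]

SUPPORTED_FIXED = [
    "gfx1030",  # comma restored
    "gfx1100",
    "gfx900", "gfx906", "gfx908", "gfx90a",
    "gfx940", "gfx941", "gfx942",
]


def is_supported(gfx_version, supported):
    """Exact-match membership test, as a support check would typically do."""
    return gfx_version in supported


def format_supported(supported):
    """Render the list the way it appears in the log message."""
    return ", ".join(supported)


if __name__ == "__main__":
    print(format_supported(SUPPORTED_BROKEN))
    print("broken list accepts gfx1030:", is_supported("gfx1030", SUPPORTED_BROKEN))
    print("fixed list accepts gfx1030:", is_supported("gfx1030", SUPPORTED_FIXED))
```

With the broken list, the rendered string starts with `gfx1030gfx1100`, matching the log line, and an exact-match lookup for `gfx1030` fails even though the text appears to contain it.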

I searched for the string gfx1030gfx1100 in this repo as well as the ROCm Docker repo, but couldn't find it. Can someone fix this?

Standalone code to reproduce the issue

Run TensorFlow from the Docker image and try to list the GPUs. TensorFlow detects the GPU but does not use it.
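
A minimal sketch of that reproduction step, assuming it is run inside the ROCm TensorFlow container. The helper name `list_gpus` is made up for illustration, and setting HSA_OVERRIDE_GFX_VERSION from Python before the import is an assumption; passing it to the container as an environment variable should serve the same purpose.

```python
import os

# Assumption: the override must be in the environment before TensorFlow
# loads the ROCm runtime, so set it before the import below.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")


def list_gpus():
    """Return the GPU devices TensorFlow reports as usable."""
    import tensorflow as tf  # imported late so the override is already set
    return tf.config.list_physical_devices("GPU")
```

Calling `print(list_gpus())` in the affected image should trigger the "Ignoring visible gpu device" message shown in the log output.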

Relevant log output

2024-04-08 08:21:28.484960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2266] Ignoring visible gpu device (device: 0, name: AMD Radeon RX 6800M, pci bus id: 0000:03:00.0) with AMDGPU version : gfx1030. The supported AMDGPU versions are gfx1030gfx1100, gfx900, gfx906, gfx908, gfx90a, gfx940, gfx941, gfx942.
berinaniesh commented 2 months ago

Apparently, this is already being fixed in https://github.com/ROCm/tensorflow-upstream/pull/2434/commits/632e25544c6881a8acf798827d3699281795fbf8, but a new release has not been made yet. Another issue is open requesting a release: https://github.com/ROCm/tensorflow-upstream/issues/2487. So I'm closing this.