paolodalberto opened 9 months ago
Feel free to reach me directly/internally ... thank you, Paolo
I observed the same behaviour and thought of an incompatibility between ROCm 5.6 and TF 2.13. But that was just a wild guess.
My home setup with the new tensorflow:latest docker does the same (different GPUs, Radeon VII). This is a showstopper ... any attention will be appreciated!
ls /etc/alternatives/roc -lrt
roc-obj roc-obj-ls rocm/ rocm_agent_enumerator rocprof
roc-obj-extract rocgdb rocm-smi rocminfo rocprofv2
:/root# ls /etc/alternatives/rocm -lrt
lrwxrwxrwx 1 root root 15 Sep 16 23:54 /etc/alternatives/rocm -> /opt/rocm-5.7.0
drwxr-xr-x 1 root root 4096 Sep 16 23:54 rocm-5.7.0
lrwxrwxrwx 1 root root 22 Sep 16 23:54 rocm -> /etc/alternatives/rocm
The default ROCm version seems to be 5.7, but HIP is 5.6?
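If it helps anyone else triage, a minimal sketch of how one might print the ROCm release an image ships, assuming the standard /opt/rocm layout with its .info/version file (the fallback string is mine):

from pathlib import Path

# /opt/rocm/.info/version records the ROCm release the image was built with;
# the file may be absent on non-standard installs, hence the fallback.
info = Path("/opt/rocm/.info/version")
print("ROCm:", info.read_text().strip() if info.exists() else "unknown")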
Any takers for this issue?
Is there anyone?
echo ... echo ... echo
Shoot me an email: paolod AT amd.com
Same here. Is there any update?
@gzitzlsb-it4i no updates on my side
keeping the comments alive ...
I see this issue with both rocm5.7-tf2.12-dev and rocm5.7-tf2.13-dev. Reverted now to rocm5.6-tf2.12-dev, which works well.
Maybe this is related to the change from rocm 5.6 -> 5.7?
Thanksgiving ... take your time. @gzitzlsb-it4i, I tested it from a docker image ... tensorflow:latest. Should it be addressed there? Who knows ... one day
I'm observing the same problem with rocm 5.7 and both tf 2.12 and tf 2.13. It does not appear with rocm 5.6 and tf 2.12.
Can anyone redirect me to a person I can talk to?
I guess we will wait for rocm 6
I tried to pull again; there is no new version. Is there anything I can do?
Is there a TensorFlow docker for rocm 6? I removed and pulled it again and it is still 5.7.
Keeping this alive because the last pull did not fix this. Thank you and Happy Holidays!
Any update?
REPOSITORY TAG IMAGE ID CREATED SIZE
rocm/tensorflow latest a169c415feb2 2 weeks ago 37.2GB
<none> <none> 36781c65cb73 2 months ago 45.5GB
containers.xilinx.com/acdc/build 2.0 b66986b55092 2 months ago 6.71GB
rocm/tensorflow <none> 0db6c42705bf 3 months ago 31.9GB
rocm/pytorch latest 1cd3cad3f90f 3 months ago 52.1GB
PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')
2024-01-09 00:06:37.372844: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:1294] failed to query device memory info: HIP_ERROR_InvalidValue
Traceback (most recent call last):
  File "/dockerx/test_user.py", line 212, in <module>
    gpus = tf.config.list_physical_devices('GPU')
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/config.py", line 491, in list_logical_devices
    return context.context().list_logical_devices(device_type=device_type)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/context.py", line 1688, in list_logical_devices
    self.ensure_initialized()
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/context.py", line 598, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.UnknownError: Failed to query available memory for GPU 0
(Pdb) l
215         try:
216             # Currently, memory growth needs to be the same across GPUs
217             for gpu in gpus:
218                 print(gpu)
219                 tf.config.experimental.set_memory_growth(gpu, True)
220 ->          logical_gpus = tf.config.list_logical_devices('GPU')
221             print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
222         except RuntimeError as e:
223             # Memory growth must be set before GPUs have been initialized
224             print(e)
225
(Pdb) n
2024-01-09 00:09:06.679489: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:1294] failed to query device memory info: HIP_ERROR_InvalidValue
tensorflow.python.framework.errors_impl.UnknownError: Failed to query available memory for GPU 0
> /dockerx/test_user.py(220)<module>()
I thought the latest drop would address this ... but how can you address it if you do not acknowledge it ... the suspense.
I also confirm that ROCM 6.0 and tensorflow 2.14 still do not work on MI250X, the same error pops up:
2024-01-10 20:25:42.726550: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:1294] failed to query device memory info: HIP_ERROR_InvalidValue
ROCm + TensorFlow is becoming badly out of date and unusable on large HPC systems that made the mistake of buying AMD MI250X.
Someday I'll wish upon a star / Wake up where the clouds are far behind me / Where trouble melts like lemon drops
@paolodalberto @jpata what AMDGPU driver version are you trying to run the container on? On our HPC system we have a rather old one, Driver version: 5.16.9.22.20
due to an outdated ROCm 5.2.3 version present in the Cray environment. @jpata I assume you use LUMI, which should have a similar issue.
I believe no matter the container version you use, the issue is the driver on the host system.
@dipietrantonio excellent point, thanks a lot! I confirm that LUMI HPC where I'm experiencing this issue uses 5.16.9.22.20
I used my home system (Vega VII with upgraded Ubuntu) and two more advanced ones with MI100, also upgraded recently. PyTorch works.
tf-docker / > bash /dockerx/test.sh
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
> /dockerx/test_user.py(212)<module>()
-> gpus = tf.config.list_physical_devices('GPU')
(Pdb) c
3 Physical GPUs
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')
PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU')
2024-02-06 22:30:38.546437: E tensorflow/compiler/xla/stream_executor/rocm/rocm_driver.cc:1294] failed to query device memory info: HIP_ERROR_InvalidValue
Traceback (most recent call last):
  File "/dockerx/test_user.py", line 212, in <module>
    gpus = tf.config.list_physical_devices('GPU')
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/config.py", line 491, in list_logical_devices
    return context.context().list_logical_devices(device_type=device_type)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/context.py", line 1688, in list_logical_devices
    self.ensure_initialized()
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/context.py", line 598, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.UnknownError: Failed to query available memory for GPU 0
Dear @paolodalberto @jpata ,
We have installed a newer version of the ROCm driver (6.0.5) on a bunch of nodes for testing and now my container with ROCm 5.7 and TF 2.13 works on the code posted in the description of this issue. The error is gone :) So it is a driver issue as I expected.
$ export CIMAGE=$MYSOFTWARE/tensorflow-2.23-rocm5.7.sif
$ singularity exec $CIMAGE python3 tf_test.py
2024-02-07 14:35:41.241861: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
1 Physical GPUs
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
2024-02-07 14:35:46.934224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 63938 MB memory: -> device: 0, name: AMD Instinct MI250X, pci bus id: 0000:d1:00.0
1 Physical GPUs, 1 Logical GPUs
$ cat tf_test.py
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(len(gpus), "Physical GPUs")
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print(gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
@dipietrantonio excuse me for my thickness. The driver does not come with the docker image? You are saying that the rocm 5.7 driver is the problem ...
When you run a container you rely on the host kernel, not the one installed in your container. The driver is a kernel module. You need to update the driver on the system you are running the container on (at least when you use the Singularity container engine, but I think it is the same for Docker).
For me, the ROCm 5.2 driver was the issue. I was not expecting that even the ROCm 5.7 driver could have this problem. But as I said, driver version 6.0.5 solved it.
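For anyone hitting this, a quick sanity check is to compare the kernel-side driver the container actually sees against the ROCm userspace in the image. A minimal sketch, assuming the amdgpu module exposes a sysfs version file (absent on some in-tree builds, hence the fallback):

from pathlib import Path

# The kernel module (and thus this version) comes from the HOST, not from the
# ROCm userspace installed inside the container image.
driver = Path("/sys/module/amdgpu/version")
if driver.exists():
    print("host amdgpu driver:", driver.read_text().strip())
else:
    print("no /sys/module/amdgpu/version; driver may be an in-tree build")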
hmm ...
Still no new docker with rocm 6
new docker arrived
/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.18) or chardet (3.0.4) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
/usr/local/lib/python3.9/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.26.4)
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/util/structure.py", line 105, in normalize_element
    spec = type_spec_from_value(t, use_fallback=False)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/util/structure.py", line 514, in type_spec_from_value
    raise TypeError("Could not build a `TypeSpec` for {} with type {}".format(
TypeError: Could not build a `TypeSpec` for ['/imagenet/train/n02102177/n02102177_9088.JPEG', '/imagenet/train/n01796340/n01796340_3887.JPEG', '/imagenet/train/n02363005/n02363005_6465.JPEG', '/imagenet/train/n02965783/n02965783_1876.JPEG', '/imagenet/train/n01734418/n01734418_12680.JPEG', '/imagenet/train/n02422699/n02422699_28690.JPEG',

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/dockerx/test_user.py", line 268, in <module>
    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  File "/usr/local/lib/python3.9/dist-packages/keras/src/utils/image_dataset.py", line 308, in image_dataset_from_directory
    dataset = paths_and_labels_to_dataset(
  File "/usr/local/lib/python3.9/dist-packages/keras/src/utils/image_dataset.py", line 350, in paths_and_labels_to_dataset
    path_ds = tf.data.Dataset.from_tensor_slices(image_paths)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 825, in from_tensor_slices
    return from_tensor_slices_op._from_tensor_slices(tensors, name)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/ops/from_tensor_slices_op.py", line 25, in _from_tensor_slices
    return _TensorSliceDataset(tensors, name=name)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/ops/from_tensor_slices_op.py", line 33, in __init__
    element = structure.normalize_element(element)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/data/util/structure.py", line 110, in normalize_element
    ops.convert_to_tensor(t, name="component_%d" % i))
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/profiler/trace.py", line 183, in wrapped
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/ops.py", line 696, in convert_to_tensor
    return tensor_conversion_registry.convert(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/tensor_conversion_registry.py", line 234, in convert
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/constant_op.py", line 335, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/ops/weak_tensor_ops.py", line 142, in wrapper
    return op(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/constant_op.py", line 271, in constant
    return _constant_impl(value, dtype, shape, name, verify_shape=False,
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/constant_op.py", line 284, in _constant_impl
    return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/constant_op.py", line 296, in _constant_eager_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/constant_op.py", line 102, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/eager/context.py", line 603, in ensure_initialized
    context_handle = pywrap_tfe.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.UnknownError: Failed to query available memory for GPU 0
This is with my system at home; I will check on Monday on the real machine.
https://github.com/ROCm/tensorflow-upstream/issues/2289#issuecomment-1931424826 how do you upgrade the driver?
I could:
sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.1/ubuntu/jammy/amdgpu-install_6.1.60100-1_all.deb
sudo apt install ./amdgpu-install_6.1.60100-1_all.deb
sudo amdgpu-install --list-usecase
sudo amdgpu-install --usecase=dkms,rocm,graphics,hiplibsdk,workstation,asan
sudo amdgpu-install --usecase=dkms,rocm,graphics,hiplibsdk,hip
sudo amdgpu-install --usecase=dkms,rocm,rocmdev,opencl,graphics,hiplibsdk,hip
sudo amdgpu-install --usecase=dkms
sudo amdgpu-install --usecase=dkms,rocm,rocmdev,rocmdevtools
sudo amdgpu-install --usecase=dkms,rocm
sudo amdgpu-install --usecase=dkms,rocmdev, rocm
sudo amdgpu-install --usecase=dkms
sudo reboot
At least it works for one GPU
Let me check what I can do on my large machine ...
The large machine now kicks me out during evaluation, but I can briefly see the GPUs.
Yep, multiple GPUs do not work (a single GPU works).
In practice, multiple GPUs fail so badly that the docker application stalls the machine and breaks the docker daemon, which I have to restart manually. This is on one of the systems above ... The funny part is this was working on 5.7, six months ago ... for TensorFlow and PyTorch ...
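Since a single GPU works, a possible stopgap while this gets sorted out (my suggestion, not a verified fix) is to hide all but one device before TensorFlow initializes, either through the HIP_VISIBLE_DEVICES environment variable or through tf.config.set_visible_devices:

import os

# Must be set before TensorFlow creates its GPU context.
os.environ["HIP_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to HIP

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Alternatively, restrict TensorFlow itself to a single device.
    tf.config.set_visible_devices(gpus[0], 'GPU')
print(tf.config.get_visible_devices('GPU'))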
Let me know if you would like to connect ...
Good times
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
binary
TensorFlow version
v2.13.0-4108-g619eb25934e 2.13.0
Custom code
No
OS platform and distribution
Linux xsjfislx32 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Mobile device
No response
Python version
Python 3.9.18
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
This is the smallest piece of code from a tutorial that reproduces my problem.
Standalone code to reproduce the issue
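Presumably this is the tutorial snippet exercised throughout the thread (the same code as tf_test.py above):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(len(gpus), "Physical GPUs")
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print(gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)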
Relevant log output