ROCm / MIVisionX

MIVisionX is a comprehensive set of computer vision and machine intelligence libraries, utilities, and applications bundled into a single toolkit. AMD MIVisionX also delivers a highly optimized open-source implementation of the Khronos OpenVX™ and OpenVX™ Extensions.
https://rocm.docs.amd.com/projects/MIVisionX/en/latest/
MIT License

[Issue]: tensorflow benchmarks not working - ModuleNotFoundError: No module named 'keras.legacy_tf_layers' #1250

Closed: baryluk closed this issue 2 months ago

baryluk commented 9 months ago

Problem Description

Linux 6.7-rc4, amd64

user@debian:~/v2$ docker run -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  --ipc=host --shm-size 8G rocm/tensorflow:latest
root@8174e9a2cef2:/root# cd benchmarks/
root@8174e9a2cef2:/root/benchmarks# python3 ./scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
                                            --num_gpus=1 --model resnet50 --batch_size 32
2023-12-18 02:36:28.379972: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From /usr/local/lib/python3.9/dist-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /usr/local/lib/python3.9/dist-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
2023-12-18 02:36:31.573685: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-18 02:36:31.628949: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-18 02:36:31.629068: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-18 02:36:31.631059: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-18 02:36:31.631197: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-18 02:36:31.631290: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-18 02:36:31.631598: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-18 02:36:31.631705: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-18 02:36:31.631812: I tensorflow/compiler/xla/stream_executor/rocm/rocm_gpu_executor.cc:838] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-18 02:36:31.631880: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15506 MB memory:  -> device: 0, name: AMD Radeon RX 6900 XT, pci bus id: 0000:44:00.0
TensorFlow:  2.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
/usr/local/lib/python3.9/dist-packages/tensorflow/python/keras/legacy_tf_layers/convolutional.py:409: UserWarning: `tf.layers.conv2d` is deprecated and will be removed in a future version. Please Use `tf.keras.layers.Conv2D` instead.
  warnings.warn('`tf.layers.conv2d` is deprecated and '
/usr/local/lib/python3.9/dist-packages/tensorflow/python/keras/engine/base_layer_v1.py:1697: UserWarning: `layer.apply` is deprecated and will be removed in a future version. Please use `layer.__call__` method instead.
  warnings.warn('`layer.apply` is deprecated and '
Traceback (most recent call last):
  File "/root/benchmarks/./scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 68, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/root/.local/lib/python3.9/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/root/.local/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/root/benchmarks/./scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 63, in main
    bench.run()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 1876, in run
    return self._benchmark_train()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 2072, in _benchmark_train
    build_result = self._build_graph()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 2104, in _build_graph
    (input_producer_op, enqueue_ops, fetches) = self._build_model()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 2822, in _build_model
    results = self.add_forward_pass_and_gradients(
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 3345, in add_forward_pass_and_gradients
    outputs = maybe_compile(forward_pass_and_gradients, self.params)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 3542, in maybe_compile
    return computation()
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 3198, in forward_pass_and_gradients
    build_network_result = self.model.build_network(
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/models/model.py", line 289, in build_network
    self.add_inference(network)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/models/resnet_model.py", line 304, in add_inference
    cnn.conv(64, 7, 7, 2, 2, mode='SAME_RESNET', use_batch_norm=True)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 226, in conv
    biased = self.batch_norm(**self.batch_norm_config)
  File "/root/benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py", line 465, in batch_norm
    layer_obj = normalization_layers.BatchNormalization(
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/layers/normalization.py", line 30, in __getattr__
    return normalization.BatchNormalization
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/util/lazy_loader.py", line 58, in __getattr__
    module = self._load()
  File "/usr/local/lib/python3.9/dist-packages/tensorflow/python/util/lazy_loader.py", line 41, in _load
    module = importlib.import_module(self.__name__)
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 972, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'keras.legacy_tf_layers'
root@8174e9a2cef2:/root/benchmarks# 

No issues with other images: the rocm-terminal and PyTorch Docker images work, for example, and all of their functionality is fine.

This looks like an issue with the packages or paths inside the image.
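
For reference, a minimal check sketch (my own, not part of the original report) of what the traceback points at: on TF 2.13 the tf.compat.v1 layers are lazily loaded from keras.legacy_tf_layers, which the bundled Keras 2.13 no longer provides, so touching the legacy BatchNormalization layer fails the same way the benchmark does. Run with python3 inside the rocm/tensorflow container:

import importlib.util

import tensorflow as tf

# Confirm the TensorFlow build and that the GPU is visible at all.
print("TensorFlow:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))

# The benchmark resolves BatchNormalization through a lazy loader that
# imports 'keras.legacy_tf_layers'; check whether that module exists.
print("keras.legacy_tf_layers found:",
      importlib.util.find_spec("keras.legacy_tf_layers") is not None)

try:
    # Touching the legacy v1 layer triggers the same lazy import path
    # that fails in the traceback above.
    tf.compat.v1.layers.BatchNormalization()
except ModuleNotFoundError as exc:
    print("Reproduced the benchmark failure:", exc)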

Operating System

asd

CPU

asd

GPU

AMD Instinct MI250X

Other

No response

ROCm Version

ROCm 6.0?

Docker image output:

$ docker image ls | grep tensor
rocm/tensorflow   latest    0db6c42705bf   3 months ago   31.9GB

In the container:

root@2ed394bc95a7:/root/benchmarks# dpkg -l | grep -E 'rocm|hip'
ii  rocm-clang-ocl                         0.5.0.50700-63~20.04              amd64        OpenCL compilation with clang compiler.
ii  rocm-cmake                             0.10.0.50700-63~20.04             amd64        rocm-cmake built using CMake
ii  rocm-core                              5.7.0.50700-63~20.04              amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-dbgapi                            0.70.1.50700-63~20.04             amd64        Library to provide AMD GPU debugger API
ii  rocm-debug-agent                       2.0.3.50700-63~20.04              amd64        Radeon Open Compute Debug Agent (ROCdebug-agent)
ii  rocm-dev                               5.7.0.50700-63~20.04              amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-device-libs                       1.0.0.50700-63~20.04              amd64        Radeon Open Compute - device libraries
ii  rocm-gdb                               13.2.50700-63~20.04               amd64        ROCgdb
ii  rocm-libs                              5.7.0.50700-63~20.04              amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocm-llvm                              17.0.0.23352.50700-63~20.04       amd64        ROCm compiler
ii  rocm-ocl-icd                           2.0.0.50700-63~20.04              amd64        clr built using CMake
ii  rocm-opencl                            2.0.0.50700-63~20.04              amd64        clr built using CMake
ii  rocm-opencl-dev                        2.0.0.50700-63~20.04              amd64        clr built using CMake
ii  rocm-smi-lib                           5.0.0.50700-63~20.04              amd64        AMD System Management libraries
ii  rocm-utils                             5.7.0.50700-63~20.04              amd64        Radeon Open Compute (ROCm) Runtime software stack
ii  rocminfo                               1.0.0.50700-63~20.04              amd64        Radeon Open Compute (ROCm) Runtime rocminfo tool
ii  hip-dev                                5.7.31921.50700-63~20.04          amd64        HIP:Heterogenous-computing Interface for Portability
ii  hip-doc                                5.7.31921.50700-63~20.04          amd64        HIP:Heterogenous-computing Interface for Portability
ii  hip-runtime-amd                        5.7.31921.50700-63~20.04          amd64        HIP:Heterogenous-computing Interface for Portability
ii  hip-samples                            5.7.31921.50700-63~20.04          amd64        HIP: Heterogenous-computing Interface for Portability [HIP SAMPLES]
ii  hipblas                                1.1.0.50700-63~20.04              amd64        Radeon Open Compute BLAS marshalling library
ii  hipblas-dev                            1.1.0.50700-63~20.04              amd64        Radeon Open Compute BLAS marshalling library
ii  hipblaslt                              0.3.0.50700-63~20.04              amd64        Radeon Open Compute BLAS marshalling library
ii  hipblaslt-dev                          0.3.0.50700-63~20.04              amd64        Radeon Open Compute BLAS marshalling library
ii  hipcc                                  1.0.0.50700-63~20.04              amd64        HIP Compiler Driver
ii  hipcub-dev                             2.13.1.50700-63~20.04             amd64        hipCUB (rocPRIM backend)
ii  hipfft                                 1.0.12.50700-63~20.04             amd64        ROCm FFT marshalling library
ii  hipfft-dev                             1.0.12.50700-63~20.04             amd64        ROCm FFT marshalling library
ii  hipify-clang                           17.0.0.50700-63~20.04             amd64        Hipify CUDA source
ii  hipsolver                              1.8.1.50700-63~20.04              amd64        Radeon Open Compute LAPACK marshalling library
ii  hipsolver-dev                          1.8.1.50700-63~20.04              amd64        Radeon Open Compute LAPACK marshalling library
ii  hipsparse                              2.3.8.50700-63~20.04              amd64        Radeon Open Compute SPARSE library
ii  hipsparse-dev                          2.3.8.50700-63~20.04              amd64        Radeon Open Compute SPARSE library
ii  miopen-hip                             2.20.0.50700-63~20.04             amd64        AMD's DNN Library
ii  miopen-hip-dev                         2.20.0.50700-63~20.04             amd64        AMD's DNN Library
ii  miopen-hip-gfx1030kdb                  2.20.0.50700-63~20.04             amd64        AMD's DNN Library
ii  miopen-hip-gfx900kdb                   2.20.0.50700-63~20.04             amd64        AMD's DNN Library
ii  miopen-hip-gfx906kdb                   2.20.0.50700-63~20.04             amd64        AMD's DNN Library
ii  miopen-hip-gfx908kdb                   2.20.0.50700-63~20.04             amd64        AMD's DNN Library
ii  miopen-hip-gfx90akdb                   2.20.0.50700-63~20.04             amd64        AMD's DNN Library

Inside the container:

root@2ed394bc95a7:/root/benchmarks# pip3 list | grep -E 'tensor|hip|keras'
keras                        2.13.1
tensorboard                  2.13.0
tensorboard-data-server      0.7.1
tensorflow-estimator         2.13.0
tensorflow-io-gcs-filesystem 0.34.0
tensorflow-rocm              2.13.0.570

[notice] A new release of pip is available: 23.2.1 -> 23.3.2
[notice] To update, run: python3 -m pip install --upgrade pip
root@2ed394bc95a7:/root/benchmarks# 

ROCm Component

Other

Steps to Reproduce

See above.

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

LakshmiKumar23 commented 8 months ago

@kiritigowda This issue is not for rocAL or any of the MIVisionX components. It should be redirected to the rocm/tensorflow team.

ppanchad-amd commented 2 months ago

@baryluk Internal ticket has been created to investigate this issue. Thanks!

schung-amd commented 2 months ago

Hi @baryluk, are you still experiencing this issue? The current Docker image at rocm/tensorflow:latest has updated packages, and I was unable to reproduce the issue with the following package versions:

tf-docker /benchmarks > pip3 list | grep -E 'tensor|hip|keras'
keras                        2.15.0
tensorboard                  2.15.2
tensorboard-data-server      0.7.2
tensorflow-estimator         2.15.0
tensorflow-io-gcs-filesystem 0.36.0
tensorflow-rocm              2.15.0

Notably, however, the benchmarks have been deprecated since 2020 (see https://github.com/tensorflow/benchmarks/blob/master/README.md), and I had to manually apply two recent commits (https://github.com/tensorflow/benchmarks/commit/559f08f2cc76c9dbd30f25abc66acc516d1b4bd0 and https://github.com/tensorflow/benchmarks/commit/5996abc324ca18267e7299a74cc05249eb3b3c3a) to get them running. In general, I would not expect these benchmarks to run without modifying the files to match your current version of TensorFlow.
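
As an illustration only, here is a hedged sketch (my own, and not the content of the commits linked above) of the general shape of such a modification: prefer the legacy tf.compat.v1 layer that tf_cnn_benchmarks expects, and fall back to the supported Keras layer when the legacy shim is missing. The helper name make_batch_norm is hypothetical.

import tensorflow as tf

def make_batch_norm(**kwargs):
    # The legacy tf.compat.v1 layer is what tf_cnn_benchmarks uses; on TF 2.13
    # with Keras 2.13 its lazy import of 'keras.legacy_tf_layers' can fail.
    try:
        return tf.compat.v1.layers.BatchNormalization(**kwargs)
    except (ModuleNotFoundError, AttributeError):
        # Fall back to the supported Keras layer. The two constructors take
        # similar but not identical arguments, so the kwargs may need
        # adjusting for a real port.
        return tf.keras.layers.BatchNormalization(**kwargs)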

schung-amd commented 2 months ago

I'll be closing this issue; my recommendation is to use PerfZero for benchmarking (as recommended in https://github.com/tensorflow/benchmarks/blob/master/README.md), since benchmarks is deprecated. If you must use benchmarks instead of PerfZero and you experience a similar issue with the latest Docker image and the modifications I mentioned, please open a new issue.