deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] RecursionError with dp compress in v2.2.10 with TF 2.15 #3920

Closed kimurin closed 1 week ago

kimurin commented 1 week ago

Bug summary

Running dp compress -i graph.pb -o graph-compress.pb fails at stage 2 (freezing the model) with a RecursionError:

DEEPMD INFO    stage 2: freeze the model
Traceback (most recent call last):
  File "/opt/software/deepmd-kit/v2/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd_utils/main.py", line 657, in main
    deepmd_main(args)
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/main.py", line 82, in main
    compress(**dict_args)
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/compress.py", line 176, in compress
    freeze(checkpoint_folder=checkpoint_folder, output=output, node_names=None)
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/freeze.py", line 518, in freeze
    import horovod.tensorflow as HVD
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/horovod/tensorflow/__init__.py", line 44, in <module>
    from horovod.tensorflow.sync_batch_norm import SyncBatchNormalization
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/horovod/tensorflow/sync_batch_norm.py", line 22, in <module>
    class SyncBatchNormalization(tf.keras.layers.BatchNormalization):
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 170, in __getattr__
    module = self._load()
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 50, in _load
    module = importlib.import_module(self.__name__)
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 170, in __getattr__
    module = self._load()
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 50, in _load
    module = importlib.import_module(self.__name__)
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 143, in __getattr__
    if item in ("_mode", "_initialized", "_name"):
RecursionError: maximum recursion depth exceeded in comparison

I am using the offline package deepmd-kit-2.2.10-cuda124-Linux-x86_64 (TensorFlow 2.15.0, CUDA 12.4) with no additional modules installed. The bug occurs when following the official tutorial.
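For readers unfamiliar with this failure pattern: in the traceback, lazy_loader.py alternates between __getattr__ (which calls self._load()) and _load (which reads self.__name__) until Python's recursion limit is hit. The sketch below reproduces that general pattern with a toy class. ToyLazyLoader, target_name, and demo are hypothetical names used only for illustration; this is not TensorFlow's LazyLoader code and says nothing about why the attribute lookup fails in this installation.

```python
class ToyLazyLoader:
    """Hypothetical stand-in for a lazy loader; NOT TensorFlow's implementation."""

    def _load(self):
        # `target_name` was never set on the instance, so normal lookup
        # fails and Python falls back to __getattr__ below.
        return self.target_name

    def __getattr__(self, item):
        # __getattr__ runs only when normal attribute lookup fails.
        # It calls _load(), which triggers __getattr__ again, and so on.
        module = self._load()
        return getattr(module, item)


def demo():
    """Trigger the mutual recursion and report the exception type."""
    try:
        ToyLazyLoader().layers
    except RecursionError as exc:
        return type(exc).__name__
    return "no error"


print(demo())
```

The point of the sketch is only that a __getattr__ hook which itself depends on a missing attribute never terminates, which matches the shape of the repeated frames in the traceback.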

DeePMD-kit Version

deepmd-kit-2.2.10-cuda124

Backend and its version

TensorFlow v2.15.0

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

(base) [user@node 01.train]$ dp compress -i graph.pb -o graph-compress.pb
2024-06-27 12:36:25.509652: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-27 12:36:25.509823: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-27 12:36:25.510661: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-27 12:36:25.517532: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From /opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-27 12:36:28.717776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:28.720166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:28.736639: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-06-27 12:36:28.778299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:28.779827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:29.175183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:29.176735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
DEEPMD INFO

DEEPMD INFO    stage 1: compress the model
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
DEEPMD WARNING Switch to serial execution due to lack of horovod module.
DEEPMD INFO     _____               _____   __  __  _____           _     _  _
DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
DEEPMD INFO    Please read and cite:
DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO    Zeng et al, J. Chem. Phys., 159, 054801 (2023)
DEEPMD INFO    See https://deepmd.rtfd.io/credits/ for details.
DEEPMD INFO    installed to:         /usr/local
DEEPMD INFO    source :
DEEPMD INFO    source brach:         HEAD
DEEPMD INFO    source commit:        e08ccaf
DEEPMD INFO    source commit at:     2024-04-06 23:39:30 +0000
DEEPMD INFO    build float prec:     double
DEEPMD INFO    build variant:        cuda
DEEPMD INFO    build with tf inc:    /opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/include/;/opt/software/deepmd-kit/v2/include
DEEPMD INFO    build with tf lib:
DEEPMD INFO    ---Summary of the training---------------------------------------
DEEPMD INFO    running on:           node
DEEPMD INFO    computing device:     gpu:0
DEEPMD INFO    CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO    Count of visible GPU: 2
DEEPMD INFO    num_intra_threads:    0
DEEPMD INFO    num_inter_threads:    0
DEEPMD INFO    -----------------------------------------------------------------
2024-06-27 12:36:33.411884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.413405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
DEEPMD INFO    training without frame parameter
2024-06-27 12:36:33.509570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.511142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:33.525433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.526960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:33.574807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.576330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:33.619329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.620883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
DEEPMD INFO    training data with lower boundary: [-0.36005299 -0.38848973]
DEEPMD INFO    training data with upper boundary: [7.68739894 8.69554   ]
2024-06-27 12:36:34.737253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:34.744968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:34.825751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:34.827318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:34.874729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:34.876301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory:  -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
DEEPMD INFO    built lr
DEEPMD INFO    built network
DEEPMD INFO    built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-27 12:36:35.944095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory:  -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
DEEPMD INFO    initialize model from scratch
DEEPMD INFO    finished compressing
DEEPMD INFO

DEEPMD INFO    stage 2: freeze the model
Traceback (most recent call last):
  File "/opt/software/deepmd-kit/v2/bin/dp", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd_utils/main.py", line 657, in main
    deepmd_main(args)
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/main.py", line 82, in main
    compress(**dict_args)
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/compress.py", line 176, in compress
    freeze(checkpoint_folder=checkpoint_folder, output=output, node_names=None)
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/freeze.py", line 518, in freeze
    import horovod.tensorflow as HVD
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/horovod/tensorflow/__init__.py", line 44, in <module>
    from horovod.tensorflow.sync_batch_norm import SyncBatchNormalization
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/horovod/tensorflow/sync_batch_norm.py", line 22, in <module>
    class SyncBatchNormalization(tf.keras.layers.BatchNormalization):
                                 ^^^^^^^^^^^^^^^
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 170, in __getattr__
    module = self._load()
             ^^^^^^^^^^^^
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 50, in _load
    module = importlib.import_module(self.__name__)
                                     ^^^^^^^^^^^^^
...(skipped repeated lines)...
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 50, in _load
    module = importlib.import_module(self.__name__)
                                     ^^^^^^^^^^^^^
  File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 143, in __getattr__
    if item in ("_mode", "_initialized", "_name"):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded in comparison

Steps to Reproduce

wget https://dp-public.oss-cn-beijing.aliyuncs.com/community/DeePMD-kit-FastLearn.tar
tar xvf DeePMD-kit-FastLearn.tar
cd DeePMD-kit-FastLearn/01.train
dp train input.json
dp freeze -o graph.pb
dp compress -i graph.pb -o graph-compress.pb

Further Information, Files, and Links

There is no such error with deepmd-kit-2.2.10-cuda118-Linux-x86_64 (TensorFlow 2.14 and CUDA 11.8).
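When comparing the two installations, it may help to check the conditions the log itself mentions: the warning that TF_USE_LEGACY_KERAS is set while tf_keras is not installed, and the presence of horovod, which the traceback shows being imported. The following is a minimal sketch of such a check. The function name check_legacy_keras_env and the set of values treated as "true" are assumptions made for illustration, and a positive finding does not by itself establish the cause of the RecursionError.

```python
import importlib.util
import os


def check_legacy_keras_env(env=None, find_spec=importlib.util.find_spec):
    """Return a list of findings about the Keras/Horovod setup.

    `env` and `find_spec` can be overridden so the check is testable
    without touching the real environment.
    """
    env = os.environ if env is None else env
    findings = []

    # Assumption: these spellings are treated as "set to True".
    flag = env.get("TF_USE_LEGACY_KERAS", "")
    legacy_requested = flag.lower() in ("1", "true")

    # find_spec returns None when a top-level package cannot be found.
    if legacy_requested and find_spec("tf_keras") is None:
        findings.append("TF_USE_LEGACY_KERAS is set but tf_keras is not installed")
    if find_spec("horovod") is not None:
        findings.append("horovod is installed")
    return findings


if __name__ == "__main__":
    for finding in check_legacy_keras_env():
        print(finding)
```

Running this with the Python interpreter of each installation (here, the one under /opt/software/deepmd-kit/v2) gives a quick side-by-side comparison of the cuda124 and cuda118 packages.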

njzjz commented 1 week ago

Thanks for reporting. This was fixed by https://github.com/conda-forge/deepmd-kit-feedstock/pull/75, but it seems that the offline package fetched the old package. I'll trigger a rebuild then.

njzjz commented 1 week ago

Rebuilt: https://github.com/deepmd-kit-recipes/installer/actions/runs/9702439612