When I run dp compress -i graph.pb -o graph-compress.pb an error occurs:
DEEPMD INFO stage 2: freeze the model
Traceback (most recent call last):
File "/opt/software/deepmd-kit/v2/bin/dp", line 10, in <module>
sys.exit(main())
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd_utils/main.py", line 657, in main
deepmd_main(args)
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/main.py", line 82, in main
compress(**dict_args)
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/compress.py", line 176, in compress
freeze(checkpoint_folder=checkpoint_folder, output=output, node_names=None)
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/freeze.py", line 518, in freeze
import horovod.tensorflow as HVD
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/horovod/tensorflow/__init__.py", line 44, in <module>
from horovod.tensorflow.sync_batch_norm import SyncBatchNormalization
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/horovod/tensorflow/sync_batch_norm.py", line 22, in <module>
class SyncBatchNormalization(tf.keras.layers.BatchNormalization):
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 170, in __getattr__
module = self._load()
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 50, in _load
module = importlib.import_module(self.__name__)
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 170, in __getattr__
module = self._load()
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 50, in _load
module = importlib.import_module(self.__name__)
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 143, in __getattr__
if item in ("_mode", "_initialized", "_name"):
RecursionError: maximum recursion depth exceeded in comparison
I am using the offline package deepmd-kit-2.2.10-cuda124-Linux-x86_64 with TensorFlow 2.15.0 and CUDA 12.4. No other modules are installed. The bug occurs when following the official tutorial.
DeePMD-kit Version
deepmd-kit-2.2.10-cuda124
Backend and its version
TensorFlow v2.15.0
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
(base) [user@node 01.train]$ dp compress -i graph.pb -o graph-compress.pb
2024-06-27 12:36:25.509652: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-06-27 12:36:25.509823: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-06-27 12:36:25.510661: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-06-27 12:36:25.517532: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:From /opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:108: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-27 12:36:28.717776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:28.720166: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:28.736639: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-06-27 12:36:28.778299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:28.779827: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:29.175183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:29.176735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
DEEPMD INFO
DEEPMD INFO stage 1: compress the model
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
WARNING:tensorflow:Your environment has TF_USE_LEGACY_KERAS set to True, but you do not have the tf_keras package installed. You must install it in order to use the legacy tf.keras. Install it via: `pip install tf_keras`
DEEPMD WARNING Switch to serial execution due to lack of horovod module.
DEEPMD INFO _____ _____ __ __ _____ _ _ _
DEEPMD INFO | __ \ | __ \ | \/ || __ \ | | (_)| |
DEEPMD INFO | | | | ___ ___ | |__) || \ / || | | | ______ | | __ _ | |_
DEEPMD INFO | | | | / _ \ / _ \| ___/ | |\/| || | | ||______|| |/ /| || __|
DEEPMD INFO | |__| || __/| __/| | | | | || |__| | | < | || |_
DEEPMD INFO |_____/ \___| \___||_| |_| |_||_____/ |_|\_\|_| \__|
DEEPMD INFO Please read and cite:
DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
DEEPMD INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD INFO installed to: /usr/local
DEEPMD INFO source :
DEEPMD INFO source brach: HEAD
DEEPMD INFO source commit: e08ccaf
DEEPMD INFO source commit at: 2024-04-06 23:39:30 +0000
DEEPMD INFO build float prec: double
DEEPMD INFO build variant: cuda
DEEPMD INFO build with tf inc: /opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/include/;/opt/software/deepmd-kit/v2/include
DEEPMD INFO build with tf lib:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: node
DEEPMD INFO computing device: gpu:0
DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO Count of visible GPU: 2
DEEPMD INFO num_intra_threads: 0
DEEPMD INFO num_inter_threads: 0
DEEPMD INFO -----------------------------------------------------------------
2024-06-27 12:36:33.411884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.413405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
DEEPMD INFO training without frame parameter
2024-06-27 12:36:33.509570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.511142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:33.525433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.526960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:33.574807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.576330: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:33.619329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:33.620883: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
DEEPMD INFO training data with lower boundary: [-0.36005299 -0.38848973]
DEEPMD INFO training data with upper boundary: [7.68739894 8.69554 ]
2024-06-27 12:36:34.737253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:34.744968: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:34.825751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:34.827318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
2024-06-27 12:36:34.874729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
2024-06-27 12:36:34.876301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79087 MB memory: -> device: 1, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:8b:00.0, compute capability: 8.0
DEEPMD INFO built lr
DEEPMD INFO built network
DEEPMD INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2024-06-27 12:36:35.944095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1929] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79087 MB memory: -> device: 0, name: NVIDIA A100-SXM4-80GB, pci bus id: 0000:88:00.0, compute capability: 8.0
DEEPMD INFO initialize model from scratch
DEEPMD INFO finished compressing
DEEPMD INFO
DEEPMD INFO stage 2: freeze the model
Traceback (most recent call last):
File "/opt/software/deepmd-kit/v2/bin/dp", line 10, in <module>
sys.exit(main())
^^^^^^
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd_utils/main.py", line 657, in main
deepmd_main(args)
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/main.py", line 82, in main
compress(**dict_args)
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/compress.py", line 176, in compress
freeze(checkpoint_folder=checkpoint_folder, output=output, node_names=None)
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/deepmd/entrypoints/freeze.py", line 518, in freeze
import horovod.tensorflow as HVD
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/horovod/tensorflow/__init__.py", line 44, in <module>
from horovod.tensorflow.sync_batch_norm import SyncBatchNormalization
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/horovod/tensorflow/sync_batch_norm.py", line 22, in <module>
class SyncBatchNormalization(tf.keras.layers.BatchNormalization):
^^^^^^^^^^^^^^^
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 170, in __getattr__
module = self._load()
^^^^^^^^^^^^
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 50, in _load
module = importlib.import_module(self.__name__)
^^^^^^^^^^^^^
...(skipped repeated lines)...
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 50, in _load
module = importlib.import_module(self.__name__)
^^^^^^^^^^^^^
File "/opt/software/deepmd-kit/v2/lib/python3.11/site-packages/tensorflow/python/util/lazy_loader.py", line 143, in __getattr__
if item in ("_mode", "_initialized", "_name"):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RecursionError: maximum recursion depth exceeded in comparison
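The last DeePMD-kit frame in the traceback is the statement `import horovod.tensorflow as HVD` in freeze.py. To check whether that import fails on its own, outside `dp compress`, a minimal sketch along these lines could be used. The helper name `try_import` is hypothetical; it runs the import in a fresh interpreter so that a failure cannot affect the calling process.

```python
import subprocess
import sys


def try_import(module_name, python=sys.executable):
    """Import module_name in a fresh interpreter.

    Returns (ok, last_stderr_line). The last stderr line is usually the
    exception line when the import fails, but it can also be an unrelated
    warning when the import succeeds.
    """
    proc = subprocess.run(
        [python, "-c", f"import {module_name}"],
        capture_output=True,
        text=True,
    )
    lines = proc.stderr.strip().splitlines()
    return proc.returncode == 0, (lines[-1] if lines else "")


if __name__ == "__main__":
    # Same module that freeze.py imports in the traceback above.
    ok, last_line = try_import("horovod.tensorflow")
    print("import ok:", ok)
    print("last stderr line:", last_line)
```

If this reports the same RecursionError when run with the Python interpreter of the offline package, the problem can be reproduced without a trained model.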
Steps to Reproduce
wget https://dp-public.oss-cn-beijing.aliyuncs.com/community/DeePMD-kit-FastLearn.tar
tar xvf DeePMD-kit-FastLearn.tar
cd DeePMD-kit-FastLearn/01.train
dp train input.json
dp freeze -o graph.pb
dp compress -i graph.pb -o graph-compress.pb
Further Information, Files, and Links
There is no such error with deepmd-kit-2.2.10-cuda118-Linux-x86_64 (TensorFlow 2.14 and CUDA 11.8).
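
The log above contains two messages that may be related to the failure: the warning that TF_USE_LEGACY_KERAS is set while the tf_keras package is not installed, and the "lack of horovod module" warning, even though the traceback shows horovod files under site-packages. Whether either of these is the actual cause is not established here. A small sketch for collecting those facts in both the cuda124 and cuda118 installations, without importing TensorFlow, might look like this (`diagnose_env` is a hypothetical helper name):

```python
import importlib.util
import os


def diagnose_env(environ=None, find_spec=importlib.util.find_spec):
    """Collect the settings that the warnings in the log refer to.

    Nothing here imports TensorFlow or horovod, so this function cannot
    itself trigger the RecursionError.
    """
    environ = os.environ if environ is None else environ
    return {
        # Value of the variable named in the tf.keras warning (None if unset).
        "TF_USE_LEGACY_KERAS": environ.get("TF_USE_LEGACY_KERAS"),
        # True if the top-level package can be located on the import path.
        "tf_keras_installed": find_spec("tf_keras") is not None,
        "horovod_installed": find_spec("horovod") is not None,
    }


if __name__ == "__main__":
    for key, value in diagnose_env().items():
        print(f"{key}: {value}")
```

Comparing the output from the two offline packages would show whether they differ in any of these three points.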