deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] With the same input file, training fails on AMD9654+4090 but is submitted and runs correctly on intel8383C+4090 #4042

Closed lue611 closed 1 month ago

lue611 commented 1 month ago

Bug summary

System version ubuntu-22.04

Training the model using dpgen and submitting it on an AMD9654+4090 machine produces the following error message, and nvidia-smi shows that the graphics card has no tasks.

DeepModeling
Version: 0.12.1
Path: /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen

Dependency
numpy      1.22.3     /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/numpy
dpdata     0.2.18     /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdata
pymatgen   unknown version or path
monty      2024.3.31  /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/monty
ase        3.22.1     /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/ase
paramiko   3.4.0      /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/paramiko
custodian  2024.3.12  /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/custodian

Reference
Please cite:
Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E, DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models, Computer Physics Communications, 2020, 107206.

Description
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
2024-07-27 17:25:54,605 - INFO : info:check_all_finished: False
2024-07-27 17:25:55,907 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a submit; job_id is 2712829
2024-07-27 17:27:58,236 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a 2712829 terminated; fail_cout is 1; resubmitting job
2024-07-27 17:27:58,443 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a re-submit after terminated; new job_id is 2713462
2024-07-27 17:27:58,856 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a job_id:2713462 after re-submitting; the state now is <JobStatus.running: 3>
2024-07-27 17:28:59,356 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a 2713462 terminated; fail_cout is 2; resubmitting job
2024-07-27 17:28:59,498 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a re-submit after terminated; new job_id is 2714035
2024-07-27 17:28:59,913 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a job_id:2714035 after re-submitting; the state now is <JobStatus.running: 3>
2024-07-27 17:30:00,419 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a 2714035 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
    job.handle_unexpected_job_state()
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
    raise RuntimeError(err_msg)
RuntimeError: job:a58884f446c397acd7f621a7f1b97585f327345a 2714035 failed 3 times.
Possible remote error message: ==> /home/ps/Desktop/linlve/uo2dp/data/0d649e7e0b63ebcd14b5468adf1b273c876a1f73/000/train.log <==
__.py", line 11, in <module>
    import deepmd.utils.network as network
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/__init__.py", line 9, in <module>
    from .learning_rate import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/learning_rate.py", line 8, in <module>
    from deepmd.env import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 476, in <module>
    op_module = get_module("deepmd_op")
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 447, in get_module
    raise RuntimeError(error_message) from e
RuntimeError: This deepmd-kit package is inconsitent with TensorFlow Runtime, thus an error is raised when loading deepmd_op. You need to rebuild deepmd-kit against this TensorFlow runtime.

WARNING: devtoolset on RHEL6 and RHEL7 does not support _GLIBCXX_USE_CXX11_ABI=1. See https://bugzilla.redhat.com/show_bug.cgi?id=1546704

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/bin/dpgen", line 10, in <module>
    sys.exit(main())
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/main.py", line 255, in main
    args.func(args)
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 5394, in gen_run
    run_iter(args.PARAM, args.MACHINE)
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 4725, in run_iter
    run_train(ii, jdata, mdata)
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 868, in run_train
    submission.run_submission()
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 261, in run_submission
    self.handle_unexpected_submission_state()
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 362, in handle_unexpected_submission_state
    raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/ps/Desktop/linlve/uo2dp/data/0d649e7e0b63ebcd14b5468adf1b273c876a1f73.
Debug information: submission_hash==0d649e7e0b63ebcd14b5468adf1b273c876a1f73.
Please check error messages above and in remote_root. The submission information is saved in /home/ps/.dpdispatcher/submission/0d649e7e0b63ebcd14b5468adf1b273c876a1f73.json.
For furthur actions, run the following command with proper flags: dpdisp submission 0d649e7e0b63ebcd14b5468adf1b273c876a1f73

nvidia-smi shows the GPU has no tasks:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:41:00.0 Off |                  Off |
| 34%   29C    P8             18W /  450W |      14MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     33301      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

On the intel8383C+4090 machine the computation runs normally, and nvidia-smi shows that the graphics card has tasks.

DP-GEN Version

0.12.1

Platform, Python Version, Remote Platform, etc

No response

Input Files, Running Commands, Error Log, etc.

/

Steps to Reproduce

/

Further Information, Files, and Links

No response

njzjz commented 1 month ago

Please post the entire error message in /home/ps/Desktop/linlve/uo2dp/data/0d649e7e0b63ebcd14b5468adf1b273c876a1f73/000/train.log.

lue611 commented 1 month ago

Hi, this is the error in train.log

WARNING:tensorflow:From /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 396, in get_module
    module = tf.load_op_library(str(module_file))
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/lib/libdeepmd_op.so: undefined symbol: _ZN6deepmd6RegionIfEC1EPfS2_

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd_utils/main.py", line 655, in main
    from deepmd.entrypoints.main import main as deepmd_main
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/__init__.py", line 11, in <module>
    import deepmd.utils.network as network
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/__init__.py", line 9, in <module>
    from .learning_rate import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/learning_rate.py", line 8, in <module>
    from deepmd.env import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 476, in <module>
    op_module = get_module("deepmd_op")
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 447, in get_module
    raise RuntimeError(error_message) from e
RuntimeError: This deepmd-kit package is inconsitent with TensorFlow Runtime, thus an error is raised when loading deepmd_op. You need to rebuild deepmd-kit against this TensorFlow runtime.

WARNING: devtoolset on RHEL6 and RHEL7 does not support _GLIBCXX_USE_CXX11_ABI=1. See https://bugzilla.redhat.com/show_bug.cgi?id=1546704
WARNING:tensorflow:From /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 396, in get_module
    module = tf.load_op_library(str(module_file))
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/lib/libdeepmd_op.so: undefined symbol: _ZN6deepmd6RegionIfEC1EPfS2_

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd_utils/main.py", line 655, in main
    from deepmd.entrypoints.main import main as deepmd_main
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/__init__.py", line 11, in <module>
    import deepmd.utils.network as network
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/__init__.py", line 9, in <module>
    from .learning_rate import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/learning_rate.py", line 8, in <module>
    from deepmd.env import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 476, in <module>
    op_module = get_module("deepmd_op")
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 447, in get_module
    raise RuntimeError(error_message) from e
RuntimeError: This deepmd-kit package is inconsitent with TensorFlow Runtime, thus an error is raised when loading deepmd_op. You need to rebuild deepmd-kit against this TensorFlow runtime.

WARNING: devtoolset on RHEL6 and RHEL7 does not support _GLIBCXX_USE_CXX11_ABI=1. See https://bugzilla.redhat.com/show_bug.cgi?id=1546704
WARNING:tensorflow:From /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 396, in get_module
    module = tf.load_op_library(str(module_file))
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/load_library.py", line 54, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/lib/libdeepmd_op.so: undefined symbol: _ZN6deepmd6RegionIfEC1EPfS2_

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd_utils/main.py", line 655, in main
    from deepmd.entrypoints.main import main as deepmd_main
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/__init__.py", line 11, in <module>
    import deepmd.utils.network as network
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/__init__.py", line 9, in <module>
    from .learning_rate import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/learning_rate.py", line 8, in <module>
    from deepmd.env import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 476, in <module>
    op_module = get_module("deepmd_op")
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 447, in get_module
    raise RuntimeError(error_message) from e
RuntimeError: This deepmd-kit package is inconsitent with TensorFlow Runtime, thus an error is raised when loading deepmd_op. You need to rebuild deepmd-kit against this TensorFlow runtime.

WARNING: devtoolset on RHEL6 and RHEL7 does not support _GLIBCXX_USE_CXX11_ABI=1. See https://bugzilla.redhat.com/show_bug.cgi?id=1546704

njzjz commented 1 month ago

tensorflow.python.framework.errors_impl.NotFoundError: /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/lib/libdeepmd_op.so: undefined symbol: _ZN6deepmd6RegionIfEC1EPfS2_

How did you install deepmd-kit? What is your compiler?

Could you post the output of the following commands?

nm -D /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/lib/libdeepmd_op.so
ldd /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/lib/libdeepmd_op.so
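
As a rough illustration of what to look for in the nm output: entries marked U are symbols that the library expects some other library to provide at load time, and the loader fails with "undefined symbol" when none does. The sketch below is plain Python and not part of deepmd-kit. The helper names undefined_symbols and run_nm and the sample text are hypothetical; only the mangled symbol is taken from the error quoted above.

```python
import subprocess


def undefined_symbols(nm_output: str) -> list[str]:
    """Return the names that `nm -D` marks as undefined (type letter 'U')."""
    names = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined entries have no address column, so they split into
        # exactly two fields: the type letter and the symbol name.
        if len(parts) == 2 and parts[0] == "U":
            names.append(parts[1])
    return names


def run_nm(library_path: str) -> str:
    """Run `nm -D` on a shared library and return its text output."""
    result = subprocess.run(
        ["nm", "-D", library_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical sample in the shape of `nm -D` output (not the real nm.txt).
sample = """\
                 U _ZN6deepmd6RegionIfEC1EPfS2_
0000000000012340 T _ZN7example3fooEv
                 w __gmon_start__
"""
print(undefined_symbols(sample))
```

On a real machine, undefined_symbols(run_nm(path_to_libdeepmd_op)) would give the full list; any name in it that no library listed by ldd defines is a candidate for the load failure.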

njzjz commented 1 month ago

I moved the issue to deepmd-kit as it seems more related.

lue611 commented 1 month ago

Hi. For deepmd-kit installation, I use the following command in '1.1.2. Install with conda':

conda create -n deepmd deepmd-kit==gpu libdeepmd==gpu lammps cudatoolkit=11.6 horovod -c https://conda.deepmodeling.com -c defaults

For dpgen installation, I use:

pip install dpgen

I used the same commands to install on both the 9654 and the 8383C machines.

The 'nm' command prints more than 1000 lines, so I saved the outputs to files: ldd.txt nm.txt

As for the compiler, I'm not very familiar with computer science, so I apologize if this is not the information you need. I ran 'uname -a' and got:

Linux control 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

njzjz commented 1 month ago

For deepmd-kit installation, I use the following command in '1.1.2. Install with conda':

I think I found the cause. Please run conda list to check whether deepmd-kit and libdeepmd have the same version. If they differ for any reason, consider removing and reinstalling libdeepmd.
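
The version check suggested here can be sketched in Python, assuming conda is on PATH and using the --json output of conda list, which is a list of records with name and version fields. The helper names and the sample data are hypothetical; in particular, which of the two packages carries which version in the sample is an assumption made for illustration.

```python
import json
import subprocess

# The two packages that should carry the same version.
NAMES = ["deepmd-kit", "libdeepmd"]


def package_versions(conda_list_json: str, names: list[str]) -> dict[str, str]:
    """Pick the versions of the named packages out of `conda list --json` text."""
    packages = json.loads(conda_list_json)
    return {p["name"]: p["version"] for p in packages if p["name"] in names}


def versions_match(versions: dict[str, str], names: list[str]) -> bool:
    """True only if every named package is installed and all versions agree."""
    all_present = all(name in versions for name in names)
    return all_present and len(set(versions.values())) == 1


def read_conda_list(env_name: str) -> str:
    """Return the JSON text printed by `conda list -n <env> --json`."""
    result = subprocess.run(
        ["conda", "list", "-n", env_name, "--json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical sample; the assignment of versions to packages is assumed.
sample = json.dumps([
    {"name": "deepmd-kit", "version": "2.2.10"},
    {"name": "libdeepmd", "version": "2.2.7"},
    {"name": "numpy", "version": "1.22.3"},
])
found = package_versions(sample, NAMES)
print(found, versions_match(found, NAMES))
```

In a real environment, package_versions(read_conda_list("deepmd"), NAMES) would replace the sample; a False result from versions_match points to the kind of mismatch described in this comment.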

lue611 commented 1 month ago

Thank you for your kind help. You are right: the versions of the two packages were 2.2.7 and 2.2.10. I reinstalled deepmd and the problem is solved.