deepmodeling / dpgen

The deep potential generator to generate a deep-learning based model of interatomic potential energy and force field
https://docs.deepmodeling.com/projects/dpgen/
GNU Lesser General Public License v3.0

[BUG] With identical files, training fails on AMD 9654 + 4090 but submits normally on Intel 8383C + 4090 #1602

Closed lue611 closed 1 month ago

lue611 commented 1 month ago

Bug summary

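OS version: Ubuntu 22.04

I am training a model with dpgen. When the job is submitted on the AMD 9654 + 4090 machine, the error below is reported, and nvidia-smi shows that no task is running on the GPU.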

DeepModeling

Version: 0.12.1
Path: /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen

Dependency

numpy      1.22.3     /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/numpy
dpdata     0.2.18     /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdata
pymatgen   unknown version or path
monty      2024.3.31  /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/monty
ase        3.22.1     /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/ase
paramiko   3.4.0      /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/paramiko
custodian  2024.3.12  /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/custodian

Reference

Please cite: Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E, DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models, Computer Physics Communications, 2020, 107206.

Description

INFO:dpgen:-------------------------iter.000000 task 01--------------------------
2024-07-27 17:25:54,605 - INFO : info:check_all_finished: False
2024-07-27 17:25:55,907 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a submit; job_id is 2712829
2024-07-27 17:27:58,236 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a 2712829 terminated; fail_cout is 1; resubmitting job
2024-07-27 17:27:58,443 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a re-submit after terminated; new job_id is 2713462
2024-07-27 17:27:58,856 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a job_id:2713462 after re-submitting; the state now is <JobStatus.running: 3>
2024-07-27 17:28:59,356 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a 2713462 terminated; fail_cout is 2; resubmitting job
2024-07-27 17:28:59,498 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a re-submit after terminated; new job_id is 2714035
2024-07-27 17:28:59,913 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a job_id:2714035 after re-submitting; the state now is <JobStatus.running: 3>
2024-07-27 17:30:00,419 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a 2714035 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
    job.handle_unexpected_job_state()
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
    raise RuntimeError(err_msg)
RuntimeError: job:a58884f446c397acd7f621a7f1b97585f327345a 2714035 failed 3 times.
Possible remote error message: ==> /home/ps/Desktop/linlve/uo2dp/data/0d649e7e0b63ebcd14b5468adf1b273c876a1f73/000/train.log <==
.py", line 11, in <module>
    import deepmd.utils.network as network
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/__init__.py", line 9, in <module>
    from .learning_rate import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/learning_rate.py", line 8, in <module>
    from deepmd.env import (
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 476, in <module>
    op_module = get_module("deepmd_op")
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 447, in get_module
    raise RuntimeError(error_message) from e
RuntimeError: This deepmd-kit package is inconsitent with TensorFlow Runtime, thus an error is raised when loading deepmd_op. You need to rebuild deepmd-kit against this TensorFlow runtime.
WARNING: devtoolset on RHEL6 and RHEL7 does not support _GLIBCXX_USE_CXX11_ABI=1. See https://bugzilla.redhat.com/show_bug.cgi?id=1546704

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ps/miniconda3/envs/deepmd/bin/dpgen", line 10, in <module>
    sys.exit(main())
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/main.py", line 255, in main
    args.func(args)
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 5394, in gen_run
    run_iter(args.PARAM, args.MACHINE)
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 4725, in run_iter
    run_train(ii, jdata, mdata)
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 868, in run_train
    submission.run_submission()
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 261, in run_submission
    self.handle_unexpected_submission_state()
  File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 362, in handle_unexpected_submission_state
    raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/ps/Desktop/linlve/uo2dp/data/0d649e7e0b63ebcd14b5468adf1b273c876a1f73.
Debug information: submission_hash==0d649e7e0b63ebcd14b5468adf1b273c876a1f73.
Please check error messages above and in remote_root. The submission information is saved in /home/ps/.dpdispatcher/submission/0d649e7e0b63ebcd14b5468adf1b273c876a1f73.json.
For furthur actions, run the following command with proper flags: dpdisp submission 0d649e7e0b63ebcd14b5468adf1b273c876a1f73

The nvidia-smi output is shown below; no task is running on the GPU:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:41:00.0 Off |                  Off |
| 34%   29C    P8             18W /  450W |      14MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     33301      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

On the Intel 8383C + 4090 machine the calculation runs normally, and nvidia-smi shows a task running on the GPU.
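Since the train.log traceback fails while importing deepmd (before any GPU work starts), one way to narrow this down is to run the same import check in the `deepmd` conda environment on both machines and compare the results. The following is only a minimal sketch: the helper names `probe_import` and `environment_report` are hypothetical, and the module names `tensorflow` and `deepmd` are assumed to be the importable package names in this environment.

```python
import importlib
import platform


def probe_import(module_name):
    """Try to import a module; return (ok, detail) instead of raising.

    Hypothetical helper for illustration only.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # e.g. the RuntimeError raised while loading deepmd_op
        return False, f"{type(exc).__name__}: {exc}"
    # Not every package exposes __version__, so fall back to a placeholder.
    return True, getattr(module, "__version__", "unknown version")


def environment_report(module_names=("tensorflow", "deepmd")):
    """Collect basic interpreter/CPU info plus one import result per module."""
    report = {
        "python": platform.python_version(),
        "machine": platform.machine(),
        "processor": platform.processor(),
    }
    for name in module_names:
        report[name] = probe_import(name)
    return report


if __name__ == "__main__":
    # Run this on both the AMD and the Intel machine and compare the output.
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If `deepmd` fails to import on one machine only, the difference is in the Python environment or its compiled libraries on that machine rather than in the dpgen input files.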

DP-GEN Version

0.12.1

Platform, Python Version, Remote Platform, etc

No response

Input Files, Running Commands, Error Log, etc.

/

Steps to Reproduce

/

Further Information, Files, and Links

No response

njzjz commented 1 month ago

We don't allow non-English issues.