Bug summary
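OS version: Ubuntu 22.04
Training a model with dpgen on an AMD 9654 + RTX 4090 machine fails at job submission with the error messages below, and nvidia-smi shows no task running on the GPU.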
DeepModeling
Version: 0.12.1
Path: /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen
Dependency
pymatgen unknown version or path
monty 2024.3.31 /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/monty
ase 3.22.1 /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/ase
paramiko 3.4.0 /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/paramiko
custodian 2024.3.12 /home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/custodian
Reference
Please cite:
Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E,
DP-GEN: A concurrent learning platform for the generation of reliable deep learning
based potential energy models, Computer Physics Communications, 2020, 107206.
Description
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
2024-07-27 17:25:54,605 - INFO : info:check_all_finished: False
2024-07-27 17:25:55,907 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a submit; job_id is 2712829
2024-07-27 17:27:58,236 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a 2712829 terminated; fail_cout is 1; resubmitting job
2024-07-27 17:27:58,443 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a re-submit after terminated; new job_id is 2713462
2024-07-27 17:27:58,856 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a job_id:2713462 after re-submitting; the state now is <JobStatus.running: 3>
2024-07-27 17:28:59,356 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a 2713462 terminated; fail_cout is 2; resubmitting job
2024-07-27 17:28:59,498 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a re-submit after terminated; new job_id is 2714035
2024-07-27 17:28:59,913 - INFO : job:a58884f446c397acd7f621a7f1b97585f327345a job_id:2714035 after re-submitting; the state now is <JobStatus.running: 3>
2024-07-27 17:30:00,419 - INFO : job: a58884f446c397acd7f621a7f1b97585f327345a 2714035 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:a58884f446c397acd7f621a7f1b97585f327345a 2714035 failed 3 times.
Possible remote error message: ==> /home/ps/Desktop/linlve/uo2dp/data/0d649e7e0b63ebcd14b5468adf1b273c876a1f73/000/train.log <==
.py", line 11, in <module>
import deepmd.utils.network as network
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/__init__.py", line 9, in <module>
from .learning_rate import (
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/learning_rate.py", line 8, in <module>
from deepmd.env import (
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 476, in <module>
op_module = get_module("deepmd_op")
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/env.py", line 447, in get_module
raise RuntimeError(error_message) from e
RuntimeError: This deepmd-kit package is inconsitent with TensorFlow Runtime, thus an error is raised when loading deepmd_op. You need to rebuild deepmd-kit against this TensorFlow runtime.
WARNING: devtoolset on RHEL6 and RHEL7 does not support _GLIBCXX_USE_CXX11_ABI=1. See https://bugzilla.redhat.com/show_bug.cgi?id=1546704
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ps/miniconda3/envs/deepmd/bin/dpgen", line 10, in <module>
sys.exit(main())
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/main.py", line 255, in main
args.func(args)
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 5394, in gen_run
run_iter(args.PARAM, args.MACHINE)
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 4725, in run_iter
run_train(ii, jdata, mdata)
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/generator/run.py", line 868, in run_train
submission.run_submission()
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 261, in run_submission
self.handle_unexpected_submission_state()
File "/home/ps/miniconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 362, in handle_unexpected_submission_state
raise RuntimeError(
RuntimeError: Meet errors will handle unexpected submission state.
Debug information: remote_root==/home/ps/Desktop/linlve/uo2dp/data/0d649e7e0b63ebcd14b5468adf1b273c876a1f73.
Debug information: submission_hash==0d649e7e0b63ebcd14b5468adf1b273c876a1f73.
Please check error messages above and in remote_root. The submission information is saved in /home/ps/.dpdispatcher/submission/0d649e7e0b63ebcd14b5468adf1b273c876a1f73.json.
For furthur actions, run the following command with proper flags: dpdisp submission 0d649e7e0b63ebcd14b5468adf1b273c876a1f73
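The remote traceback above fails while importing `deepmd`, before any training starts, so the import can be checked on its own in the same conda environment, outside of dpgen. The following is a minimal sketch of such a check, not part of dpgen or dpdispatcher: the helper names (`package_version`, `try_import`, `environment_report`) are hypothetical, and the distribution names `deepmd-kit` and `tensorflow` are assumptions that may differ for other builds (for example a differently named TensorFlow wheel).

```python
import importlib
import importlib.metadata
import importlib.util


def package_version(name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        return None


def try_import(module_name):
    """Import a module and return (ok, error_text) instead of raising."""
    try:
        importlib.import_module(module_name)
        return True, ""
    except Exception as exc:  # e.g. the RuntimeError raised in deepmd/env.py
        return False, f"{type(exc).__name__}: {exc}"


def environment_report(
    distributions=("deepmd-kit", "tensorflow", "dpgen", "dpdispatcher"),  # assumed names
    modules=("tensorflow", "deepmd"),
):
    """Collect installed versions and import results for the given names."""
    report = {
        "versions": {name: package_version(name) for name in distributions},
        "imports": {},
    }
    for module_name in modules:
        if importlib.util.find_spec(module_name) is None:
            report["imports"][module_name] = (False, "not installed")
        else:
            report["imports"][module_name] = try_import(module_name)
    return report


def print_report(report):
    """Print the report in a form that can be pasted into an issue."""
    for name, version in report["versions"].items():
        print(f"{name}: {version or 'not found'}")
    for module_name, (ok, error_text) in report["imports"].items():
        print(f"import {module_name}: {'ok' if ok else error_text}")
```

Running `print_report(environment_report())` on both machines would show whether the installed deepmd-kit and TensorFlow versions differ between them, and whether `import deepmd` already fails on the AMD machine without dpgen involved.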
The nvidia-smi output is shown below; no task is running on the GPU.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:41:00.0 Off |                  Off |
| 34%   29C    P8             18W /  450W |      14MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     33301      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+
On an Intel 8383C + RTX 4090 machine the same calculation runs normally, and nvidia-smi shows a task on the GPU.
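The "task on the GPU" comparison between the two machines can also be made without reading the full nvidia-smi table. The sketch below assumes nvidia-smi's CSV query mode (`--query-compute-apps` with `--format=csv,noheader`); the function names are hypothetical. Graphics-only processes such as Xorg (type `G`) are, as far as I understand, not reported by this query, so an empty list would correspond to the situation on the AMD machine.

```python
import shutil
import subprocess

# Query only compute processes, one CSV row per process, without a header line.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into a list of dicts."""
    apps = []
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        # Skip blank or malformed rows; the first field must be a numeric PID.
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        apps.append(
            {
                "pid": int(parts[0]),
                # Re-join the middle fields in case the process name contains commas.
                "name": ", ".join(parts[1:-1]),
                "used_memory": parts[-1],
            }
        )
    return apps


def list_compute_apps():
    """Return compute processes on the GPU, or None if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)
```

Calling `list_compute_apps()` while dpgen is in its training step, on each machine, gives a direct view of whether a training process ever reached the GPU.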
DP-GEN Version
0.12.1
Platform, Python Version, Remote Platform, etc
No response
Input Files, Running Commands, Error Log, etc.
/
Steps to Reproduce
/
Further Information, Files, and Links
No response