Closed 343333333 closed 3 years ago
See #372
Thanks for replying. I fixed that problem, but ran into another one, which seems to come from dp. (By the way, my init data contains only 3 snapshots; is that all right?) I checked the log file of the training task, and it says:
# DEEPMD: ---Summary of DataSystem------------------------------------------------
# DEEPMD: found 1 system(s):
# DEEPMD: system natoms bch_sz n_bch n_test prob
# DEEPMD: ../data.init/deepmd 4 3 1 2 1.000
# DEEPMD: ------------------------------------------------------------------------
# DEEPMD:
# DEEPMD: training without frame parameter
Traceback (most recent call last):
  File "/home/ben/desktop/1/yes/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/main.py", line 73, in main
    train(args)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/train.py", line 87, in train
    _do_work(jdata, run_opt)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/train.py", line 140, in _do_work
    model.build (data, stop_batch)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/Trainer.py", line 227, in build
    self.model.data_stat(data)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/Model.py", line 115, in data_stat
    self._compute_input_stat(m_all_stat, protection = self.data_stat_protect)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/Model.py", line 120, in _compute_input_stat
    self.descrpt.compute_input_stats(all_stat['coord'],
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/DescrptSeA.py", line 121, in compute_input_stats
    = self._compute_dstats_sys_smth(cc,bb,tt,nn,mm)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/DescrptSeA.py", line 284, in _compute_dstats_sys_smth
    = self.sub_sess.run(self.stat_descrpt,
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 957, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1154, in _run
    raise ValueError(
ValueError: Cannot feed value of shape (3,) for Tensor 'd_sea_t_natoms:0', which has shape '(4,)'
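One possible reading of the ValueError above, offered as an assumption rather than a confirmed diagnosis: in DeePMD-kit 1.x the natoms vector is understood to hold two atom totals followed by one count per atom type, so its length is the number of types plus 2. Under that assumption, a value of shape (3,) would correspond to data with one atom type and a tensor of shape (4,) to a model set up for two types. The sketch below only reads a `type.raw` file (whitespace-separated integer type indices, one per atom). The helper names `read_types` and `natoms_vector` are hypothetical and not part of DeePMD-kit.

```python
from collections import Counter
from pathlib import Path


def read_types(type_raw_path):
    """Return the integer type indices stored in a type.raw file."""
    return [int(tok) for tok in Path(type_raw_path).read_text().split()]


def natoms_vector(types, ntypes=None):
    """Build [natoms, natoms, count_type_0, ...] (assumed layout).

    If ntypes is None, it is inferred from the data as max(type) + 1.
    """
    if ntypes is None:
        ntypes = max(types) + 1
    counts = Counter(types)
    natoms = len(types)
    return [natoms, natoms] + [counts.get(t, 0) for t in range(ntypes)]
```

For example, `len(natoms_vector(read_types("type.raw")))` can be compared with the number of types the training input declares, plus 2. A mismatch there would be consistent with the shape error, although the thread itself does not confirm this cause.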
And this is the dpgen log:
Description
------------
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
INFO:dpgen:new submission of 0972aa4f-dad6-496e-b8d2-28c618009df2 for chunk 8aefb06c426e07a0a671a1e2488b4858d694a730
INFO:dpgen:new submission of ad2bc6bf-826b-45e7-b5e9-27348968f9df for chunk e193a01ecf8d30ad0affefd332ce934e32ffce72
INFO:dpgen:job 0972aa4f-dad6-496e-b8d2-28c618009df2 terminated, submit again
INFO:dpgen:job ad2bc6bf-826b-45e7-b5e9-27348968f9df terminated, submit again
INFO:dpgen:job 0972aa4f-dad6-496e-b8d2-28c618009df2 terminated, submit again
INFO:dpgen:job ad2bc6bf-826b-45e7-b5e9-27348968f9df terminated, submit again
INFO:dpgen:job 0972aa4f-dad6-496e-b8d2-28c618009df2 terminated, submit again
INFO:dpgen:job ad2bc6bf-826b-45e7-b5e9-27348968f9df terminated, submit again
Traceback (most recent call last):
  File "/home/ben/.local/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/main.py", line 175, in main
    args.func(args)
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 2410, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 2373, in run_iter
    run_train (ii, jdata, mdata)
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 529, in run_train
    dispatcher.run_jobs(mdata['train_resources'],
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/dispatcher/Dispatcher.py", line 91, in run_jobs
    while not self.all_finished(job_handler, mark_failure) :
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/dispatcher/Dispatcher.py", line 216, in all_finished
    raise RuntimeError('Job %s failed for more than 3 times' % job_uuid)
RuntimeError: Job 0972aa4f-dad6-496e-b8d2-28c618009df2 failed for more than 3 times
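The dpgen log above only shows that each job was resubmitted until the retry limit was hit. The actual cause has to be read from the task's own log, as with the dp traceback earlier in this post. As a rough sketch, the resubmissions can be tallied per job UUID. The regular expression assumes the exact message format seen in this log, which other dpgen versions may not share, and `count_resubmissions` is a hypothetical helper, not a dpgen function.

```python
import re
from collections import Counter

# Matches lines such as:
# INFO:dpgen:job <uuid> terminated, submit again
RESUBMIT = re.compile(r"INFO:dpgen:job (\S+) terminated, submit again")


def count_resubmissions(log_text):
    """Count 'terminated, submit again' messages per job UUID."""
    return Counter(m.group(1) for m in RESUBMIT.finditer(log_text))
```

Passing the contents of the dpgen log to `count_resubmissions` gives a mapping from job UUID to the number of resubmissions, which makes it easy to see which training tasks keep failing.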
My machine.json is:
{
  "train": [
    {
      "command": "/home/ben/desktop/1/yes/bin/dp",
      "machine": {
        "batch": "shell",
        "_hostname": "localhost",
        "_port": 22,
        "username": "ben",
        "work_path": "/home/ben/desktop/work/dpgen/test3/temp"
      },
      "resources": {
        "numb_gpu": 0,
        "numb_node": 1,
        "task_per_node": 2,
        "partition": "cpu",
        "exclude_list": [],
        "mem_limit": 8,
        "source_list": [],
        "module_list": []
      }
    }
  ],
  "model_devi": [
    {
      "machine": {
        "batch": "shell",
        "work_path": "/home/ben/desktop/work/dpgen/test3/temp"
      },
      "resources": {
        "numb_gpu": 0,
        "task_per_node": 4,
        "partition": "cpu",
        "exclude_list": [],
        "mem_limit": 8,
        "source_list": [],
        "module_list": []
      },
      "command": " ~/desktop/1/lammps/src/lmp_mpi",
      "group_size": 1
    }
  ],
  "fp": [
    {
      "machine": {
        "batch": "shell",
        "work_path": "/home/ben/desktop/work/dpgen/test3/temp"
      },
      "resources": {
        "numb_gpu": 0,
        "task_per_node": 4,
        "with_mpi": false,
        "source_list": ["/home/ben/intel/parallel_studio_xe_2019.5.075/psxevars.sh"],
        "module_list": [],
        "partition": "cpu",
        "_envs": {"PATH": "/root/vasp/bin:$PATH"}
      },
      "command": "ulimit -s unlimited && mpirun -n 4 /home/ben/desktop/1/vasp.5.4.4/bin/vasp",
      "group_size": 30
    }
  ]
}
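Before launching dpgen, a file like this can be given a quick structural check. The sketch below is a minimal example that uses only the Python standard library. The stage names and the keys it looks for (`command`, `machine`, `resources`) are taken from the file above, not from dpgen's own validation rules, and `missing_keys` is a hypothetical helper.

```python
import json

STAGES = ("train", "model_devi", "fp")
REQUIRED = ("command", "machine", "resources")


def missing_keys(machine_json_text):
    """Return {(stage, index): [problems]} for incomplete entries.

    A stage that is absent from the file is reported under (stage, None).
    """
    config = json.loads(machine_json_text)
    problems = {}
    for stage in STAGES:
        entries = config.get(stage)
        if entries is None:
            problems[(stage, None)] = ["<stage not present>"]
            continue
        for i, entry in enumerate(entries):
            missing = [key for key in REQUIRED if key not in entry]
            if missing:
                problems[(stage, i)] = missing
    return problems
```

An empty result only means that the expected keys are present; it does not check that the paths or commands are valid on the machine. `json.loads` will also raise an error on malformed JSON, which catches stray commas or quotes early.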
Oh, I tried it out. The data set was too small. I used a bigger one, and it has worked well so far.
This issue is solved. I've closed this issue. If there is still any problem, you can reopen this issue or create a new issue.
Python version: 3.8.5; deepmd-kit version: 1.x; dpgen version: 0.9.3.dev9+g00432d2
Problem description: when I try to run the example in dpgen-master/tests/generator, the error below occurs. It seems that something goes wrong with the jinput, but I used the param-mg-vasp.json file as input without changing it. Could anyone tell me how to fix this? Thanks.
Description
The log file:
2021-05-06 13:49:57,770 - INFO : start running
2021-05-06 13:49:57,771 - INFO : =============================iter.000000==============================
2021-05-06 13:49:57,771 - INFO : -------------------------iter.000000 task 00--------------------------
(But I can train a model with standalone dp.)
And here is my machine config file: