deepmodeling / dpgen

The deep potential generator: generates deep-learning-based models of interatomic potential energy and force fields
https://docs.deepmodeling.com/projects/dpgen/
GNU Lesser General Public License v3.0

Problem when running the dpgen test on a local system #398

Closed · 343333333 closed 3 years ago

343333333 commented 3 years ago

Python version: 3.8.5, deepmd-kit version: 1.x, dpgen version: 0.9.3.dev9+g00432d2

Problem description: when I try running the example in dpgen-master/tests/generator, the error below occurs. It seems like jinput goes wrong somewhere, but I didn't change the param-mg-vasp.json input file. Could anyone tell me how to fix this? Thanks.

Traceback (most recent call last):
  File "/home/ben/.local/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/main.py", line 175, in main
    args.func(args)
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 2410, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 2369, in run_iter
    make_train (ii, jdata, mdata)
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 312, in make_train
    jinput['training']['systems'] = init_data_sys
KeyError: 'training'

The log file reads:

2021-05-06 13:49:57,770 - INFO : start running
2021-05-06 13:49:57,771 - INFO : =============================iter.000000==============================
2021-05-06 13:49:57,771 - INFO : -------------------------iter.000000 task 00--------------------------

(But I can use standalone dp to train a model.)
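For context on the traceback: make_train builds the deepmd input from the default_training_param block of param.json and then writes jinput['training']['systems'], so the KeyError means that block has no top-level training section, typically because it still follows the flat deepmd-kit 0.x layout while a 1.x-style input is expected here. A minimal sketch of the 1.x layout, assuming a two-element Mg/Al system (all values below are placeholders, and dpgen fills training.systems itself):

"default_training_param": {
    "model": {
        "type_map": ["Mg", "Al"],
        "descriptor": {
            "type": "se_a",
            "sel": [90, 90],
            "rcut_smth": 2.0,
            "rcut": 6.0,
            "neuron": [25, 50, 100]
        },
        "fitting_net": {
            "neuron": [240, 240, 240]
        }
    },
    "learning_rate": {
        "type": "exp",
        "start_lr": 0.001,
        "decay_steps": 5000,
        "decay_rate": 0.95
    },
    "loss": {
        "start_pref_e": 0.02,
        "limit_pref_e": 1,
        "start_pref_f": 1000,
        "limit_pref_f": 1
    },
    "training": {
        "systems": [],
        "set_prefix": "set",
        "stop_batch": 400000,
        "batch_size": 1,
        "numb_test": 2,
        "disp_freq": 2000,
        "save_freq": 2000
    }
}

The training.systems entry is overwritten with the init data at runtime, which is exactly the line that raised the KeyError above.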

And here is my machine config file:


{
"train": [
    {
    "machine": {
        "batch": "shell",
        "work_path": "/home/ben/desktop/work/dpgen/test2/temp"
        },
    "resources": {
        "numb_gpu": 0,
        "task_per_node": 8,
        "partition": "cpu",
        "exclude_list": [],
        "mem_limit": 8,
        "source_list": [],
        "module_list": []
        },
    "command": "/home/ben/desktop/1/yes/bin/dp",
    "group_size": 1
    }
],
"model_devi": [
    {
    "machine": {
        "batch": "shell",
        "work_path": "/home/ben/desktop/work/dpgen/test2/temp"
    },
    "resources": {
        "numb_gpu": 0,
        "task_per_node": 8,
        "partition": "cpu",
        "exclude_list": [],
        "mem_limit": 8,
        "source_list": [],
        "module_list": []
        },
    "command": " ~/desktop/1/lammps/src/lmp_mpi",
    "group_size": 1
    }
],
"fp": [
    {
    "machine": {
        "batch": "shell",
        "work_path": "/home/ben/desktop/work/dpgen/test2/temp"
    },
    "resources": {
        "numb_gpu": 0,
        "task_per_node": 8,
        "with_mpi": false,
        "source_list": ["/home/ben/intel/parallel_studio_xe_2019.5.075/psxevars.sh"],
        "module_list": [],
        "partition": "cpu",
    },
    "command": "ulimit -s unlimited && mpirun -n 4 /home/ben/desktop/1/vasp.5.4.4/bin/vasp",
    "group_size": 30
    }
    ]
}
AnguseZhang commented 3 years ago

See https://github.com/deepmodeling/dpgen/issues/372

343333333 commented 3 years ago

See #372

Thanks for replying. I fixed that problem but ran into another; it seems to be a problem with dp. (BTW, my init data has only 3 snapshots; is that all right?) I checked the log file from the job, and it says:

# DEEPMD: ---Summary of DataSystem------------------------------------------------
# DEEPMD: found 1 system(s):
# DEEPMD:                                     system  natoms  bch_sz  n_bch   n_test   prob
# DEEPMD:                        ../data.init/deepmd       4       3       1       2  1.000
# DEEPMD: ------------------------------------------------------------------------
# DEEPMD: 
# DEEPMD: training without frame parameter
Traceback (most recent call last):
  File "/home/ben/desktop/1/yes/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/main.py", line 73, in main
    train(args)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/train.py", line 87, in train
    _do_work(jdata, run_opt)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/train.py", line 140, in _do_work
    model.build (data, stop_batch)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/Trainer.py", line 227, in build
    self.model.data_stat(data)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/Model.py", line 115, in data_stat
    self._compute_input_stat(m_all_stat, protection = self.data_stat_protect)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/Model.py", line 120, in _compute_input_stat
    self.descrpt.compute_input_stats(all_stat['coord'],
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/DescrptSeA.py", line 121, in compute_input_stats
    = self._compute_dstats_sys_smth(cc,bb,tt,nn,mm)
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/deepmd/DescrptSeA.py", line 284, in _compute_dstats_sys_smth
    = self.sub_sess.run(self.stat_descrpt, 
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 957, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/ben/desktop/1/yes/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1154, in _run
    raise ValueError(
ValueError: Cannot feed value of shape (3,) for Tensor 'd_sea_t_natoms:0', which has shape '(4,)'
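A note on this ValueError: in deepmd-kit the natoms placeholder has length 2 + ntypes (nloc, nall, then one count per atom type), so a shape of (4,) means the model was built for two atom types while the data being fed covers only one. That reading is consistent with the fix reported further down, since a larger data set would also cover every element in type_map. If this is the cause, the type_map and the descriptor's sel list in default_training_param (one sel entry per type) have to match the data; a single-type sketch, assuming a pure-Mg system:

"model": {
    "type_map": ["Mg"],
    "descriptor": {
        "type": "se_a",
        "sel": [90]
    }
}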

And this is dpgen's log:

INFO:dpgen:-------------------------iter.000000 task 01--------------------------
INFO:dpgen:new submission of 0972aa4f-dad6-496e-b8d2-28c618009df2 for chunk 8aefb06c426e07a0a671a1e2488b4858d694a730
INFO:dpgen:new submission of ad2bc6bf-826b-45e7-b5e9-27348968f9df for chunk e193a01ecf8d30ad0affefd332ce934e32ffce72
INFO:dpgen:job 0972aa4f-dad6-496e-b8d2-28c618009df2 terminated, submit again
INFO:dpgen:job ad2bc6bf-826b-45e7-b5e9-27348968f9df terminated, submit again
INFO:dpgen:job 0972aa4f-dad6-496e-b8d2-28c618009df2 terminated, submit again
INFO:dpgen:job ad2bc6bf-826b-45e7-b5e9-27348968f9df terminated, submit again
INFO:dpgen:job 0972aa4f-dad6-496e-b8d2-28c618009df2 terminated, submit again
INFO:dpgen:job ad2bc6bf-826b-45e7-b5e9-27348968f9df terminated, submit again
Traceback (most recent call last):
  File "/home/ben/.local/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/main.py", line 175, in main
    args.func(args)
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 2410, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 2373, in run_iter
    run_train  (ii, jdata, mdata)
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/generator/run.py", line 529, in run_train
    dispatcher.run_jobs(mdata['train_resources'],
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/dispatcher/Dispatcher.py", line 91, in run_jobs
    while not self.all_finished(job_handler, mark_failure) :
  File "/home/ben/.local/lib/python3.8/site-packages/dpgen/dispatcher/Dispatcher.py", line 216, in all_finished
    raise RuntimeError('Job %s failed for more than 3 times' % job_uuid)
RuntimeError: Job 0972aa4f-dad6-496e-b8d2-28c618009df2 failed for more than 3 times

My machine.json is:

{
"train": [
    {
        "command": "/home/ben/desktop/1/yes/bin/dp",
    "machine": {
        "batch": "shell",
        "_hostname": "localhost",
        "_port" : 22,
        "username" :"ben" ,
        "work_path": "/home/ben/desktop/work/dpgen/test3/temp"
        },
    "resources": {
        "numb_gpu": 0,
        "numb_node" :1 ,
        "task_per_node": 2,
        "partition": "cpu",
        "exclude_list": [],
        "mem_limit": 8,
        "source_list": [],
        "module_list": []
        }
    }
],
"model_devi": [
    {
    "machine": {
        "batch": "shell",
        "work_path": "/home/ben/desktop/work/dpgen/test3/temp"
    },
    "resources": {
        "numb_gpu": 0,
        "task_per_node": 4,
        "partition": "cpu",
        "exclude_list": [],
        "mem_limit": 8,
        "source_list": [],
        "module_list": []
        },
    "command": " ~/desktop/1/lammps/src/lmp_mpi",
    "group_size": 1
    }
],
"fp": [
    {
    "machine": {
        "batch": "shell",
        "work_path": "/home/ben/desktop/work/dpgen/test3/temp"
    },
    "resources": {
        "numb_gpu": 0,
        "task_per_node": 4,
        "with_mpi": false,
        "source_list": ["/home/ben/intel/parallel_studio_xe_2019.5.075/psxevars.sh"],
        "module_list": [],
        "partition": "cpu",
        "_envs" : {"PATH" : "/root/vasp/bin:$PATH"}
    },
    "command": "ulimit -s unlimited && mpirun -n 4 /home/ben/desktop/1/vasp.5.4.4/bin/vasp",
    "group_size": 30
    }
    ]
}
343333333 commented 3 years ago

Oh, I figured it out: the data set was too small. I used a bigger one, and it has worked well so far.
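For anyone who hits the same wall with a tiny init data set: the DataSystem summary above reports n_test 2 out of only 3 frames, i.e. deepmd-kit reserves numb_test frames of each system for testing, leaving almost nothing to train on. Besides enlarging the data set, a possible workaround is lowering numb_test in the training section of default_training_param (field name from the deepmd-kit 1.x input format):

"training": {
    "numb_test": 1
}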

AnguseZhang commented 3 years ago

This issue is solved, so I've closed it. If there is still any problem, you can reopen this issue or create a new one.