Closed: tfcao888666 closed this issue 1 year ago.
Create a folder named work in /scratch/04587/tfcao/ch4-large, change work_path in machine.json to "/scratch/04587/tfcao/ch4-large/work", and try again.
Hi Yuzhi, thank you! The error is still there. Here is machine.json:

```json
{
  "train": [
    {
      "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work"},
      "resources": {
        "numb_node": 1, "task_per_node": 64, "partition": "CPU",
        "exclude_list": [], "source_list": [], "module_list": [],
        "time_limit": "1:0:0", "qos": "data"
      },
      "python_path": "/home1/04587/tfcao/miniconda3/bin/python"
    }
  ],
  "model_devi": [
    {
      "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work"},
      "resources": {
        "numb_node": 1, "task_per_node": 64, "partition": "development",
        "exclude_list": [], "source_list": [], "module_list": [],
        "time_limit": "0:10:0", "qos": "data"
      },
      "command": "ibrun tacc_affinity /home1/04587/tfcao/miniconda3/bin/lmp",
      "group_size": 1
    }
  ],
  "fp": [
    {
      "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work"},
      "resources": {
        "numb_node": 1, "task_per_node": 64, "exclude_list": [],
        "with_mpi": false, "source_list": [], "module_list": [],
        "time_limit": "0:10:0", "partition": "development",
        "_comment": "that's All"
      },
      "command": "ibrun tacc_affinity /home1/04587/tfcao/vasp_bin/regular/vasp",
      "group_size": 1
    }
  ]
}
```
Is `work` empty? Is there a `record.machine` file? If yes, delete it and try again.
Yuzhi
Hi Yuzhi,
The error is still there:

```
login2.stampede2(1059)$ INFO:dpgen:-------------------------iter.000000 task 01--------------------------
Traceback (most recent call last):
  File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in
[2]-  Exit 1    dpgen run param.json machine.json > log.out
```

Sorry to bother you again and again.
Feel free; you're welcome to report issues. I've found the problem, and thanks for reporting.
Change the absolute path in "init_data_sys": [ "/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd" ] to the relative path "ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd" (given that "/scratch/04587/tfcao/ch4-large" is your prefix).
Do the same for sys_configs.
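The conversion is just stripping the prefix; a minimal Python check, using paths copied from this thread, is:

```python
import os.path

# Paths taken from the configs posted above.
prefix = "/scratch/04587/tfcao/ch4-large"
abs_sys = "/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd"

# dpgen joins init_data_prefix with each init_data_sys entry, so the
# entry itself should be written relative to the prefix.
rel_sys = os.path.relpath(abs_sys, prefix)
print(rel_sys)  # ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd
```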
Yuzhi
I changed the absolute paths, but the same kind of error comes back:
```
(base) login4.stampede2(1026)$ dpgen run param.json machine > log.out
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
Traceback (most recent call last):
  File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in
```
Here is the parameter file:

```json
{
  "type_map": ["C", "H"],
  "mass_map": [12.0, 1.0],
  "_comment": "initial data set for Training and the number of frames in each training batch",
  "init_data_prefix": "/scratch/04587/tfcao/ch4-large",
  "init_data_sys": ["ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd"],
  "init_batch_size": [8],
  "sys_configs_prefix": "/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd",
  "sys_configs": [
    ["ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00000/POSCAR"],
    ["ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00001/POSCAR"]
  ],
  "sys_batch_size": [8, 8],
  "_comment": " 00.train ",
  "numb_models": 4,
  "default_training_param": {
    "model": {
      "type_map": ["C", "H"],
      "descriptor": {
        "type": "se_a", "sel": [16, 4], "rcut_smth": 0.5, "rcut": 5.0,
        "neuron": [10, 20, 40], "resnet_dt": false, "axis_neuron": 12, "seed": 1
      },
      "fitting_net": {
        "neuron": [120, 120, 120], "resnet_dt": true, "coord_norm": true,
        "type_fitting_net": false, "seed": 1
      }
    },
    "loss": {
      "start_pref_e": 0.02, "limit_pref_e": 2, "start_pref_f": 1000,
      "limit_pref_f": 1, "start_pref_v": 0, "limit_pref_v": 0
    },
    "learning_rate": {"type": "exp", "start_lr": 0.001, "decay_steps": 10, "decay_rate": 0.95},
    "training": {
      "systems": [], "set_prefix": "set", "stop_batch": 5000, "batch_size": 1, "seed": 1,
      "_comment": "frequencies counted in batch",
      "disp_file": "lcurve.out", "disp_freq": 1000, "numb_test": 4,
      "save_freq": 1000, "save_ckpt": "model.ckpt", "load_ckpt": "model.ckpt",
      "disp_training": true, "time_training": true,
      "profiling": false, "profiling_file": "timeline.json"
    }
  },
  "_comment": " 01.model_devi ",
  "model_devi_dt": 0.002,
  "model_devi_skip": 0,
  "model_devi_f_trust_lo": 0.01,
  "model_devi_f_trust_hi": 0.3,
  "model_devi_clean_traj": false,
  "model_devi_jobs": [
    {"sys_idx": [0], "temps": [50], "press": [1], "trj_freq": 10, "nsteps": 1000, "ensemble": "nvt", "_idx": "00"},
    {"sys_idx": [1], "temps": [50], "press": [1], "trj_freq": 10, "nsteps": 3000, "ensemble": "nvt", "_idx": "01"}
  ],
  "_comment": " 02.fp ",
  "fp_style": "vasp",
  "shuffle_poscar": false,
  "fp_task_max": 25,
  "fp_task_min": 8,
  "fp_pp_path": "/scratch/04587/tfcao/ch4-large/work",
  "fp_pp_files": ["/scratch/04587/tfcao/ch4-large/POT_C", "/scratch/04587/tfcao/ch4-large/POT_H"],
  "fp_incar": "/scratch/04587/tfcao/ch4-large/INCAR_methane"
}
```

Here is the machine file:

```json
{
  "train": [
    {
      "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work"},
      "resources": {
        "numb_node": 1, "task_per_node": 64, "partition": "CPU",
        "exclude_list": [], "source_list": [], "module_list": [],
        "time_limit": "1:0:0", "qos": "data"
      },
      "python_path": "/home1/04587/tfcao/miniconda3/bin/python"
    }
  ],
  "model_devi": [
    {
      "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work"},
      "resources": {
        "numb_node": 1, "task_per_node": 64, "partition": "normal",
        "exclude_list": [], "source_list": [], "module_list": [],
        "time_limit": "0:10:0", "qos": "data"
      },
      "command": "ibrun tacc_affinity /home1/04587/tfcao/miniconda3/bin/lmp",
      "group_size": 1
    }
  ],
  "fp": [
    {
      "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work"},
      "resources": {
        "numb_node": 1, "task_per_node": 64, "exclude_list": [],
        "with_mpi": false, "source_list": [], "module_list": [],
        "time_limit": "0:10:0", "partition": "normal",
        "_comment": "that's All"
      },
      "command": "ibrun tacc_affinity /home1/04587/tfcao/vasp_bin/regular/vasp",
      "group_size": 1
    }
  ]
}
```
Then you should follow the suggestions in #438 and see what happens.
It seems things are proceeding now; your Slurm settings have a problem. Execute

cd /scratch/04587/tfcao/ch4-large/ini/06051a5b-3952-4154-b7d1-3ecb1861a017 && sbatch 06051a5b-3952-4154-b7d1-3ecb1861a017.sub

on the command line and check the error log.
Hi Yuzhi, thank you for the response. I have fixed all the errors; they were related to the settings of our cluster. Best, Tengfei
Hi all, I have finished the ini step and tried to train my model with dpgen run param.json machine, and encountered this error:
```
Traceback (most recent call last):
  File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in
    sys.exit(main())
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
    args.func(args)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 2410, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 2373, in run_iter
    run_train (ii, jdata, mdata)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 529, in run_train
    dispatcher.run_jobs(mdata['train_resources'],
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs
    job_handler = self.submit_jobs(resources,
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 140, in submit_jobs
    rjob['context'].upload('.',
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 82, in upload
    os.remove(os.path.join(remote_job, jj))
IsADirectoryError: [Errno 21] Is a directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd/set.000'
```
Here is the parameter file:

```json
{
  "type_map": ["C", "H"],
  "mass_map": [12.0, 1.0],
  "_comment": "initial data set for Training and the number of frames in each training batch",
  "init_data_prefix": "/scratch/04587/tfcao/ch4-large",
  "init_data_sys": ["/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd"],
  "init_batch_size": [8],
  "sys_configs_prefix": "/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd",
  "sys_configs": [
    ["/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00000/POSCAR"],
    ["/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00001/POSCAR"]
  ],
  "sys_batch_size": [8, 8],
  "_comment": " 00.train ",
  "numb_models": 4,
  "default_training_param": {
    "model": {
      "type_map": ["C", "H"],
      "descriptor": {
        "type": "se_a", "sel": [16, 4], "rcut_smth": 0.5, "rcut": 5.0,
        "neuron": [10, 20, 40], "resnet_dt": false, "axis_neuron": 12, "seed": 1
      },
      "fitting_net": {
        "neuron": [120, 120, 120], "resnet_dt": true, "coord_norm": true,
        "type_fitting_net": false, "seed": 1
      }
    },
    "loss": {
      "start_pref_e": 0.02, "limit_pref_e": 2, "start_pref_f": 1000,
      "limit_pref_f": 1, "start_pref_v": 0, "limit_pref_v": 0
    },
    "learning_rate": {"type": "exp", "start_lr": 0.001, "decay_steps": 10, "decay_rate": 0.95},
    "training": {
      "systems": [], "set_prefix": "set", "stop_batch": 5000, "batch_size": 1, "seed": 1,
      "_comment": "frequencies counted in batch",
      "disp_file": "lcurve.out", "disp_freq": 1000, "numb_test": 4,
      "save_freq": 1000, "save_ckpt": "model.ckpt", "load_ckpt": "model.ckpt",
      "disp_training": true, "time_training": true,
      "profiling": false, "profiling_file": "timeline.json"
    }
  },
  "_comment": " 01.model_devi ",
  "model_devi_dt": 0.002,
  "model_devi_skip": 0,
  "model_devi_f_trust_lo": 0.01,
  "model_devi_f_trust_hi": 0.3,
  "model_devi_clean_traj": false,
  "model_devi_jobs": [
    {"sys_idx": [0], "temps": [50], "press": [1], "trj_freq": 10, "nsteps": 1000, "ensemble": "nvt", "_idx": "00"},
    {"sys_idx": [1], "temps": [50], "press": [1], "trj_freq": 10, "nsteps": 3000, "ensemble": "nvt", "_idx": "01"}
  ],
  "_comment": " 02.fp ",
  "fp_style": "vasp",
  "shuffle_poscar": false,
  "fp_task_max": 25,
  "fp_task_min": 8,
  "fp_pp_path": "/scratch/04587/tfcao/ch4-large",
  "fp_pp_files": ["/scratch/04587/tfcao/ch4-large/POT_C", "/scratch/04587/tfcao/ch4-large/POT_H"],
  "fp_incar": "/scratch/04587/tfcao/ch4-large/INCAR_methane"
}
```
Here is the machine file:

```json
{
  "train": [
    {
      "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large"},
      "resources": {
        "numb_node": 1, "task_per_node": 64, "partition": "CPU",
        "exclude_list": [], "source_list": [], "module_list": [],
        "time_limit": "1:0:0", "qos": "data"
      },
      "python_path": "/home1/04587/tfcao/miniconda3/bin/python"
    }
  ],
  "model_devi": [
    {
      "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large"},
      "resources": {
        "numb_node": 1, "task_per_node": 64, "partition": "development",
        "exclude_list": [], "source_list": [], "module_list": [],
        "time_limit": "0:10:0", "qos": "data"
      },
      "command": "ibrun tacc_affinity /home1/04587/tfcao/miniconda3/bin/lmp",
      "group_size": 1
    }
  ],
  "fp": [
    {
      "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large"},
      "resources": {
        "numb_node": 1, "task_per_node": 64, "exclude_list": [],
        "with_mpi": false, "source_list": [], "module_list": [],
        "time_limit": "0:10:0", "partition": "development",
        "_comment": "that's All"
      },
      "command": "ibrun tacc_affinity /home1/04587/tfcao/vasp_bin/regular/vasp",
      "group_size": 1
    }
  ]
}
```
Could you tell me the possible reasons? Thank you!
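As a general note for input files like these: loading them with Python's standard json module catches syntax slips (stray quotes, missing braces) before dpgen ever runs. A minimal sketch, assuming the files are named param.json and machine.json in the current directory:

```python
import json

# Hypothetical file names matching the dpgen invocation in this thread.
for fname in ("param.json", "machine.json"):
    try:
        with open(fname) as fp:
            json.load(fp)
        print(fname, "OK")
    except FileNotFoundError:
        print(fname, "not found (run this next to your input files)")
    except json.JSONDecodeError as err:
        # Reports the exact line and column of the syntax error.
        print(fname, "invalid:", err)
```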