deepmodeling / dpgen

The deep potential generator: generates deep-learning based models of the interatomic potential energy and force field
https://docs.deepmodeling.com/projects/dpgen/
GNU Lesser General Public License v3.0

dpgen run encounters an error #448

Closed: tfcao888666 closed this issue 1 year ago

tfcao888666 commented 3 years ago

Hi All, I have finished the initialization step and tried to train my model with dpgen run param.json machine, but I encountered this error:

Traceback (most recent call last):
  File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
    args.func(args)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 2410, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 2373, in run_iter
    run_train (ii, jdata, mdata)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 529, in run_train
    dispatcher.run_jobs(mdata['train_resources'],
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs
    job_handler = self.submit_jobs(resources,
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 140, in submit_jobs
    rjob['context'].upload('.',
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 82, in upload
    os.remove(os.path.join(remote_job, jj))
IsADirectoryError: [Errno 21] Is a directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd/set.000'

Here is the parameter file:

{
    "type_map": ["C", "H"],
    "mass_map": [12.0, 1.0],
    "_comment": "initial data set for Training and the number of frames in each training batch",
    "init_data_prefix": "/scratch/04587/tfcao/ch4-large",
    "init_data_sys": ["/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd"],
    "init_batch_size": [8],
    "sys_configs_prefix": "/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd",
    "sys_configs": [
        ["/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00000/POSCAR"],
        ["/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00001/POSCAR"]
    ],
    "sys_batch_size": [8, 8],
    "_comment": " 00.train ",
    "numb_models": 4,
    "default_training_param": {
        "model": {
            "type_map": ["C", "H"],
            "descriptor": {"type": "se_a", "sel": [16, 4], "rcut_smth": 0.5, "rcut": 5.0, "neuron": [10, 20, 40], "resnet_dt": false, "axis_neuron": 12, "seed": 1},
            "fitting_net": {"neuron": [120, 120, 120], "resnet_dt": true, "coord_norm": true, "type_fitting_net": false, "seed": 1}
        },
        "loss": {"start_pref_e": 0.02, "limit_pref_e": 2, "start_pref_f": 1000, "limit_pref_f": 1, "start_pref_v": 0, "limit_pref_v": 0},
        "learning_rate": {"type": "exp", "start_lr": 0.001, "decay_steps": 10, "decay_rate": 0.95},
        "training": {"systems": [], "set_prefix": "set", "stop_batch": 5000, "batch_size": 1, "seed": 1, "_comment": "frequencies counted in batch", "disp_file": "lcurve.out", "disp_freq": 1000, "numb_test": 4, "save_freq": 1000, "save_ckpt": "model.ckpt", "load_ckpt": "model.ckpt", "disp_training": true, "time_training": true, "profiling": false, "profiling_file": "timeline.json"}
    },
    "_comment": " 01.model_devi ",
    "model_devi_dt": 0.002,
    "model_devi_skip": 0,
    "model_devi_f_trust_lo": 0.01,
    "model_devi_f_trust_hi": 0.3,
    "model_devi_clean_traj": false,
    "model_devi_jobs": [
        {"sys_idx": [0], "temps": [50], "press": [1], "trj_freq": 10, "nsteps": 1000, "ensemble": "nvt", "_idx": "00"},
        {"sys_idx": [1], "temps": [50], "press": [1], "trj_freq": 10, "nsteps": 3000, "ensemble": "nvt", "_idx": "01"}
    ],
    "_comment": " 02.fp ",
    "fp_style": "vasp",
    "shuffle_poscar": false,
    "fp_task_max": 25,
    "fp_task_min": 8,
    "fp_pp_path": "/scratch/04587/tfcao/ch4-large",
    "fp_pp_files": ["/scratch/04587/tfcao/ch4-large/POT_C", "/scratch/04587/tfcao/ch4-large/POT_H"],
    "fp_incar": "/scratch/04587/tfcao/ch4-large/INCAR_methane"
}

Here is the machine file:

" "train": [ { "machine": { "batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large" }, "resources": { "numb_node": 1, "task_per_node": 64, "partition": "CPU", "exclude_list": [], "source_list": [], "module_list": [], "time_limit": "1:0:0", "qos": "data" }, "python_path": "/home1/04587/tfcao/miniconda3/bin/python" } ], "model_devi": [ { "machine": { "batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large" }, "resources": { "numb_node": 1, "task_per_node": 64, "partition": "development", "exclude_list": [], "source_list": [], "module_list": [], "time_limit": "0:10:0", "qos": "data" }, "command": "ibrun tacc_affinity /home1/04587/tfcao/miniconda3/bin/lmp", "group_size": 1 } ], "fp": [ { "machine": { "batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large" }, "resources": { "numb_node": 1, "task_per_node": 64, "exclude_list": [], "with_mpi": false, "source_list": [], "module_list": [], "time_limit": "0:10:0", "partition": "development", "_comment": "that's All" }, "command": "ibrun tacc_affinity /home1/04587/tfcao/vasp_bin/regular/vasp", "group_size": 1 } ] } "

Could you tell me the possible reasons? Thank you!

Before asking questions, you can

search the previous issues or discussions, or check the manual.

Please do not post requests for help (e.g. with installing or using dpgen) here. Instead go to discussions.

This issue tracker is for tracking dpgen development related issues only.

Thanks for your cooperation.

AnguseZhang commented 3 years ago

Create a folder named work in /scratch/04587/tfcao/ch4-large, change work_path in machine.json to "/scratch/04587/tfcao/ch4-large/work", and try again.
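Something like this sketch of only the part that changes (the resources, command, and python_path entries stay as in your current machine file, and the same work_path edit applies to the model_devi and fp sections):

{
    "train": [
        {
            "machine": {
                "batch": "slurm",
                "work_path": "/scratch/04587/tfcao/ch4-large/work"
            },
            "_comment": "resources and python_path stay unchanged from the original machine file"
        }
    ]
}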

tfcao888666 commented 3 years ago

Hi Yuzhi, thank you! The error is still there. Here is the machine file I am using now:

{
    "train": [
        {
            "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work"},
            "resources": {"numb_node": 1, "task_per_node": 64, "partition": "CPU", "exclude_list": [], "source_list": [], "module_list": [], "time_limit": "1:0:0", "qos": "data"},
            "python_path": "/home1/04587/tfcao/miniconda3/bin/python"
        }
    ],
    "model_devi": [
        {
            "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work"},
            "resources": {"numb_node": 1, "task_per_node": 64, "partition": "development", "exclude_list": [], "source_list": [], "module_list": [], "time_limit": "0:10:0", "qos": "data"},
            "command": "ibrun tacc_affinity /home1/04587/tfcao/miniconda3/bin/lmp",
            "group_size": 1
        }
    ],
    "fp": [
        {
            "machine": {"batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work"},
            "resources": {"numb_node": 1, "task_per_node": 64, "exclude_list": [], "with_mpi": false, "source_list": [], "module_list": [], "time_limit": "0:10:0", "partition": "development", "_comment": "that's All"},
            "command": "ibrun tacc_affinity /home1/04587/tfcao/vasp_bin/regular/vasp",
            "group_size": 1
        }
    ]
}


AnguseZhang commented 3 years ago
  1. Make sure the folder work is empty.
  2. Is there a file called record.machine? If yes, delete it and try again.
  3. If there is still an error, please provide the complete error log.

Yuzhi

tfcao888666 commented 3 years ago

Hi Yuzhi, the error is still there:

login2.stampede2(1059)$
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
Traceback (most recent call last):
  File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
    args.func(args)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 2410, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 2373, in run_iter
    run_train (ii, jdata, mdata)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 529, in run_train
    dispatcher.run_jobs(mdata['train_resources'],
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs
    job_handler = self.submit_jobs(resources,
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 140, in submit_jobs
    rjob['context'].upload('.',
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 82, in upload
    os.remove(os.path.join(remote_job, jj))
IsADirectoryError: [Errno 21] Is a directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd/set.000'

[2]- Exit 1    dpgen run param.json machine.json > log.out

Sorry to bother you again and again.

AnguseZhang commented 3 years ago

Feel free. You're welcome to report issues. I've found the problem; thanks for reporting.

Change the absolute path in "init_data_sys": ["/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd"] to the relative path "ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd" (given that "/scratch/04587/tfcao/ch4-large" is your current path).

Do the same for sys_configs, as in the sketch below.
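A sketch of only the affected keys (everything else in param.json stays as posted above), assuming dpgen is run from /scratch/04587/tfcao/ch4-large:

{
    "init_data_prefix": "/scratch/04587/tfcao/ch4-large",
    "init_data_sys": ["ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd"],
    "sys_configs": [
        ["ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00000/POSCAR"],
        ["ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00001/POSCAR"]
    ]
}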

Yuzhi

tfcao888666 commented 3 years ago

I changed to relative paths, but this kind of error comes back again:

(base) login4.stampede2(1026)$ dpgen run param.json machine > log.out
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
Traceback (most recent call last):
  File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
    args.func(args)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 2410, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 2373, in run_iter
    run_train (ii, jdata, mdata)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/generator/run.py", line 529, in run_train
    dispatcher.run_jobs(mdata['train_resources'],
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs
    job_handler = self.submit_jobs(resources,
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs
    rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit
    self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog)
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit
    stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name))
  File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall
    raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid))
RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/work/4f5d3894-94ed-4e12-8fd5-c6e6a71f46bc && sbatch 4f5d3894-94ed-4e12-8fd5-c6e6a71f46bc.sub', '4f5d3894-94ed-4e12-8fd5-c6e6a71f46bc'))

Here is the parameter file:

{
    "type_map": ["C", "H"],
    "mass_map": [12.0, 1.0],
    "_comment": "initial data set for Training and the number of frames in each training batch",
    "init_data_prefix": "/scratch/04587/tfcao/ch4-large",
    "init_data_sys": ["ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd"],
    "init_batch_size": [8],
    "sys_configs_prefix": "/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/02.md/sys-0002-0008/deepmd",
    "sys_configs": [
        ["ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00000/POSCAR"],
        ["ini/POSCAR.01x01x01/01.scale_pert/sys-0002-0008/scale/00001/POSCAR"]
    ],
    "sys_batch_size": [8, 8],
    "_comment": " 00.train ",
    "numb_models": 4,
    "default_training_param": {
        "model": {
            "type_map": ["C", "H"],
            "descriptor": {"type": "se_a", "sel": [16, 4], "rcut_smth": 0.5, "rcut": 5.0, "neuron": [10, 20, 40], "resnet_dt": false, "axis_neuron": 12, "seed": 1},
            "fitting_net": {"neuron": [120, 120, 120], "resnet_dt": true, "coord_norm": true, "type_fitting_net": false, "seed": 1}
        },
        "loss": {"start_pref_e": 0.02, "limit_pref_e": 2, "start_pref_f": 1000, "limit_pref_f": 1, "start_pref_v": 0, "limit_pref_v": 0},
        "learning_rate": {"type": "exp", "start_lr": 0.001, "decay_steps": 10, "decay_rate": 0.95},
        "training": {"systems": [], "set_prefix": "set", "stop_batch": 5000, "batch_size": 1, "seed": 1, "_comment": "frequencies counted in batch", "disp_file": "lcurve.out", "disp_freq": 1000, "numb_test": 4, "save_freq": 1000, "save_ckpt": "model.ckpt", "load_ckpt": "model.ckpt", "disp_training": true, "time_training": true, "profiling": false, "profiling_file": "timeline.json"}
    },
    "_comment": " 01.model_devi ",
    "model_devi_dt": 0.002,
    "model_devi_skip": 0,
    "model_devi_f_trust_lo": 0.01,
    "model_devi_f_trust_hi": 0.3,
    "model_devi_clean_traj": false,
    "model_devi_jobs": [
        {"sys_idx": [0], "temps": [50], "press": [1], "trj_freq": 10, "nsteps": 1000, "ensemble": "nvt", "_idx": "00"},
        {"sys_idx": [1], "temps": [50], "press": [1], "trj_freq": 10, "nsteps": 3000, "ensemble": "nvt", "_idx": "01"}
    ],
    "_comment": " 02.fp ",
    "fp_style": "vasp",
    "shuffle_poscar": false,
    "fp_task_max": 25,
    "fp_task_min": 8,
    "fp_pp_path": "/scratch/04587/tfcao/ch4-large/work",
    "fp_pp_files": ["/scratch/04587/tfcao/ch4-large/POT_C", "/scratch/04587/tfcao/ch4-large/POT_H"],
    "fp_incar": "/scratch/04587/tfcao/ch4-large/INCAR_methane"
}

Here is the machine file:

"train": [ { "machine": { "batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work" }, "resources": { "numb_node": 1, "task_per_node": 64, "partition": "CPU", "exclude_list": [], "source_list": [], "module_list": [], "time_limit": "1:0:0", "qos": "data" }, "python_path": "/home1/04587/tfcao/miniconda3/bin/python" } ], "model_devi": [ { "machine": { "batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work" }, "resources": { "numb_node": 1, "task_per_node": 64, "partition": "normal", "exclude_list": [], "source_list": [], "module_list": [], "time_limit": "0:10:0", "qos": "data" }, "command": "ibrun tacc_affinity /home1/04587/tfcao/miniconda3/bin/lmp", "group_size": 1 } ], "fp": [ { "machine": { "batch": "slurm", "work_path": "/scratch/04587/tfcao/ch4-large/work" }, "resources": { "numb_node": 1, "task_per_node": 64, "exclude_list": [], "with_mpi": false, "source_list": [], "module_list": [], "time_limit": "0:10:0", "partition": "normal", "_comment": "that's All" }, "command": "ibrun tacc_affinity /home1/04587/tfcao/vasp_bin/regular/vasp", "group_size": 1 } ] }

AnguseZhang commented 3 years ago

Then you should follow the suggestions in #438 and see what happens.

It seems things are proceeding now, but your Slurm settings have a problem. Execute cd /scratch/04587/tfcao/ch4-large/ini/06051a5b-3952-4154-b7d1-3ecb1861a017 && sbatch 06051a5b-3952-4154-b7d1-3ecb1861a017.sub on the command line and check the error log.

tfcao888666 commented 3 years ago

Hi Yuzhi, thank you for the response. I have fixed all the errors; they were related to the settings of our cluster. Best, Tengfei
