deepmodeling / dpgen

The deep potential generator to generate a deep-learning based model of interatomic potential energy and force field
https://docs.deepmodeling.com/projects/dpgen/
GNU Lesser General Public License v3.0

Ssh connection fails, change to local context #438

Closed · tfcao888666 closed this issue 3 years ago

tfcao888666 commented 3 years ago

I have successfully installed dpgen, but I encountered this issue when I run my jobs. Any suggestions on it? Thanks!

    Traceback (most recent call last):
      File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in <module>
        sys.exit(main())
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
        args.func(args)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 710, in gen_init_bulk
        make_vasp_relax(jdata, mdata)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 322, in make_vasp_relax
        with open(fname) as infile:
    IsADirectoryError: [Errno 21] Is a directory: '/'

AnguseZhang commented 3 years ago

Please provide the necessary information, including the software version and installation method, input files, running commands, error log, etc., AS DETAILED AS POSSIBLE, to help locate and reproduce your problem. What's your param.json for init_bulk? There seems to be a problem with the directory of POTCAR.

tfcao888666 commented 3 years ago

Here is my machine.json:

{
"python_path":      "~/miniconda3/bin/python",
"train_machine":    {
    "machine_type": "slurm",
    "hostname" :    "stampede2.tacc.xsede.org",
    "port" :        22,
    "username":     "tfcao",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"train_resources":  {
    "numb_node":    1,
    "task_per_node":48,
    "partition" :   "skx-normal",
    "exclude_list" : [],
    "module_list":  [ ],
    "source_list":  ["~/miniconda3/bin/activate" ],
    "time_limit":   "2:00:0",
    "_comment":     "that's all"
},

"lmp_command":      "lmp",
"model_devi_group_size":    1,
"_comment":         "model_devi on localhost",
"model_devi_machine":       {
    "machine_type": "slurm",
    "hostname" :    "stampede2.tacc.xsede.org",
    "port" :        22,
    "username":     "tfcao",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"_comment": " if use GPU, numb_nodes(nn) should always be 1 ",
"_comment": " if numb_nodes(nn) = 1 multi-threading rather than mpi is assumed",
"model_devi_resources":     {
    "numb_node":    1,
    "task_per_node":48,
    "source_list":  ["~/miniconda3/bin/activate" ],
    "module_list":  [ ],
    "time_limit":   "2:00:0",
    "partition" : "skx-normal",
    "_comment":     "that's all"
},

"_comment":         "fp on localhost ",
"fp_command":       "ibrun /home1/04587/tfcao/vasp_bin/regular/vasp",
"fp_group_size":    1,
"fp_machine":       {
    "machine_type": "slurm",
    "hostname" :    "stampede2.tacc.xsede.org",
    "port" :        22,
    "username":     "tfcao",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"fp_resources":     {
    "numb_node":    1,
    "task_per_node":48,
    "numb_gpu":     0,
    "exclude_list" : [],
    "source_list":  [],
    "module_list":  [],
    "with_mpi" : false,
    "time_limit":   "2:00:0",
    "partition" : "skx-normal",
    "_comment":     "that's all"
},
"_comment":         " that's all "

}

And my param.json for init_bulk:

{
    "stages" :          [1, 2, 3, 4],
    "elements":         ["H","C"],
    "cell_type":        "diamond",
    "latt":             10.0,
    "super_cell":       [1, 1, 1],
    "from_poscar":      true,
    "from_poscar_path": "/scratch/04587/tfcao/ch4-large/ini/POSCAR",
    "potcars":          "/scratch/04587/tfcao/ch4-large/ini/POTCAR",
    "relax_incar":      "/scratch/04587/tfcao/ch4-large/ini/INCAR_rlx",
    "md_incar" :        "/scratch/04587/tfcao/ch4-large/ini/INCAR_md",
    "skip_relax":       false,
    "scale":            [1.00],
    "pert_numb":        20,
    "md_nstep" :        1000,
    "pert_box":         0.03,
    "pert_atom":        0.01,
    "coll_ndata":       5000,
    "type_map" :        ["H","C"],
    "_comment":         "that's all"
}

tfcao888666 commented 3 years ago

This is the script I use to submit the job:

#!/bin/bash
#SBATCH -J dpgen             # Job name
#SBATCH -o 16core_t.o%j      # Name of stdout output file (%j expands to jobId)
#SBATCH -e 16core_t.e%j      # Name of stderr output file (%j expands to jobId)
#SBATCH -p development       # Submit to the 'normal' or 'development' queue
#SBATCH -N 1                 # Total number of nodes requested (16 cores/node)
#SBATCH -n 64                # Total number of mpi tasks requested
#SBATCH -t 2:00:00           # Run time (hh:mm:ss) - 24 hours
#SBATCH -A TG-DMR160007

module load vasp
export FORT_BUFFERED=true
conda activate dpgenDev

ibrun tacc_affinity /home1/apps/intel18/impi18_0/qe/6.3/bin/pw.x o3-scf.out
ibrun tacc_affinity $HOME/.local/bin/dpgen init_bulk param.json machine.json > log.out
ibrun $HOME/.local/bin/dpgen init_bulk param.json machine.json > log.out
ibrun $HOME/.local/bin/dpgen init_bulk param.json machine.json > log.out
ibrun ./vasp_std >& result

tfcao888666 commented 3 years ago

The error changed into:

    natoms_list = [int(ii) for ii in natoms_str.split()]
    ValueError: invalid literal for int() with base 10: 'H'

I did not modify anything. It is strange.

AnguseZhang commented 3 years ago

About IsADirectoryError: [Errno 21] Is a directory: '/' — you might be referring to a very old example. Please see https://github.com/deepmodeling/dpgen/blob/master/examples/init/ch4.json

You should provide a list (either write ["POTCAR_H", "POTCAR_C"], or merge the two POTCARs into one file POTCAR and write ["POTCAR"], instead of "POTCAR"). But you should make sure the order of H and C is right.

So change "potcars": "/scratch/04587/tfcao/ch4-large/ini/POTCAR", to "potcars": ["/scratch/04587/tfcao/ch4-large/ini/POTCAR"], and try again.
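A minimal sketch of the same fix done programmatically (assuming param.json sits in the current directory and uses the key names from this thread):

import json

# Sketch: normalize a string-valued "potcars" into the list form dpgen
# expects, e.g. "POTCAR" -> ["POTCAR"], as in the ch4.json example above.
with open("param.json") as f:
    jdata = json.load(f)

if isinstance(jdata.get("potcars"), str):
    jdata["potcars"] = [jdata["potcars"]]

with open("param.json", "w") as f:
    json.dump(jdata, f, indent=4)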

tfcao888666 commented 3 years ago

File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/paramiko/client.py", line 349, in retry_on_signal(lambda: sock.connect(addr)) TimeoutError: [Errno 110] Connection timed out retry_on_signal(lambda: sock.connect(addr)) TimeoutError: [Errno 110] Connection timed out retry_on_signal(lambda: sock.connect(addr)) TimeoutError: [Errno 110] Connection timed out

tfcao888666 commented 3 years ago

Thank you. I tried and the error changed, but it still cannot run.

AnguseZhang commented 3 years ago

"TimeoutError: [Errno 110] Connection timed out" You settings for machine is not correct. Can you connect via "ssh username@stampede2.tacc.xsede.org" on your current machine?

tfcao888666 commented 3 years ago

Yes, I can connect to it with ssh. Here is all the output information:

    Traceback (most recent call last):
      File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in <module>
        sys.exit(main())
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
        args.func(args)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 701, in gen_init_bulk
        create_path(out_dir)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 50, in create_path
        os.makedirs (path)
      File "/home1/04587/tfcao/miniconda3/lib/python3.9/os.py", line 225, in makedirs
        mkdir(name, mode)
    FileExistsError: [Errno 17] File exists: 'POSCAR.01x01x01/'

(many further tracebacks from concurrent dpgen processes are interleaved in the output; the distinct failures are:)

    File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 236, in make_super_cell_poscar
        natoms_str = lines[6]
    IndexError: list index out of range

    File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 308, in make_vasp_relax
        os.remove(os.path.join(work_dir, 'INCAR' ))
    FileNotFoundError: [Errno 2] No such file or directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/00.place_ele/INCAR'

    File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 310, in make_vasp_relax
        os.remove(os.path.join(work_dir, 'POTCAR'))
    FileNotFoundError: [Errno 2] No such file or directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/00.place_ele/POTCAR'


AnguseZhang commented 3 years ago

What's your execution command? It shouldn't generate so many errors at one time.

tfcao888666 commented 3 years ago

I use this script to submit the job:

#!/bin/bash
#SBATCH -J dpgen             # Job name
#SBATCH -o 16core_t.o%j      # Name of stdout output file (%j expands to jobId)
#SBATCH -e 16core_t.e%j      # Name of stderr output file (%j expands to jobId)
#SBATCH -p development       # Submit to the 'normal' or 'development' queue
#SBATCH -N 1                 # Total number of nodes requested (16 cores/node)
#SBATCH -n 64                # Total number of mpi tasks requested
#SBATCH -t 2:00:00           # Run time (hh:mm:ss) - 24 hours
#SBATCH -A TG-DMR160007

module load vasp
export FORT_BUFFERED=true
conda activate dpgenDev

ibrun tacc_affinity /home1/apps/intel18/impi18_0/qe/6.3/bin/pw.x o3-scf.out
ibrun tacc_affinity $HOME/.local/bin/dpgen init_bulk param.json machine.json > log.out
ibrun $HOME/.local/bin/dpgen init_bulk param.json machine.json > log.out
ibrun tacc_affinity ~/.local/bin/dpgen init_bulk param.json machine.json > log.out
ibrun ./vasp_std >& result


AnguseZhang commented 3 years ago

What's the reason for using "ibrun"? What's its role? Can you directly run "dpgen init_bulk param.json machine.json"?

tfcao888666 commented 3 years ago

ibrun is used to run parallel jobs. It works like mpirun. I have deleted it. The error is as follows:

    Traceback (most recent call last):
      File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in <module>
        sys.exit(main())
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
        args.func(args)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk
        run_vasp_relax(jdata, mdata)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 584, in run_vasp_relax
        dispatcher = make_dispatcher(mdata['fp_machine'], mdata['fp_resources'], work_dir, run_tasks, fp_group_size)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 339, in make_dispatcher
        disp = Dispatcher(mdata, context_type=context_type, batch_type=batch_type)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 48, in __init__
        self.session = SSHSession(remote_profile)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/SSHContext.py", line 22, in __init__
        self._setup_ssh(hostname=self.remote_host,
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/SSHContext.py", line 70, in _setup_ssh
        self.ssh.connect(hostname=hostname, port=port,
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/paramiko/client.py", line 349, in connect
        retry_on_signal(lambda: sock.connect(addr))
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/paramiko/util.py", line 283, in retry_on_signal
        return function()
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/paramiko/client.py", line 349, in <lambda>
        retry_on_signal(lambda: sock.connect(addr))
    TimeoutError: [Errno 110] Connection timed out


AnguseZhang commented 3 years ago

OK. It seems the connection fails. Do you use Slurm on your current machine, or are you connecting to a remote machine? If it is the former, you can specify "localhost" for "hostname" in machine.json and try again.
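A minimal sketch for deciding between the two cases: if sbatch is already on the PATH of the machine running dpgen, that machine is the Slurm cluster itself and "localhost" (or a purely local context) is appropriate. The interpretation of the result is an assumption for illustration, not dpgen logic:

import shutil
import subprocess

# Sketch: probe for a local Slurm installation.
if shutil.which("sbatch") is not None:
    version = subprocess.run(["sbatch", "--version"],
                             capture_output=True, text=True).stdout.strip()
    print("local Slurm found:", version)
else:
    print("no local sbatch; a remote ssh machine profile would be needed")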

tfcao888666 commented 3 years ago

I do not think it is a connection problem. I am wondering, could we talk for a few minutes online? I can show you my questions. I have indeed been stuck there for a long time.

tfcao888666 invites you to a meeting on VooV Meeting.
Meeting time: 2021/06/21 10:49-11:49 (GMT+08:00)

Click the link to join the meeting: https://voovmeeting.com/s/ZPwfUzU6eHox

Meeting ID: 503 044 804

Dial in: +1 4153389272 (United States), +1 3868680985 (United States)

Find your local number: https://voovmeeting.com/mobile/redirect?page=pstn&region=df&lang=en

Thank you! Best, Tengfei


AnguseZhang commented 3 years ago

No. Actually I'm not available until Wednesday. You can post your opinions here.

tfcao888666 commented 3 years ago

Could you have a look and tell me what I should change? I cannot find the improper point.

"deepmd_path":      "~/miniconda3/bin/dp",
"train_machine":    {
    "machine_type": "slurm",
    "hostname" :    "stampede2.tacc.xsede.org",
    "port" :        22,
    "username":     "tfcao",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"train_resources":  {
    "numb_node":    1,
    "task_per_node":64,
    "partition" : "development",
    "exclude_list" : [],
    "source_list":  [ "~/miniconda3/bin/activate" ],
    "module_list":  [ ],
    "time_limit":   "2:00:0",
    "_comment":     "that's all"
},

"lmp_command":      "~/miniconda3/bin/lmp",
"model_devi_group_size":    1,
"_comment":         "model_devi on localhost",
"model_devi_machine":       {
    "machine_type": "slurm",
    "hostname" :    "stampede2.tacc.xsede.org",
    "port" :        22,
    "username":     "tfcao",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"_comment": " if use GPU, numb_nodes(nn) should always be 1 ",
"_comment": " if numb_nodes(nn) = 1 multi-threading rather than mpi is

assumed", "model_devi_resources": { "numb_node": 1, "task_per_node":64, "source_list": ["~/miniconda3/bin/activate" ], "module_list": [ ], "time_limit": "2:00:0", "partition" : "development", "_comment": "that's all" },

"_comment":         "fp on localhost ",
"fp_command":       "ibrun tacc_affinity

/home1/04587/tfcao/vasp_bin/regular/vasp", "fp_group_size": 1, "fp_machine": { "machine_type": "slurm", "hostname" : "stampede2.tacc.xsede.org", "port" : 22, "username": "tfcao", "work_path" : "/scratch/04587/tfcao/ch4-large/ini", "_comment" : "that's all" }, "fp_resources": { "numb_node": 1, "task_per_node":64, "numb_gpu": 0, "exclude_list" : [], "source_list": [], "module_list": [], "with_mpi" : false, "time_limit": "2:00:0", "partition" : "development", "_comment": "that's all" }, "_comment": " that's all " }


AnguseZhang commented 3 years ago

OK. It seems the connection fails. Do you use Slurm on your current machine, or are you connecting to a remote machine? If it is the former, you can specify "localhost" for "hostname" in machine.json and try again.

Did you try this?

tfcao888666 commented 3 years ago

Yes "deepmd_path": "~/miniconda3/bin/dp", "train_machine": { "machine_type": "slurm", "hostname" : "localhost", "port" : 22, "username": "tfcao", "work_path" : "/scratch/04587/tfcao/ch4-large/ini", "_comment" : "that's all" }, "train_resources": { "numb_node": 1, "task_per_node":64, "partition" : "development", "exclude_list" : [], "source_list": [ "~/miniconda3/bin/activate" ], "module_list": [ ], "time_limit": "2:00:0", "_comment": "that's all" },

"lmp_command":      "~/miniconda3/bin/lmp",
"model_devi_group_size":    1,
"_comment":         "model_devi on localhost",
"model_devi_machine":       {
    "machine_type": "slurm",
    "hostname" :    "localhost",
    "port" :        22,
    "username":     "tfcao",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"_comment": " if use GPU, numb_nodes(nn) should always be 1 ",
"_comment": " if numb_nodes(nn) = 1 multi-threading rather than mpi is

assumed", "model_devi_resources": { "numb_node": 1, "task_per_node":64, "source_list": ["~/miniconda3/bin/activate" ], "module_list": [ ], "time_limit": "2:00:0", "partition" : "development", "_comment": "that's all" },

"_comment":         "fp on localhost ",
"fp_command":       "ibrun tacc_affinity

/home1/04587/tfcao/vasp_bin/regular/vasp", "fp_group_size": 1, "fp_machine": { "machine_type": "slurm", "hostname" : "localhost", "port" : 22, "username": "tfcao", "work_path" : "/scratch/04587/tfcao/ch4-large/ini", "_comment" : "that's all" }, "fp_resources": { "numb_node": 1, "task_per_node":64, "numb_gpu": 0, "exclude_list" : [], "source_list": [], "module_list": [], "with_mpi" : false, "time_limit": "2:00:0", "partition" : "development", "_comment": "that's all" }, "_comment": " that's all " } the error is:

(many tracebacks from concurrent processes are interleaved; the distinct failures are:)

    File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 311, in make_vasp_relax
        shutil.copy2( jdata['relax_incar'],
      File "/home1/04587/tfcao/miniconda3/lib/python3.9/shutil.py", line 436, in copy2
        copystat(src, dst, follow_symlinks=follow_symlinks)
      File "/home1/04587/tfcao/miniconda3/lib/python3.9/shutil.py", line 375, in copystat
        lookup("utime")(dst, ns=(st.st_atime_ns, st.st_mtime_ns),
    FileNotFoundError: [Errno 2] No such file or directory

    File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 308, in make_vasp_relax
        os.remove(os.path.join(work_dir, 'INCAR' ))
    FileNotFoundError: [Errno 2] No such file or directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/00.place_ele/INCAR'

    File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 236, in make_super_cell_poscar
        natoms_str = lines[6]
    IndexError: list index out of range


AnguseZhang commented 3 years ago

Did you use ibrun dpgen again? Please delete it; ibrun launches one copy of dpgen per MPI task, so many processes race on the same working directories, which would explain the many interleaved tracebacks. Otherwise we cannot figure out the reason.

tfcao888666 commented 3 years ago

It seems not related to ibrun; I deleted it, and the error is still there.

{
"deepmd_path":      "~/miniconda3/bin/dp",
"train_machine":    {
    "machine_type": "slurm",
    "hostname" :    "localhost",
    "port" :        22,
    "username":     "tfcao",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"train_resources":  {
    "numb_node":    1,
    "task_per_node":48,
    "partition" :   "skx-dev",
    "exclude_list" : [],
    "source_list":  [ "~/miniconda3/bin/activate" ],
    "module_list":  [ ],
    "time_limit":   "1:00:0",
    "_comment":     "that's all"
},

"lmp_command":      "~/miniconda3/bin/lmp",
"model_devi_group_size":    1,
"_comment":         "model_devi on localhost",
"model_devi_machine":       {
    "machine_type": "slurm",
    "hostname" :    "localhost",
    "port" :        22,
    "username":     "tfcao",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"_comment": " if use GPU, numb_nodes(nn) should always be 1 ",
"_comment": " if numb_nodes(nn) = 1 multi-threading rather than mpi is

assumed", "model_devi_resources": { "numb_node": 1, "task_per_node":48, "source_list": ["~/miniconda3/bin/activate" ], "module_list": [ ], "time_limit": "1:00:0", "partition" : "skx-dev", "_comment": "that's all" },

"_comment":         "fp on localhost ",
"fp_command":       "/home1/04587/tfcao/vasp_bin/regular/vasp",
"fp_group_size":    1,
"fp_machine":       {
    "machine_type": "slurm",
    "hostname" :    "localhost",
    "port" :        22,
    "username":     "tfcao",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"fp_resources":     {
    "numb_node":    1,
    "task_per_node":48,
    "numb_gpu":     0,
    "exclude_list" : [],
    "source_list":  [],
    "module_list":  [],
    "with_mpi" : false,
    "time_limit":   "1:00:0",
    "partition" : "skx-dev",
    "_comment":     "that's all"
},
"_comment":         " that's all "

}

    self._transport.auth_publickey(username, key)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/paramiko/transport.py", line 1580, in auth_publickey
        return self.auth_handler.wait_for_response(my_event)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/paramiko/auth_handler.py", line 250, in wait_for_response
        raise e
    paramiko.ssh_exception.AuthenticationException: Authentication failed.


AnguseZhang commented 3 years ago

Try two methods.

  1. When you execute ssh username@localhost, do you need to type the password? If yes, run ssh-copy-id username@localhost and type the password. Then execute ssh username@localhost again, and this time you should connect without a password. Then you can try running DP-GEN again.
  2. If this doesn't work, you can delete "hostname", "port" and "username" directly in machine.json (see the sketch below). DP-GEN will then get rid of ssh and use the local OS instead.

Yuzhi
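For method 2, a minimal sketch that strips the ssh-related keys from the machine.json posted in this thread (section and key names are taken from there; that dropping them switches this dpgen version to a local context is the behaviour described above, which the snippet itself does not verify):

import json

# Sketch of method 2: remove "hostname", "port" and "username" so the
# dispatcher runs on the local machine instead of opening an ssh session.
with open("machine.json") as f:
    mdata = json.load(f)

for section in ("train_machine", "model_devi_machine", "fp_machine"):
    for key in ("hostname", "port", "username"):
        mdata[section].pop(key, None)

with open("machine.json", "w") as f:
    json.dump(mdata, f, indent=4)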

tfcao888666 commented 3 years ago

Hi Yuzhi, I deleted it. The error changed into:

    Traceback (most recent call last):
      File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in <module>
        sys.exit(main())
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
        args.func(args)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk
        run_vasp_relax(jdata, mdata)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax
        dispatcher.run_jobs(fp_resources,
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs
        job_handler = self.submit_jobs(resources,
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs
        rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit
        self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit
        stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name))
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall
        raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid))
    RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/06051a5b-3952-4154-b7d1-3ecb1861a017 && sbatch 06051a5b-3952-4154-b7d1-3ecb1861a017.sub', '06051a5b-3952-4154-b7d1-3ecb1861a017'))

My machine.json is now:

{
"deepmd_path":      "~/miniconda3/bin/dp",
"train_machine":    {
    "machine_type": "slurm",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"train_resources":  {
    "numb_node":    1,
    "task_per_node":48,
    "partition" :   "skx-dev",
    "exclude_list" : [],
    "source_list":  [ "~/miniconda3/bin/activate" ],
    "module_list":  [ ],
    "time_limit":   "1:00:0",
    "_comment":     "that's all"
},

"lmp_command":      "~/miniconda3/bin/lmp",
"model_devi_group_size":    1,
"_comment":         "model_devi on localhost",
"model_devi_machine":       {
    "machine_type": "slurm",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"_comment": " if use GPU, numb_nodes(nn) should always be 1 ",
"_comment": " if numb_nodes(nn) = 1 multi-threading rather than mpi is

assumed", "model_devi_resources": { "numb_node": 1, "task_per_node":48, "source_list": ["~/miniconda3/bin/activate" ], "module_list": [ ], "time_limit": "1:00:0", "partition" : "skx-dev", "_comment": "that's all" },

"_comment":         "fp on localhost ",
"fp_command":       "/home1/04587/tfcao/vasp_bin/regular/vasp",
"fp_group_size":    1,
"fp_machine":       {
    "machine_type": "slurm",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"fp_resources":     {
    "numb_node":    1,
    "task_per_node":48,
    "numb_gpu":     0,
    "exclude_list" : [],
    "source_list":  [],
    "module_list":  [],
    "with_mpi" : false,
    "time_limit":   "1:00:0",
    "partition" : "skx-dev",
    "_comment":     "that's all"
},
"_comment":         " that's all "

}


AnguseZhang commented 3 years ago

Seems things are proceeding. Your slurm settings have a problem. Execute cd /scratch/04587/tfcao/ch4-large/ini/06051a5b-3952-4154-b7d1-3ecb1861a017 && sbatch 06051a5b-3952-4154-b7d1-3ecb1861a017.sub on the command line, and see the error log.

tfcao888666 commented 3 years ago

Hi Yuzhi, my slurm setting works. I can run the job by hand; see the following information. But it cannot be submitted by the code, which is strange. Also, there is this error:

    os.remove(os.path.join(work_dir, 'INCAR' ))
    FileNotFoundError: [Errno 2] No such file or directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/00.place_ele/INCAR'

But the INCAR indeed exists there.

cd /scratch/04587/tfcao/ch4-large/ini/e56c7baa-f2ff-448b-b546-adde0ac82b91 && sbatch e56c7baa-f2ff-448b-b546-adde0ac82b91.sub


    Welcome to the Stampede2 Supercomputer

    No reservation for this job
    --> Verifying valid submit host (login4)...OK
    --> Verifying valid jobname...OK
    --> Enforcing max jobs per user...OK
    --> Verifying availability of your home dir (/home1/04587/tfcao)...OK
    --> Verifying availability of your work2 dir (/work2/04587/tfcao/stampede2)...OK
    --> Verifying availability of your scratch dir (/scratch/04587/tfcao)...OK
    --> Verifying valid ssh keys...OK
    --> Verifying access to desired queue (normal)...OK
    --> Verifying job request is within current queue limits...OK
    --> Checking available allocation (TG-DMR160007)...OK
    --> Verifying that quota for filesystem /home1/04587/tfcao is at 79.39% allocated...OK
    --> Verifying that quota for filesystem /work2/04587/tfcao/stampede2 is at 74.87% allocated...OK
    Submitted batch job 7938202


AnguseZhang commented 3 years ago

Please execute this script and tell me the result.

import subprocess as sp
cmd = "cd /scratch/04587/tfcao/ch4-large/ini/e56c7baa-f2ff-448b-b546-adde0ac82b91 && sbatch e56c7baa-f2ff-448b-b546-adde0ac82b91.sub"
proc = sp.Popen(cmd, shell=True, stdout = sp.PIPE, stderr = sp.PIPE)
o, e = proc.communicate()
print("Return code:",proc.returncode)
print("O:", o.decode('utf-8').splitlines()) 
print("E:",e.decode('utf-8').splitlines())

AnguseZhang commented 3 years ago

    os.remove(os.path.join(work_dir, 'INCAR' ))
    FileNotFoundError: [Errno 2] No such file or directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/00.place_ele/INCAR'
    os.remove(os.path.join(work_dir, 'POTCAR'))
    FileNotFoundError: [Errno 2] No such file or directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/00.place_ele/POTCAR'

Your description is quite confusing. For now, I understand there may be a problem when DP-GEN submits slurm scripts, but how does this error occur? Please describe how the error occurs and what the executed command is.

tfcao888666 commented 3 years ago

Hi Yuzhi, I submit the job with this machine.json:

{
"deepmd_path":      "~/miniconda3/bin/dp",
"train_machine":    {
    "batch":        "slurm",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"train_resources":  {
    "numb_node":    1,
    "task_per_node":64,
    "partition" :   "normal",
    "exclude_list" : [],
    "source_list":  [ "~/miniconda3/bin/activate" ],
    "module_list":  [ ],
    "time_limit":   "2:00:0",
    "mem_limit":    32,
    "_comment":     "that's all"
},

"lmp_command":      "~/miniconda3/bin/lmp",
"model_devi_group_size":    1,
"_comment":         "model_devi on localhost",
"model_devi_machine":       {
    "batch": "slurm",
    "work_path" :   "/scratch/04587/tfcao/ch4-large/ini",
    "_comment" :    "that's all"
},
"_comment": " if use GPU, numb_nodes(nn) should always be 1 ",
"_comment": " if numb_nodes(nn) = 1 multi-threading rather than mpi is

assumed", "model_devi_resources": { "numb_node": 1, "task_per_node":64, "source_list": ["~/miniconda3/bin/activate" ], "module_list": [ ], "time_limit": "2:00:0", "mem_limit": 32, "partition" : "normal", "_comment": "that's all" },

"_comment":         "fp on localhost ",
"fp_command":       "ibrun tacc_affinity

/home1/04587/tfcao/vasp_bin/regular/vasp", "fp_group_size": 1, "fp_machine": { "batch": "slurm", "work_path" : "/scratch/04587/tfcao/ch4-large/ini", "_comment" : "that's all" }, "fp_resources": { "numb_node": 1, "task_per_node":64, "numb_gpu": 0, "exclude_list" : [], "source_list": [], "module_list": [], "with_mpi" : false, "time_limit": "2:00:0", "partition" : "normal", "_comment": "that's all" }, "_comment": " that's all " } The job is submitted with "#!/bin/bash

#SBATCH -J dpgen             # Job name
#SBATCH -o 16core_t.o%j      # Name of stdout output file (%j expands to jobId)
#SBATCH -e 16core_t.e%j      # Name of stderr output file (%j expands to jobId)
#SBATCH -p development       # Submit to the 'normal' or 'development' queue
#SBATCH -N 1                 # Total number of nodes requested (16 cores/node)
#SBATCH -n 64                # Total number of mpi tasks requested
#SBATCH -t 2:00:00           # Run time (hh:mm:ss) - 24 hours
#SBATCH -A TG-DMR160007

module load vasp
export FORT_BUFFERED=true
conda activate dpgenDev

ibrun tacc_affinity /home1/apps/intel18/impi18_0/qe/6.3/bin/pw.x o3-scf.out
ibrun tacc_affinity nohup dpgen init_bulk param.json machine.json > log.out
ibrun tacc_affinity nohup dpgen init_bulk param.json machine.json > log.out
ibrun ./vasp_std >& result

It generates a lot of folders:

    0a8d7f28-bd60-41d3-8ca8-29dde2958622  2dfa4287-edd6-4930-8cca-3b07e058d5e0  c3c2d483-e1e8-4626-a5c9-71745bae51da
    120e0ee3-38f1-4283-a2d5-cfbb6dd6c8b8  3504d82d-cf50-411e-a898-e404b9f74e25  d561eb83-0fa1-4dfa-b857-574f24af85e6
    16core_t.e7938648  63cdb969-4efd-4dc2-9821-69979bdc2045  dpgen.log  machine.json  POT_C
    16core_t.o7938648  646ce65f-cffa-477b-8e97-41043a552b0a  fad15291-4f54-48ef-b0de-151bad13b9d3  machine.json-back  POT_H
    1d881101-e8a7-4907-be32-833c8cf01427  a3b903a7-7efe-4bc5-a421-ed76e3eb19f6  INCAR_md  machine.json-back1  submit.sh
    251684b0-82aa-47e9-9ddb-b5b2c5b35e55  b524be4b-3ac7-4f8c-97a5-27263b44fc19  INCAR_methane  INCAR_rlx  INCAR_md
    log.out  POSCAR  POSCAR.01x01x01

But the job in each folder cannot be submitted by the python code. The log information is:

ile "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs sys.exit(main()) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs args.func(args) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk Traceback (most recent call last): File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in args.func(args) job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit args.func(args) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit sys.exit(main()) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 
586, in run_vasp_relax self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall dispatcher.run_jobs(fp_resources, dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs args.func(args) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File 
"/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd 
%s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/63cdb969-4efd-4dc2-9821-69979bdc2045 && sbatch 63cdb969-4efd-4dc2-9821-69979bdc2045.sub', '63cdb969-4efd-4dc2-9821-69979bdc2045')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/fad15291-4f54-48ef-b0de-151bad13b9d3 && sbatch fad15291-4f54-48ef-b0de-151bad13b9d3.sub', 'fad15291-4f54-48ef-b0de-151bad13b9d3')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/646ce65f-cffa-477b-8e97-41043a552b0a && sbatch 646ce65f-cffa-477b-8e97-41043a552b0a.sub', '646ce65f-cffa-477b-8e97-41043a552b0a')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/2dfa4287-edd6-4930-8cca-3b07e058d5e0 && sbatch 2dfa4287-edd6-4930-8cca-3b07e058d5e0.sub', '2dfa4287-edd6-4930-8cca-3b07e058d5e0')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/0a8d7f28-bd60-41d3-8ca8-29dde2958622 && sbatch 0a8d7f28-bd60-41d3-8ca8-29dde2958622.sub', '0a8d7f28-bd60-41d3-8ca8-29dde2958622')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/c3c2d483-e1e8-4626-a5c9-71745bae51da && sbatch c3c2d483-e1e8-4626-a5c9-71745bae51da.sub', 'c3c2d483-e1e8-4626-a5c9-71745bae51da')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/a3b903a7-7efe-4bc5-a421-ed76e3eb19f6 && sbatch a3b903a7-7efe-4bc5-a421-ed76e3eb19f6.sub', 'a3b903a7-7efe-4bc5-a421-ed76e3eb19f6')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/b524be4b-3ac7-4f8c-97a5-27263b44fc19 && sbatch b524be4b-3ac7-4f8c-97a5-27263b44fc19.sub', 'b524be4b-3ac7-4f8c-97a5-27263b44fc19')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/d561eb83-0fa1-4dfa-b857-574f24af85e6 && sbatch d561eb83-0fa1-4dfa-b857-574f24af85e6.sub', 
'd561eb83-0fa1-4dfa-b857-574f24af85e6')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/1d881101-e8a7-4907-be32-833c8cf01427 && sbatch 1d881101-e8a7-4907-be32-833c8cf01427.sub', '1d881101-e8a7-4907-be32-833c8cf01427')) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/120e0ee3-38f1-4283-a2d5-cfbb6dd6c8b8 && sbatch 120e0ee3-38f1-4283-a2d5-cfbb6dd6c8b8.sub', '120e0ee3-38f1-4283-a2d5-cfbb6dd6c8b8')) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/3504d82d-cf50-411e-a898-e404b9f74e25 && sbatch 3504d82d-cf50-411e-a898-e404b9f74e25.sub', '3504d82d-cf50-411e-a898-e404b9f74e25')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/251684b0-82aa-47e9-9ddb-b5b2c5b35e55 && sbatch 251684b0-82aa-47e9-9ddb-b5b2c5b35e55.sub', '251684b0-82aa-47e9-9ddb-b5b2c5b35e55'))


tfcao888666 commented 3 years ago

It also has some error information in the log file:

    os.remove(os.path.join(work_dir, 'INCAR'))
    FileNotFoundError: [Errno 2] No such file or directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/00.place_ele/INCAR'
    os.remove(os.path.join(work_dir, 'POTCAR'))
    FileNotFoundError: [Errno 2] No such file or directory: '/scratch/04587/tfcao/ch4-large/ini/POSCAR.01x01x01/00.place_ele/POTCAR'
    natoms_str = lines[6]
    IndexError: list index out of range

(each of these lines is repeated several times in the log)

But all these files exist in the 00.place_ele folder. It is strange that it prints such errors.

On Mon, Jun 21, 2021 at 11:02 PM Tengfei Cao wrote:

Hi Yuzi, I submit the job with this machine.json:

    {
        "deepmd_path": "~/miniconda3/bin/dp",
        "train_machine": {
            "batch": "slurm",
            "work_path": "/scratch/04587/tfcao/ch4-large/ini",
            "_comment": "that's all"
        },
        "train_resources": {
            "numb_node": 1,
            "task_per_node": 64,
            "partition": "normal",
            "exclude_list": [],
            "source_list": ["~/miniconda3/bin/activate"],
            "module_list": [],
            "time_limit": "2:00:0",
            "mem_limit": 32,
            "_comment": "that's all"
        },

        "lmp_command": "~/miniconda3/bin/lmp",
        "model_devi_group_size": 1,
        "_comment": "model_devi on localhost",
        "model_devi_machine": {
            "batch": "slurm",
            "work_path": "/scratch/04587/tfcao/ch4-large/ini",
            "_comment": "that's all"
        },
        "_comment": " if use GPU, numb_nodes(nn) should always be 1 ",
        "_comment": " if numb_nodes(nn) = 1 multi-threading rather than mpi is assumed",
        "model_devi_resources": {
            "numb_node": 1,
            "task_per_node": 64,
            "source_list": ["~/miniconda3/bin/activate"],
            "module_list": [],
            "time_limit": "2:00:0",
            "mem_limit": 32,
            "partition": "normal",
            "_comment": "that's all"
        },

        "_comment": "fp on localhost ",
        "fp_command": "ibrun tacc_affinity /home1/04587/tfcao/vasp_bin/regular/vasp",
        "fp_group_size": 1,
        "fp_machine": {
            "batch": "slurm",
            "work_path": "/scratch/04587/tfcao/ch4-large/ini",
            "_comment": "that's all"
        },
        "fp_resources": {
            "numb_node": 1,
            "task_per_node": 64,
            "numb_gpu": 0,
            "exclude_list": [],
            "source_list": [],
            "module_list": [],
            "with_mpi": false,
            "time_limit": "2:00:0",
            "partition": "normal",
            "_comment": "that's all"
        },
        "_comment": " that's all "
    }

The job is submitted with this script:

    #!/bin/bash
    #SBATCH -J dpgen           # Job name
    #SBATCH -o 16core_t.o%j    # Name of stdout output file (%j expands to jobId)
    #SBATCH -e 16core_t.e%j    # Name of stderr output file (%j expands to jobId)
    #SBATCH -p development     # Submit to the 'normal' or 'development' queue
    #SBATCH -N 1               # Total number of nodes requested (16 cores/node)
    #SBATCH -n 64              # Total number of mpi tasks requested
    #SBATCH -t 2:00:00         # Run time (hh:mm:ss) - 24 hours
    #SBATCH -A TG-DMR160007

    module load vasp
    export FORT_BUFFERED=true
    conda activate dpgenDev

    ibrun tacc_affinity /home1/apps/intel18/impi18_0/qe/6.3/bin/pw.x < o3-scf.in > o3-scf.out
    ibrun tacc_affinity nohup dpgen init_bulk param.json machine.json > log.out
    ibrun ./vasp_std >& result

It generates a lot of folders:

    0a8d7f28-bd60-41d3-8ca8-29dde2958622  2dfa4287-edd6-4930-8cca-3b07e058d5e0  c3c2d483-e1e8-4626-a5c9-71745bae51da  INCAR_rlx           POSCAR
    120e0ee3-38f1-4283-a2d5-cfbb6dd6c8b8  3504d82d-cf50-411e-a898-e404b9f74e25  d561eb83-0fa1-4dfa-b857-574f24af85e6  log.out             POSCAR.01x01x01
    16core_t.e7938648                     63cdb969-4efd-4dc2-9821-69979bdc2045  dpgen.log                             machine.json        POT_C
    16core_t.o7938648                     646ce65f-cffa-477b-8e97-41043a552b0a  fad15291-4f54-48ef-b0de-151bad13b9d3  machine.json-back   POT_H
    1d881101-e8a7-4907-be32-833c8cf01427  a3b903a7-7efe-4bc5-a421-ed76e3eb19f6  INCAR_md                              machine.json-back1  submit.sh
    251684b0-82aa-47e9-9ddb-b5b2c5b35e55  b524be4b-3ac7-4f8c-97a5-27263b44fc19  INCAR_methane

But the job in each folder cannot be submitted by the python code. The log information is:

ile "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs sys.exit(main()) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs args.func(args) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk Traceback (most recent call last): File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in args.func(args) job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit args.func(args) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit sys.exit(main()) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 
586, in run_vasp_relax self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall dispatcher.run_jobs(fp_resources, dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs args.func(args) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File 
"/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd 
%s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/63cdb969-4efd-4dc2-9821-69979bdc2045 && sbatch 63cdb969-4efd-4dc2-9821-69979bdc2045.sub', '63cdb969-4efd-4dc2-9821-69979bdc2045')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/fad15291-4f54-48ef-b0de-151bad13b9d3 && sbatch fad15291-4f54-48ef-b0de-151bad13b9d3.sub', 'fad15291-4f54-48ef-b0de-151bad13b9d3')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/646ce65f-cffa-477b-8e97-41043a552b0a && sbatch 646ce65f-cffa-477b-8e97-41043a552b0a.sub', '646ce65f-cffa-477b-8e97-41043a552b0a')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/2dfa4287-edd6-4930-8cca-3b07e058d5e0 && sbatch 2dfa4287-edd6-4930-8cca-3b07e058d5e0.sub', '2dfa4287-edd6-4930-8cca-3b07e058d5e0')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/0a8d7f28-bd60-41d3-8ca8-29dde2958622 && sbatch 0a8d7f28-bd60-41d3-8ca8-29dde2958622.sub', '0a8d7f28-bd60-41d3-8ca8-29dde2958622')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/c3c2d483-e1e8-4626-a5c9-71745bae51da && sbatch c3c2d483-e1e8-4626-a5c9-71745bae51da.sub', 'c3c2d483-e1e8-4626-a5c9-71745bae51da')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/a3b903a7-7efe-4bc5-a421-ed76e3eb19f6 && sbatch a3b903a7-7efe-4bc5-a421-ed76e3eb19f6.sub', 'a3b903a7-7efe-4bc5-a421-ed76e3eb19f6')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/b524be4b-3ac7-4f8c-97a5-27263b44fc19 && sbatch b524be4b-3ac7-4f8c-97a5-27263b44fc19.sub', 'b524be4b-3ac7-4f8c-97a5-27263b44fc19')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/d561eb83-0fa1-4dfa-b857-574f24af85e6 && sbatch d561eb83-0fa1-4dfa-b857-574f24af85e6.sub', 
'd561eb83-0fa1-4dfa-b857-574f24af85e6')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/1d881101-e8a7-4907-be32-833c8cf01427 && sbatch 1d881101-e8a7-4907-be32-833c8cf01427.sub', '1d881101-e8a7-4907-be32-833c8cf01427')) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/120e0ee3-38f1-4283-a2d5-cfbb6dd6c8b8 && sbatch 120e0ee3-38f1-4283-a2d5-cfbb6dd6c8b8.sub', '120e0ee3-38f1-4283-a2d5-cfbb6dd6c8b8')) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/3504d82d-cf50-411e-a898-e404b9f74e25 && sbatch 3504d82d-cf50-411e-a898-e404b9f74e25.sub', '3504d82d-cf50-411e-a898-e404b9f74e25')) raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/251684b0-82aa-47e9-9ddb-b5b2c5b35e55 && sbatch 251684b0-82aa-47e9-9ddb-b5b2c5b35e55.sub', '251684b0-82aa-47e9-9ddb-b5b2c5b35e55'))


AnguseZhang commented 3 years ago

ibrun tacc_affinity nohup dpgen init_bulk param.json machine.json > log.out

Actually I do not know what to say, and I feel frustrated. I have already explained TWICE in this issue: please DO NOT use ibrun to run DP-GEN. DP-GEN's main program does not need to run in parallel; ibrun launches dozens of copies of dpgen, each of which tries to submit the same jobs, so once there is a problem the error log becomes a mess, like the very long interleaved output you provided. We cannot figure out the reason in this case.
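
This would also explain the earlier FileNotFoundError messages: with many copies of dpgen running the same init_bulk step at once, the first copy removes INCAR and POTCAR while the later copies are still trying to. A toy illustration of that race (hypothetical standalone code, not dpgen source):

    import os

    # If N processes all run the same cleanup, only the first os.remove
    # succeeds; the rest see FileNotFoundError even though the file existed
    # when the job started.
    open('INCAR', 'w').close()

    os.remove('INCAR')       # first process: succeeds
    try:
        os.remove('INCAR')   # any later process: FileNotFoundError, as in the log
    except FileNotFoundError as err:
        print(err)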

tfcao888666 commented 3 years ago

Hi Yuzi, I have tested it. It is not an issue with ibrun; even if I delete it, the error is the same.


AnguseZhang commented 3 years ago

Hi Yuzi, I have tested it. It is not an issue with ibrun; even if I delete it, the error is the same.

You directly run dpgen init_bulk param.json machine.json and it still generates so many error logs?

AnguseZhang commented 3 years ago

Please execute this script and tell me the result.

import subprocess as sp

# This runs essentially what DP-GEN's LocalContext.block_checkcall does:
# cd into one generated job directory and sbatch the .sub script there.
cmd = "cd /scratch/04587/tfcao/ch4-large/ini/e56c7baa-f2ff-448b-b546-adde0ac82b91 && sbatch e56c7baa-f2ff-448b-b546-adde0ac82b91.sub"
proc = sp.Popen(cmd, shell=True, stdout=sp.PIPE, stderr=sp.PIPE)
o, e = proc.communicate()
print("Return code:", proc.returncode)       # non-zero means the sbatch call failed
print("O:", o.decode('utf-8').splitlines())  # stdout: DP-GEN parses the job id from here
print("E:", e.decode('utf-8').splitlines())  # stderr: any sbatch error message

Did you try this? What's the result?

tfcao888666 commented 3 years ago

    #!/bin/bash
    #SBATCH -J dpgen           # Job name
    #SBATCH -o 16core_t.o%j    # Name of stdout output file (%j expands to jobId)
    #SBATCH -e 16core_t.e%j    # Name of stderr output file (%j expands to jobId)
    #SBATCH -p development     # Submit to the 'normal' or 'development' queue
    #SBATCH -N 1               # Total number of nodes requested (16 cores/node)
    #SBATCH -n 64              # Total number of mpi tasks requested
    #SBATCH -t 2:00:00         # Run time (hh:mm:ss) - 24 hours
    #SBATCH -A TG-DMR160007

    module load vasp
    export FORT_BUFFERED=true
    conda activate dpgenDev

    ibrun tacc_affinity /home1/apps/intel18/impi18_0/qe/6.3/bin/pw.x < o3-scf.in > o3-scf.out
    ibrun tacc_affinity nohup dpgen init_bulk param.json machine.json > log.out
    nohup dpgen init_bulk param.json machine.json > log.out
    ibrun ./vasp_std >& result

File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/950dc254-e5bc-4cc2-ab0f-c6094fe74f3f && sbatch 950dc254-e5bc-4cc2-ab0f-c6094fe74f3f.sub', '950dc254-e5bc-4cc2-ab0f-c6094fe74f3f'))


tfcao888666 commented 3 years ago

How to try this?


Which subroutine should I put it in? If I directly run "dpgen init_bulk param.json machine.json", the error is:

    (base) login3.stampede2(1178)$ dpgen init_bulk param.json machine.json

    DeepModeling
    ------------
    Version: 0.9.3.dev17+gcc816d5
    Date:    Jun-20-2021
    Path:    /home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen

    Reference
    ------------
    Please cite:
    Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E,
    DP-GEN: A concurrent learning platform for the generation of reliable deep learning
    based potential energy models, Computer Physics Communications, 2020, 107206.

    Description
    ------------
    Traceback (most recent call last):
      File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in <module>
        sys.exit(main())
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main
        args.func(args)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk
        run_vasp_relax(jdata, mdata)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax
        dispatcher.run_jobs(fp_resources,
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs
        job_handler = self.submit_jobs(resources,
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs
        rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit
        self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog)
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 41, in do_submit
        job_id = subret[0].split()[-1]
    IndexError: list index out of range


AnguseZhang commented 3 years ago

How to try this? Which subroutine should I put it in?

That is a Python script. Write that code into a file, e.g. test.py, and run it with python test.py.

AnguseZhang commented 3 years ago

Your Slurm doesn't have a normal configuration. After you sbatch xxx.sub, it prints a lot of useless extra information, which prevents DP-GEN from reading the job_id of the job you just submitted. Can you ask the admin whether this extra output can be removed?
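
To see why the extra output matters, here is a rough sketch of what the dispatcher does, reconstructed from the tracebacks above (variable names and paths are illustrative, not the exact dpgen source):

    import subprocess as sp

    # Slurm.py line 39: submit the generated .sub script via a local shell call.
    proc = sp.Popen("cd /path/to/job_dir && sbatch job.sub", shell=True,
                    stdout=sp.PIPE, stderr=sp.PIPE)
    o, e = proc.communicate()

    # LocalContext.py line 147: a non-zero exit code raises the RuntimeError
    # ("Get error code %d in locally calling %s ...") seen in the logs.
    if proc.returncode != 0:
        raise RuntimeError("Get error code %d in locally calling sbatch" % proc.returncode)

    # Slurm.py line 41: the job id is taken as the last word of the first
    # stdout line. On a standard Slurm that line is "Submitted batch job <id>";
    # if the site prints banners first, the wrong token is parsed, and if
    # stdout is empty, subret[0] raises the IndexError reported above.
    subret = o.decode('utf-8').splitlines()
    job_id = subret[0].split()[-1]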

njzjz commented 3 years ago

Can you append #SBATCH --parsable to your script, and see what the output will be.

--parsable Outputs only the job id number and the cluster name if present. The values are separated by a semicolon. Errors will still be displayed.

https://slurm.schedmd.com/sbatch.html
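
With --parsable, sbatch prints only the job id (plus ";cluster" if a cluster name is present) instead of the usual sentence, roughly like this (the job id here is a made-up value):

    $ sbatch job.sub
    Submitted batch job 1234567
    $ sbatch --parsable job.sub
    1234567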

tfcao888666 commented 3 years ago

    #!/bin/bash
    #SBATCH -J dpgen           # Job name
    #SBATCH -o 16core_t.o%j    # Name of stdout output file (%j expands to jobId)
    #SBATCH -e 16core_t.e%j    # Name of stderr output file (%j expands to jobId)
    #SBATCH -p development     # Submit to the 'normal' or 'development' queue
    #SBATCH -N 1               # Total number of nodes requested (16 cores/node)
    #SBATCH -n 64              # Total number of mpi tasks requested
    #SBATCH -t 2:00:00         # Run time (hh:mm:ss) - 24 hours
    #SBATCH -A TG-DMR160007
    #SBATCH --parsable

    module load vasp
    export FORT_BUFFERED=true
    conda activate dpgenDev

    ibrun tacc_affinity /home1/apps/intel18/impi18_0/qe/6.3/bin/pw.x < o3-scf.in > o3-scf.out
    ibrun tacc_affinity nohup dpgen init_bulk param.json machine.json > log.out
    nohup dpgen init_bulk param.json machine.json > log.out
    ibrun ./vasp_std >& result

The output is:

    Traceback (most recent call last):
      ...
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit
        stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name))
      File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall
        raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid))
    RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/86ae7644-4be1-4af9-9ccc-a80c355561f5 && sbatch 86ae7644-4be1-4af9-9ccc-a80c355561f5.sub', '86ae7644-4be1-4af9-9ccc-a80c355561f5'))

(the elided frames are the same call chain as in the earlier tracebacks: dpgen -> main -> gen_init_bulk -> run_vasp_relax -> run_jobs -> submit_jobs -> submit -> do_submit)


tfcao888666 commented 3 years ago

Hi all, if I run dpgen on the login node with the command "dpgen init_bulk param.json machine.json > log.out", the job runs with no problems. However, if I submit it with the script:

    #!/bin/bash
    #SBATCH -J dpgen           # Job name
    #SBATCH -o 16core_t.o%j    # Name of stdout output file (%j expands to jobId)
    #SBATCH -e 16core_t.e%j    # Name of stderr output file (%j expands to jobId)
    #SBATCH -p development     # Submit to the 'normal' or 'development' queue
    #SBATCH -N 1               # Total number of nodes requested (16 cores/node)
    #SBATCH -n 64              # Total number of mpi tasks requested
    #SBATCH -t 1:00:00         # Run time (hh:mm:ss) - 24 hours
    #SBATCH -A TG-DMR160007
    #SBATCH --parsable

    module load vasp
    export FORT_BUFFERED=true
    conda activate dpgenDev

    ibrun tacc_affinity /home1/apps/intel18/impi18_0/qe/6.3/bin/pw.x < o3-scf.in > o3-scf.out
    ibrun tacc_affinity nohup dpgen init_bulk param.json machine.json > log.out
    dpgen init_bulk param.json machine.json > log.out
    ibrun ./vasp_std >& result

~ " It still has this error: "Traceback (most recent call last): File "/home1/04587/tfcao/.local/bin/dpgen", line 8, in sys.exit(main()) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/main.py", line 175, in main args.func(args) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 711, in gen_init_bulk run_vasp_relax(jdata, mdata) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/data/gen.py", line 586, in run_vasp_relax dispatcher.run_jobs(fp_resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 80, in run_jobs job_handler = self.submit_jobs(resources, File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Dispatcher.py", line 150, in submit_jobs rjob['batch'].submit(cur_chunk, command, res = resources, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Batch.py", line 123, in submit self.do_submit(job_dirs, cmd, args, res, outlog=outlog, errlog=errlog) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/Slurm.py", line 39, in do_submit stdin, stdout, stderr = self.context.block_checkcall('cd %s && %s %s' % (self.context.remote_root, 'sbatch', self.sub_script_name)) File "/home1/04587/tfcao/.local/lib/python3.9/site-packages/dpgen/dispatcher/LocalContext.py", line 147, in block_checkcall raise RuntimeError("Get error code %d in locally calling %s with job: %s ", (code, cmd, self.job_uuid)) RuntimeError: ('Get error code %d in locally calling %s with job: %s ', (1, 'cd /scratch/04587/tfcao/ch4-large/ini/64190664-50f5-4037-97a7-0a6885893d9c && sbatch 64190664-50f5-4037-97a7-0a6885893d9c.sub', '64190664-50f5-4037-97a7-0a6885893d9c')) " Sorry, for bother you! Thanks!


AnguseZhang commented 3 years ago

This issue is solved, so I've closed it. If there is still any problem, you can reopen this issue or create a new one.