deepmodeling / dpgen

The deep potential generator to generate a deep-learning based model of interatomic potential energy and force field
https://docs.deepmodeling.com/projects/dpgen/
GNU Lesser General Public License v3.0

The problem of dpgen on Slurm system #360

Closed zcb-code closed 3 years ago

zcb-code commented 3 years ago

I do not know what is wrong: the machine file or the installation? The details are below.


My installation steps:

wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2020.02-Linux-x86_64.sh
conda create -n deepc python=3.6 libprotobuf==3.8.0
conda activate deepc
conda install deepmd-kit=*=*cpu lammps-dp=*=*cpu -c deepmodeling
pip install pymatgen==2019.6.5 monty==2.0.4 ase==3.17.0 paramiko==2.6.0 custodian==2019.2.10 dpgen==0.8.1


No errors occurred during installation. The CH4 example finishes without errors on my desktop computer, but it does not run properly on the supercomputer. The OS version of our supercomputer is CentOS Linux release 7.7.1908 (Core).

I ran the CH4 example on the Slurm system with:

dpgen run param.json machine-slurm.json >log 2>error

The record.dpgen file:

0 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
1 0


The error file:

Traceback (most recent call last):
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/main.py", line 182, in main
    args.func(args)
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/generator/run.py", line 2340, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/generator/run.py", line 2303, in run_iter
    run_train (ii, jdata, mdata)
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/generator/run.py", line 530, in run_train
    errlog = 'train.log')
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/dispatcher/Dispatcher.py", line 91, in run_jobs
    while not self.all_finished(job_handler, mark_failure) :
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/dispatcher/Dispatcher.py", line 216, in all_finished
    raise RuntimeError('Job %s failed for more than 3 times' % job_uuid)
RuntimeError: Job f01bba9e-181e-4fad-8b96-c61db8350cf5 failed for more than 3 times


The dpgen.log file:

2021-04-01 17:08:16,365 - INFO : =============================iter.000001==============================
2021-04-01 17:08:16,365 - INFO : -------------------------iter.000001 task 00--------------------------
2021-04-01 17:08:16,407 - INFO : -------------------------iter.000001 task 01--------------------------
2021-04-01 17:08:16,457 - INFO : new submission of f01bba9e-181e-4fad-8b96-c61db8350cf5 for chunk 8aefb06c426e07a0a671a1e2488b4858d694a730
2021-04-01 17:08:16,530 - INFO : new submission of f9cf7153-403f-4f3e-9c4e-928aae9010a8 for chunk e193a01ecf8d30ad0affefd332ce934e32ffce72
2021-04-01 17:08:16,576 - INFO : new submission of 9ac8ba18-fbda-4e38-98c4-a97637433a5f for chunk 6fc978af728d43c59faa400d5f6e0471ac850d4c
2021-04-01 17:08:16,618 - INFO : new submission of 604657a5-597a-4185-9b24-b6f62216e38b for chunk 221407c03ae5c73109cce71d27e24637824f3333
2021-04-01 17:09:16,797 - INFO : job f01bba9e-181e-4fad-8b96-c61db8350cf5 terminated, submit again
2021-04-01 17:09:16,875 - INFO : job f9cf7153-403f-4f3e-9c4e-928aae9010a8 terminated, submit again
2021-04-01 17:09:16,950 - INFO : job 9ac8ba18-fbda-4e38-98c4-a97637433a5f terminated, submit again
2021-04-01 17:09:17,013 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again
2021-04-01 17:10:17,192 - INFO : job f01bba9e-181e-4fad-8b96-c61db8350cf5 terminated, submit again
2021-04-01 17:10:17,278 - INFO : job f9cf7153-403f-4f3e-9c4e-928aae9010a8 terminated, submit again
2021-04-01 17:10:17,344 - INFO : job 9ac8ba18-fbda-4e38-98c4-a97637433a5f terminated, submit again
2021-04-01 17:10:17,402 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again
2021-04-01 17:11:17,538 - INFO : job f01bba9e-181e-4fad-8b96-c61db8350cf5 terminated, submit again
2021-04-01 17:11:17,653 - INFO : job f9cf7153-403f-4f3e-9c4e-928aae9010a8 terminated, submit again
2021-04-01 17:11:17,717 - INFO : job 9ac8ba18-fbda-4e38-98c4-a97637433a5f terminated, submit again
2021-04-01 17:11:17,786 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again


In the f01bba9e-181e-4fad-8b96-c61db8350cf5 folder, the four slurm-*.out files all have the same content:

/home/XXX/scratch/anaconda3_bob/bin/conda: line 3: import: command not found
/home/XXX/scratch/anaconda3_bob/bin/conda: line 6: syntax error near unexpected token `sys.argv'
/home/XXX/scratch/anaconda3_bob/bin/conda: line 6: `if len(sys.argv) > 1 and sys.argv[1].startswith('shell.') and sys.path and sys.path[0] == '':'


The content of /home/XXX/scratch/anaconda3_bob/bin/conda:

#!/home/elgao/scratch/anaconda3_bob/bin/python
# -*- coding: utf-8 -*-
import sys
# Before any more imports, leave cwd out of sys.path for internal 'conda shell.*' commands.
# see https://github.com/conda/conda/issues/6549
if len(sys.argv) > 1 and sys.argv[1].startswith('shell.') and sys.path and sys.path[0] == '':
    # The standard first entry in sys.path is an empty string,
    # and os.path.abspath('') expands to os.getcwd().
    del sys.path[0]

if __name__ == '__main__':
    from conda.cli import main
    sys.exit(main())


The machine file machine-slurm.json:

{
  "train": [
    {
      "machine": {
        "batch": "slurm",
        "work_path": "/home/XXX/scratch/deepmd_bob/dpwork"
      },
      "resources": {
        "numb_node": 1,
        "numb_gpu": 0,
        "task_per_node": 8,
        "with_mpi": false,
        "name": "dp_pj",
        "partition": "pub",
        "time_limit": "3600:00:00",
        "exclude_list": [],
        "source_list": [ " ~/.bashrc", "conda activate deepc" ],
        "module_list": []
      },
      "command": "dp",
      "group size": 1
    }
  ],
  "model_devi": [
    {
      "machine": {
        "batch": "slurm",
        "work_path": "/home/XXX/scratch/deepmd_bob/dpwork"
      },
      "resources": {
        "numb_node": 1,
        "numb_gpu": 0,
        "task_per_node": 16,
        "with_mpi": false,
        "partition": "pub",
        "name": "lmp",
        "time_limit": "3600:00:00",
        "exclude_list": [],
        "source_list": [ " ~/.bashrc", "conda activate deepc" ],
        "module_list": []
      },
      "command": "mpirun -np 8 /scratch/XXX/anaconda3_bob/envs/deepc/bin/lmp ",
      "group_size": 2
    }
  ],
  "fp": [
    {
      "machine": {
        "batch": "slurm",
        "work_path": "/home/XXX/scratch/deepmd_bob/dpwork"
      },
      "resources": {
        "numb_node": 1,
        "numb_gpu": 0,
        "task_per_node": 16,
        "exclude_list": [],
        "with_mpi": false,
        "name": "aimd_zcb",
        "source_list": [],
        "module_list": [ "module load intel" ],
        "partition": "pub",
        "time_limit": "3600:00:00",
        "_comment": "that's All"
      },
      "command": " srun -n 16 /project/XXX/00_apps/vasp6.1_isif_vtst_bob",
      "group_size": 10
    }
  ]
}


Please help me out of this.

AnguseZhang commented 3 years ago

Thanks for your report and necessary information.

First, as DP-GEN reports ("2021-04-01 17:10:17,402 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again"), a DeePMD-kit job is failing. So you only need to figure out what goes wrong in the DeePMD-kit job; DP-GEN itself is not the problem.
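To see which jobs keep failing, the resubmission messages can be tallied per job. The following is only a minimal sketch, assuming the log keeps the "job <uuid> terminated, submit again" wording shown in the dpgen.log excerpt above; count_resubmissions is a hypothetical helper name, not part of DP-GEN.

```python
import re
from collections import Counter

# Wording taken from the dpgen.log excerpt in this thread (assumed stable).
RESUBMIT_RE = re.compile(r"job ([0-9a-f-]{36}) terminated, submit again")


def count_resubmissions(log_text):
    """Return a Counter mapping job UUID -> number of resubmission messages."""
    return Counter(RESUBMIT_RE.findall(log_text))


if __name__ == "__main__":
    # Hypothetical usage: read the log written by `dpgen run` in the current folder.
    with open("dpgen.log") as handle:
        for job, count in count_resubmissions(handle.read()).items():
            print(job, count)
```

A job that appears three times here is one that the dispatcher is about to give up on, so its folder is the place to look for slurm-*.out files.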

Second, go to the working directory of your DeePMD-kit jobs, where you will find a file named "xxxx.sub". Edit this script until it can be submitted to Slurm successfully.

Third, one common conda-related error is activation. If "source_list" in machine.json is [" ~/.bashrc", "conda activate deepc"], then your "xxx.sub" file will probably contain "source ~/.bashrc" and "source conda activate deepc", right? The second line makes bash read the conda launcher, which is a Python script, as shell code, and that is what produces "import: command not found" in your slurm-*.out files. So replace that entry with "activate deepc", so that the "xxx.sub" file contains "source activate deepc". Then manually "sbatch" the script and check whether DeePMD-kit runs.
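The effect can be illustrated with a short sketch. It assumes, as described in this comment, that every "source_list" entry ends up prefixed with "source" in the generated script; both function names are hypothetical and do not come from DP-GEN.

```python
def render_source_lines(source_list):
    """Mimic the assumed behaviour: prefix each entry with 'source'."""
    return ["source " + entry.strip() for entry in source_list]


def find_suspicious_entries(source_list):
    """Return entries that would become 'source conda ...' in the script."""
    return [entry for entry in source_list if entry.strip().startswith("conda ")]


if __name__ == "__main__":
    reported = [" ~/.bashrc", "conda activate deepc"]  # from machine-slurm.json
    suggested = [" ~/.bashrc", "activate deepc"]       # the suggested replacement
    for source_list in (reported, suggested):
        print(render_source_lines(source_list), find_suspicious_entries(source_list))
```

With the reported list, the second rendered line is "source conda activate deepc"; with the suggested list it becomes "source activate deepc" and nothing is flagged.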

AnguseZhang commented 3 years ago

I also have a question. Since your record.dpgen shows that the first iteration (iter0) finished successfully, why does iter1 fail? Was there any difference in the environment or in the machine file?

zcb-code commented 3 years ago

Yes, after I replaced "source conda activate deepc" with "conda activate deepc", it worked. Thanks.

zcb-code commented 3 years ago

In reply to "Since your record.dpgen demonstrates that the first iteration (iter0) has successfully finished, what's the reason that iter1 fails? What's difference of the environment of machine file ?": I reran from 0 3 and found "module load module load intel", but it still worked well with VASP.
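The doubled "module load" suggests that "module load" is prepended to each "module_list" entry, so the entry itself should probably be just "intel". A minimal sketch of cleaning such entries, under that assumption (strip_module_load is a hypothetical helper, not a DP-GEN function):

```python
def strip_module_load(module_list):
    """Drop a leading 'module load' from each entry, keeping only the module name."""
    cleaned = []
    for entry in module_list:
        words = entry.split()
        if words[:2] == ["module", "load"]:
            words = words[2:]
        cleaned.append(" ".join(words))
    return cleaned


if __name__ == "__main__":
    print(strip_module_load(["module load intel"]))
```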

AnguseZhang commented 3 years ago

This issue is solved. I've closed this issue. If there is still any problem, you can reopen this issue or create a new issue.