zcb-code closed this issue 3 years ago
Thanks for your report and necessary information.
First, DP-GEN reports "2021-04-01 17:10:17,402 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again", which means a DeePMD-kit job failed. So you only need to figure out what happened in your DeePMD-kit job; you don't need to worry about DP-GEN itself.
Second, go to the directory of your DeePMD-kit jobs, where you will see a file named "xxxx.sub". Try modifying this script until it can be submitted to Slurm by hand.
Third, one common conda-related error is in its activation. If you specify "source_list" in machine.json as ["~/.bashrc", "conda activate deepc"], your "xxx.sub" file probably contains "source ~/.bashrc" and "source conda activate deepc", right? You can replace the second entry with "activate deepc", so that the "xxx.sub" file contains "source activate deepc". Then you can manually "sbatch" this script and see whether DeePMD-kit works.
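To make the third point concrete, here is a minimal sketch of how each "source_list" entry might end up in the "xxx.sub" script. The function `render_source_lines` is hypothetical and only models the behaviour described above (every entry gets a "source " prefix); it is not DP-GEN's actual code. With "source conda activate deepc", the shell most likely treats `conda` as a file to read as a shell script, finds the Python launcher `bin/conda` on the PATH, and then fails on its `import sys` line, which would match the "import: command not found" message in the slurm-*.out files.

```python
def render_source_lines(source_list):
    """Hypothetical model: prefix every source_list entry with 'source '."""
    return ["source " + entry.strip() for entry in source_list]


# Entries as written in the reported machine file.
before = render_source_lines([" ~/.bashrc", "conda activate deepc"])
# Entries after the suggested change.
after = render_source_lines([" ~/.bashrc", "activate deepc"])

print("\n".join(before))
print("\n".join(after))
```

If the generated script really is built this way, only the second variant gives `source` a file it can sensibly read (conda's `activate` script).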
I also have a question. Since your record.dpgen shows that the first iteration (iter0) finished successfully, why does iter1 fail? Is there any difference in the environment or in the machine file?
Yes, after I replaced "source conda activate deepc" with "conda activate deepc", it worked. Thanks.
"I also have a question. Since your record.dpgen demonstrates that the first iteration (iter0) has successfully finished, what's the reason that iter1 fails? What's difference of the environment of machine file ?" I reran from stage 0 3 and found "module load module load intel", but it worked well with VASP.
This issue is solved. I've closed this issue. If there is still any problem, you can reopen this issue or create a new issue.
I do not know what is wrong: the machine file or the installation? The details are below:
My installation steps:
    wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2020.02-Linux-x86_64.sh
    conda create -n deepc python=3.6 libprotobuf==3.8.0
    conda activate deepc
    conda install deepmd-kit==cpu lammps-dp==cpu -c deepmodeling
    pip install pymatgen==2019.6.5 monty==2.0.4 ase==3.17.0 paramiko==2.6.0 custodian==2019.2.10 dpgen==0.8.1
There were no errors during installation. The CH4 example finishes on my desktop computer with no errors, but it does not run properly on the supercomputer. The OS version of our supercomputer is: CentOS Linux release 7.7.1908 (Core).
I ran the CH4 example on the Slurm system with:
    dpgen run param.json machine-slurm.json >log 2>error
The record.dpgen:
    0 0
    0 1
    0 2
    0 3
    0 4
    0 5
    0 6
    0 7
    0 8
    1 0
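For readers unfamiliar with record.dpgen, each line is an "iteration stage" pair marking a finished step. The sketch below reads the record above and reports the last finished stage and the one that would run next. The stage names follow the order documented for DP-GEN's run workflow as I understand it; the helper names are my own, so treat this as an illustration rather than DP-GEN code.

```python
# Stage order as documented for dpgen run (indices 0 to 8).
STAGE_NAMES = [
    "make_train", "run_train", "post_train",
    "make_model_devi", "run_model_devi", "post_model_devi",
    "make_fp", "run_fp", "post_fp",
]


def last_finished(record_text):
    """Return (iteration, stage_name) for the last line of record.dpgen."""
    pairs = [tuple(int(x) for x in line.split())
             for line in record_text.splitlines() if line.strip()]
    iteration, stage = pairs[-1]
    return iteration, STAGE_NAMES[stage]


def next_stage(iteration, stage_name):
    """Return the (iteration, stage_name) that follows the given one."""
    idx = STAGE_NAMES.index(stage_name) + 1
    if idx == len(STAGE_NAMES):
        return iteration + 1, STAGE_NAMES[0]
    return iteration, STAGE_NAMES[idx]


record = "0 0\n0 1\n0 2\n0 3\n0 4\n0 5\n0 6\n0 7\n0 8\n1 0\n"
done = last_finished(record)
print("last finished:", done)
print("next to run:", next_stage(*done))
```

For this record the next stage is run_train of iteration 1, which agrees with the traceback ending in run_train.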
The error file:
Traceback (most recent call last):
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/main.py", line 182, in main
    args.func(args)
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/generator/run.py", line 2340, in gen_run
    run_iter (args.PARAM, args.MACHINE)
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/generator/run.py", line 2303, in run_iter
    run_train (ii, jdata, mdata)
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/generator/run.py", line 530, in run_train
    errlog = 'train.log')
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/dispatcher/Dispatcher.py", line 91, in run_jobs
    while not self.all_finished(job_handler, mark_failure) :
  File "/home/elgao/scratch/anaconda3_bob/envs/deepc/lib/python3.6/site-packages/dpgen/dispatcher/Dispatcher.py", line 216, in all_finished
    raise RuntimeError('Job %s failed for more than 3 times' % job_uuid)
RuntimeError: Job f01bba9e-181e-4fad-8b96-c61db8350cf5 failed for more than 3 times
The dpgen.log file:
    2021-04-01 17:08:16,365 - INFO : =============================iter.000001==============================
    2021-04-01 17:08:16,365 - INFO : -------------------------iter.000001 task 00--------------------------
    2021-04-01 17:08:16,407 - INFO : -------------------------iter.000001 task 01--------------------------
    2021-04-01 17:08:16,457 - INFO : new submission of f01bba9e-181e-4fad-8b96-c61db8350cf5 for chunk 8aefb06c426e07a0a671a1e2488b4858d694a730
    2021-04-01 17:08:16,530 - INFO : new submission of f9cf7153-403f-4f3e-9c4e-928aae9010a8 for chunk e193a01ecf8d30ad0affefd332ce934e32ffce72
    2021-04-01 17:08:16,576 - INFO : new submission of 9ac8ba18-fbda-4e38-98c4-a97637433a5f for chunk 6fc978af728d43c59faa400d5f6e0471ac850d4c
    2021-04-01 17:08:16,618 - INFO : new submission of 604657a5-597a-4185-9b24-b6f62216e38b for chunk 221407c03ae5c73109cce71d27e24637824f3333
    2021-04-01 17:09:16,797 - INFO : job f01bba9e-181e-4fad-8b96-c61db8350cf5 terminated, submit again
    2021-04-01 17:09:16,875 - INFO : job f9cf7153-403f-4f3e-9c4e-928aae9010a8 terminated, submit again
    2021-04-01 17:09:16,950 - INFO : job 9ac8ba18-fbda-4e38-98c4-a97637433a5f terminated, submit again
    2021-04-01 17:09:17,013 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again
    2021-04-01 17:10:17,192 - INFO : job f01bba9e-181e-4fad-8b96-c61db8350cf5 terminated, submit again
    2021-04-01 17:10:17,278 - INFO : job f9cf7153-403f-4f3e-9c4e-928aae9010a8 terminated, submit again
    2021-04-01 17:10:17,344 - INFO : job 9ac8ba18-fbda-4e38-98c4-a97637433a5f terminated, submit again
    2021-04-01 17:10:17,402 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again
    2021-04-01 17:11:17,538 - INFO : job f01bba9e-181e-4fad-8b96-c61db8350cf5 terminated, submit again
    2021-04-01 17:11:17,653 - INFO : job f9cf7153-403f-4f3e-9c4e-928aae9010a8 terminated, submit again
    2021-04-01 17:11:17,717 - INFO : job 9ac8ba18-fbda-4e38-98c4-a97637433a5f terminated, submit again
    2021-04-01 17:11:17,786 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again
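A log like this can be summarised by counting how often each job was resubmitted. The sketch below does that with a regular expression over a few lines copied from the log above; the function name `count_resubmissions` is my own, and the only assumption about the log format is the "job <uuid> terminated, submit again" wording visible in it.

```python
import re
from collections import Counter

# Matches the resubmission messages seen in dpgen.log.
PATTERN = re.compile(r"job ([0-9a-f-]{36}) terminated, submit again")


def count_resubmissions(log_text):
    """Count 'terminated, submit again' messages per job UUID."""
    return Counter(PATTERN.findall(log_text))


sample_log = """\
2021-04-01 17:09:16,797 - INFO : job f01bba9e-181e-4fad-8b96-c61db8350cf5 terminated, submit again
2021-04-01 17:09:17,013 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again
2021-04-01 17:10:17,192 - INFO : job f01bba9e-181e-4fad-8b96-c61db8350cf5 terminated, submit again
2021-04-01 17:10:17,402 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again
2021-04-01 17:11:17,538 - INFO : job f01bba9e-181e-4fad-8b96-c61db8350cf5 terminated, submit again
2021-04-01 17:11:17,786 - INFO : job 604657a5-597a-4185-9b24-b6f62216e38b terminated, submit again
"""

counts = count_resubmissions(sample_log)
for job, n in counts.items():
    print(job, n)
```

Every job is terminated about a minute after each submission and resubmitted three times, which is consistent with the "failed for more than 3 times" error and points at the job script itself rather than at the training.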
In the f01bba9e-181e-4fad-8b96-c61db8350cf5 folder, the four slurm-*.out files have the same content:
    /home/XXX/scratch/anaconda3_bob/bin/conda: line 3: import: command not found
    /home/XXX/scratch/anaconda3_bob/bin/conda: line 6: syntax error near unexpected token `sys.argv'
    /home/XXX/scratch/anaconda3_bob/bin/conda: line 6: `if len(sys.argv) > 1 and sys.argv[1].startswith('shell.') and sys.path and sys.path[0] == '':'
The file /home/XXX/scratch/anaconda3_bob/bin/conda:
    #!/home/elgao/scratch/anaconda3_bob/bin/python
    # -*- coding: utf-8 -*-
    import sys
    # Before any more imports, leave cwd out of sys.path for internal 'conda shell.*' commands.
    # see https://github.com/conda/conda/issues/6549
    if len(sys.argv) > 1 and sys.argv[1].startswith('shell.') and sys.path and sys.path[0] == '':
        # The standard first entry in sys.path is an empty string,
    if __name__ == '__main__':
        from conda.cli import main
        sys.exit(main())
The machine file machine-slurm.json:
    {
      "train": [
        {
          "machine": {
            "batch": "slurm",
            "work_path": "/home/XXX/scratch/deepmd_bob/dpwork"
          },
          "resources": {
            "numb_node": 1,
            "numb_gpu": 0,
            "task_per_node": 8,
            "with_mpi": false,
            "name": "dp_pj",
            "partition": "pub",
            "time_limit": "3600:00:00",
            "exclude_list": [],
            "source_list": [ " ~/.bashrc", "conda activate deepc" ],
            "module_list": []
          },
          "command": "dp",
          "group size": 1
        }
      ],
      "model_devi": [
        {
          "machine": {
            "batch": "slurm",
            "work_path": "/home/XXX/scratch/deepmd_bob/dpwork"
          },
          "resources": {
            "numb_node": 1,
            "numb_gpu": 0,
            "task_per_node": 16,
            "with_mpi": false,
            "partition": "pub",
            "name": "lmp",
            "time_limit": "3600:00:00",
            "exclude_list": [],
            "source_list": [ " ~/.bashrc", "conda activate deepc" ],
            "module_list": []
          },
          "command": "mpirun -np 8 /scratch/XXX/anaconda3_bob/envs/deepc/bin/lmp ",
          "group_size": 2
        }
      ],
      "fp": [
        {
          "machine": {
            "batch": "slurm",
            "work_path": "/home/XXX/scratch/deepmd_bob/dpwork"
          },
          "resources": {
            "numb_node": 1,
            "numb_gpu": 0,
            "task_per_node": 16,
            "exclude_list": [],
            "with_mpi": false,
            "name": "aimd_zcb",
            "source_list": [],
            "module_list": [ "module load intel" ],
            "partition": "pub",
            "time_limit": "3600:00:00",
            "_comment": "that's All"
          },
          "command": " srun -n 16 /project/XXX/00_apps/vasp6.1_isif_vtst_bob",
          "group_size": 10
        }
      ]
    }
Please help me out of this.
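A quick check of a machine file like the one above can catch entries that would end up with a misplaced or doubled prefix in the generated job script. The sketch below assumes, based on what was observed in this thread, that "source_list" entries are prefixed with "source " and "module_list" entries with "module load "; the function name and the trimmed JSON are illustrative only, not part of DP-GEN.

```python
import json


def find_doubled_prefixes(machine):
    """Flag entries that would read badly once a prefix is prepended."""
    warnings = []
    for stage, entries in machine.items():
        for entry in entries:
            resources = entry.get("resources", {})
            for item in resources.get("source_list", []):
                # 'source conda ...' makes the shell read the conda launcher.
                if item.strip().startswith("conda "):
                    warnings.append(f"{stage}: source {item.strip()}")
            for item in resources.get("module_list", []):
                # An entry that already says 'module load' gets it twice.
                if item.strip().startswith("module load "):
                    warnings.append(f"{stage}: module load {item.strip()}")
    return warnings


# Trimmed copy of the relevant fields from machine-slurm.json.
machine = json.loads("""
{
  "train": [{"resources": {"source_list": [" ~/.bashrc", "conda activate deepc"],
                           "module_list": []}}],
  "fp": [{"resources": {"source_list": [],
                        "module_list": ["module load intel"]}}]
}
""")

warnings = find_doubled_prefixes(machine)
for line in warnings:
    print(line)
```

Under those assumptions the check reports "source conda activate deepc" for the train stage and "module load module load intel" for the fp stage, the two lines discussed in this issue.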