deepmodeling / dpdispatcher

generate HPC scheduler systems jobs input scripts and submit these scripts to HPC systems and poke until they finish
https://docs.deepmodeling.com/projects/dpdispatcher/
GNU Lesser General Public License v3.0

Request "jsub" batch_type #449

Closed WangSimin123456 closed 3 months ago

WangSimin123456 commented 5 months ago

Hi everyone, I am a new user of dpgen. I use the "jsub" command to submit dpmd tasks, and everything works. However, I found that dpgen does not support a "jsub" batch_type. Could you add it? Thanks.

njzjz commented 5 months ago

Is it an open-source software? No one can help if it is not accessible.

WangSimin123456 commented 4 months ago

I think "jsub" is like "PBS" or "Slurm": just a job submission program on Linux. To be honest, I don't know whether it is open-source software, but I think so. I see others also using "jsub" to manage submitted tasks (https://hpc.nwsuaf.edu.cn/fwznB/2411B.htm). I just asked the HPC administrator, and he said "jsub" is free.

njzjz commented 4 months ago

It looks like commercial software: http://www.jhinno.com/m/custom_case_05.html

In this case, you might need to contribute and test the code yourself. Others are unable to test it in a production environment.

WangSimin123456 commented 4 months ago

Ok, got it.

WangSimin123456 commented 4 months ago

Sorry to bother you again. I manually added the "jsub" batch_type by changing the dpdispatcher source code (in fact it should be a "JH_UniScheduler" batch_type, very similar to "LSF"). However, I don't know how to verify the code. I found that dpdispatcher can be installed with pip (https://docs.deepmodeling.com/projects/dpdispatcher/en/latest/install.html). Can I compile dpdispatcher with make? Could you give me some guidance on what to do next?
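For readers following along, here is a minimal, standalone sketch of how a scheduler name such as "JH_UniScheduler" can be mapped to a class that builds the submission command. It does not import dpdispatcher: the `Machine` base class, the `registry` dictionary, `submit_command`, and `machine_for` are hypothetical stand-ins, and the assumption that `jsub` reads the job script from standard input (as LSF's `bsub` does) comes only from the "very similar to LSF" remark above.

```python
from abc import ABC, abstractmethod


class Machine(ABC):
    """Hypothetical stand-in for a scheduler base class (not dpdispatcher's own)."""

    # Maps a batch_type string to the class that handles it.
    registry = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register every subclass under its class name.
        Machine.registry[cls.__name__] = cls

    @abstractmethod
    def submit_command(self, script_name: str) -> str:
        """Return the shell command that submits script_name."""


class LSF(Machine):
    def submit_command(self, script_name: str) -> str:
        return f"bsub < {script_name}"


class JH_UniScheduler(Machine):
    # Assumption: jsub accepts the script on stdin, mirroring bsub.
    def submit_command(self, script_name: str) -> str:
        return f"jsub < {script_name}"


def machine_for(batch_type: str) -> Machine:
    """Instantiate the scheduler class registered under batch_type."""
    try:
        return Machine.registry[batch_type]()
    except KeyError:
        raise ValueError(f"unsupported batch_type: {batch_type!r}") from None
```

With this layout, supporting a new scheduler means adding one subclass; an unknown name such as "jsub" fails early with a clear error instead of at submission time.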

njzjz commented 3 months ago

pip can install from source. See its documentation: https://pip.pypa.io/en/stable/cli/pip_install/#examples
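After installing a modified source tree with pip, it can help to confirm that Python is importing the modified copy rather than an older installation. The sketch below uses only the standard library; the helper name `describe_install` is hypothetical, and the package name passed at the bottom is just the one discussed in this thread.

```python
import importlib.util
from importlib import metadata


def describe_install(package: str) -> dict:
    """Report where an importable package lives and its installed version."""
    spec = importlib.util.find_spec(package)
    if spec is None:
        return {"package": package, "found": False}
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (e.g. a stdlib module).
        version = None
    return {
        "package": package,
        "found": True,
        "origin": spec.origin,
        "version": version,
    }


if __name__ == "__main__":
    # The printed origin path should point at the copy you just installed.
    print(describe_install("dpdispatcher"))
```

If the reported path is not the one you expect, the environment is probably still picking up a previously installed copy.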

WangSimin123456 commented 3 months ago

Thanks for your help. I have tested the code by setting the batch_type to "JH_UniScheduler". Everything seems fine so far, and the output is as follows:

/public/software/deepmd-kit/lib/python3.11/site-packages/paramiko/transport.py:219: CryptographyDeprecationWarning: Blowfish has been deprecated and will be removed in a future release
  "class": algorithms.Blowfish,
INFO:dpgen:-------------------------iter.000000 task 01--------------------------
DeepModeling

Version: 0.12.0
Path: /public/software/deepmd-kit/lib/python3.11/site-packages/dpgen

Dependency

numpy      1.26.3     /public/software/deepmd-kit/lib/python3.11/site-packages/numpy
dpdata     0.2.17     /public/software/deepmd-kit/lib/python3.11/site-packages/dpdata
pymatgen   unknown version or path
monty      2024.2.2   /public/software/deepmd-kit/lib/python3.11/site-packages/monty
ase        3.22.1     /public/software/deepmd-kit/lib/python3.11/site-packages/ase
paramiko   2.8.1      /public/software/deepmd-kit/lib/python3.11/site-packages/paramiko
custodian  2024.3.12  /public/software/deepmd-kit/lib/python3.11/site-packages/custodian

Reference

Please cite: Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E, DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models, Computer Physics Communications, 2020, 107206.

Description

2024-05-21 14:25:50,813 - INFO : info:check_all_finished: False
2024-05-21 14:25:51,063 - INFO : job: c133268c3125c1f72aaccc47c498adf2ca35a0f6 submit; job_id is 51861
2024-05-21 14:31:24,702 - INFO : job: c133268c3125c1f72aaccc47c498adf2ca35a0f6 51861 terminated; fail_cout is 1; resubmitting job
2024-05-21 14:31:24,914 - INFO : job:c133268c3125c1f72aaccc47c498adf2ca35a0f6 re-submit after terminated; new job_id is 51862
2024-05-21 14:31:25,381 - INFO : job:c133268c3125c1f72aaccc47c498adf2ca35a0f6 job_id:51862 after re-submitting; the state now is <JobStatus.running: 3>
2024-05-21 14:32:25,832 - INFO : job: c133268c3125c1f72aaccc47c498adf2ca35a0f6 51862 terminated; fail_cout is 2; resubmitting job
2024-05-21 14:32:26,065 - INFO : job:c133268c3125c1f72aaccc47c498adf2ca35a0f6 re-submit after terminated; new job_id is 51863
2024-05-21 14:32:26,505 - INFO : job:c133268c3125c1f72aaccc47c498adf2ca35a0f6 job_id:51863 after re-submitting; the state now is <JobStatus.running: 3>
2024-05-21 14:33:57,176 - INFO : job: c133268c3125c1f72aaccc47c498adf2ca35a0f6 51863 finished
INFO:dpgen:-------------------------iter.000000 task 02--------------------------
INFO:dpgen:-------------------------iter.000000 task 03--------------------------
INFO:dpgen:-------------------------iter.000000 task 04--------------------------
2024-05-21 14:33:58,282 - INFO : info:check_all_finished: False
2024-05-21 14:33:58,537 - INFO : job: cc2b618c439bcf69933255d4f993ae3d35303387 submit; job_id is 51864
2024-05-21 14:33:58,812 - INFO : job: e1e51730b45d4b477fff7ac79395420ddb4d9765 submit; job_id is 51865
2024-05-21 14:34:30,519 - INFO : job: cc2b618c439bcf69933255d4f993ae3d35303387 51864 finished
2024-05-21 14:34:30,747 - INFO : job: e1e51730b45d4b477fff7ac79395420ddb4d9765 51865 finished
INFO:dpgen:-------------------------iter.000000 task 05--------------------------
INFO:dpgen:-------------------------iter.000000 task 06--------------------------
INFO:dpgen:system 000 candidate : 56 in 310 18.06 %
INFO:dpgen:system 000 failed : 252 in 310 81.29 %
INFO:dpgen:system 000 accurate : 2 in 310 0.65 %
INFO:dpgen:system 000 accurate_ratio: 0.0065 thresholds: 1.0000 and 1.0000 eff. task min and max -1 20 number of fp tasks: 20
INFO:dpgen:-------------------------iter.000000 task 07--------------------------
2024-05-21 14:34:31,253 - INFO : info:check_all_finished: False
2024-05-21 14:34:31,497 - INFO : job: edea12ed5e87f773b282a6bc5e1259037f6bc801 submit; job_id is 51866
2024-05-21 14:38:34,468 - INFO : job: edea12ed5e87f773b282a6bc5e1259037f6bc801 51866 finished
INFO:dpgen:-------------------------iter.000000 task 08--------------------------
INFO:dpgen:failed frame: 0 in 20 0.00 %
INFO:dpgen:failed tasks: 0 in 20 0.00 %
INFO:dpgen:=============================iter.000001==============================
INFO:dpgen:-------------------------iter.000001 task 00--------------------------
INFO:dpgen:-------------------------iter.000001 task 01--------------------------
2024-05-21 14:38:37,451 - INFO : info:check_all_finished: False
2024-05-21 14:38:37,685 - INFO : job: c133268c3125c1f72aaccc47c498adf2ca35a0f6 submit; job_id is 51867

It is still running, and I think I have modified the code successfully. Can I contribute the code to dpdispatcher?
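One way to double-check a run like this is to tally how often each job was terminated and resubmitted before it finished. In the log above, job c133268c... was terminated twice (job IDs 51861 and 51862) before finishing as 51863. The sketch below counts such events; the regular expressions are an assumption based solely on the log lines quoted in this thread, not on any documented log format, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns inferred from the quoted log lines, e.g.
# "... job: <40-hex hash> 51861 terminated; fail_cout is 1; resubmitting job"
TERMINATED = re.compile(r"job: ?(?P<hash>[0-9a-f]{40}) (?P<job_id>\d+) terminated")
FINISHED = re.compile(r"job: ?(?P<hash>[0-9a-f]{40}) (?P<job_id>\d+) finished")

# Three lines copied from the log quoted in this thread.
SAMPLE = """\
2024-05-21 14:31:24,702 - INFO : job: c133268c3125c1f72aaccc47c498adf2ca35a0f6 51861 terminated; fail_cout is 1; resubmitting job
2024-05-21 14:32:25,832 - INFO : job: c133268c3125c1f72aaccc47c498adf2ca35a0f6 51862 terminated; fail_cout is 2; resubmitting job
2024-05-21 14:33:57,176 - INFO : job: c133268c3125c1f72aaccc47c498adf2ca35a0f6 51863 finished
"""


def summarize(log_text: str) -> dict:
    """Count 'terminated' and 'finished' events per job hash."""
    terminated = Counter(m["hash"] for m in TERMINATED.finditer(log_text))
    finished = Counter(m["hash"] for m in FINISHED.finditer(log_text))
    return {"terminated": dict(terminated), "finished": dict(finished)}


if __name__ == "__main__":
    for kind, counts in summarize(SAMPLE).items():
        for job_hash, n in counts.items():
            print(f"{kind:>10}: {job_hash[:8]} x{n}")
```

A job that finishes only after repeated terminations may point to a status-parsing problem in the new scheduler code or to a genuine failure on the cluster, so it is worth inspecting the scheduler's own output for those job IDs.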