SMTG-Bham / ShakeNBreak

Defect structure-searching employing chemically-guided bond distortions
https://shakenbreak.readthedocs.io
MIT License

HPC rejects jobs submitted by snb-run #66

Closed: BAN-2 closed this issue 8 months ago

BAN-2 commented 8 months ago

In my case, the snb-run command submits the jobs to our HPC, but the server doesn't execute them (they sit in the PD or CF state), and after some time an error message appears in the defect directory (see attached).

At the same time, when I submit the job manually with the sbatch command, the calculation runs.

Could you please tell me where I am using this command incorrectly?

snb-run1_log.txt slurm-2500867_out.txt run_relax_sh.txt
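
For reference, a quick way to check why the submitted jobs stay pending is shown below (a sketch assuming standard SLURM tools; the job ID is taken from the attached slurm-2500867_out.txt and is only an example):

squeue -u $USER                    # list my jobs and their states (PD = pending, CF = configuring)
squeue -j 2500867 -o "%i %T %r"    # show the state and pending reason for a specific job
scontrol show job 2500867          # full job details (partition, memory, node constraints)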

BAN-2 commented 8 months ago

I have just discovered that running jobs on our HPC through SLURM is not possible from a virtual environment.

How can I enable SNB CLI commands in this case?

ireaml commented 8 months ago

Hi! Is the file run_relax_sh.txt the job script that you're using to submit the ShakeNBreak calculations (i.e. the one present in the directory from which you call snb-run)? Looking at that file, it seems like it uses the vasp_ncl executable. ShakeNBreak uses Gamma-point-only calculations to perform the structure searching efficiently, so we recommend using the vasp_gam executable (which is more efficient for Gamma-point-only calculations). Could you try again using vasp_gam instead of vasp_ncl?

Also, what job script did you use when running the calculation manually? From the errors in slurm-2500867_out.txt, it seems that the issue is not related to ShakeNBreak or your Python virtual environment, but to the job script configuration. For instance, the errors

ERROR: GCCcore/10.3.0 cannot be loaded due to a conflict.
    HINT: Might try "module unload GCCcore" first.
    (...)

srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

suggest that you might need to update your job script by adding module unload GCCcore and removing the memory specification line (e.g. #SBATCH --mem-per-cpu=3GB):

#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --partition=short

source /usr/local/sbin/modules.sh

module unload GCCcore
module load VASP/5.4.4-intel2021a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR

srun -n 10 vasp_gam > log

It might be easier to first submit manually until you find a job script configuration that works, and then use the snb-run command to automate the process.
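
For example, a rough sketch of that workflow (the folder and job-script names are placeholders, and the snb-run options should be checked with snb-run --help):

cd Bond_Distortion_-30.0%/    # one of the distortion folders created by ShakeNBreak (placeholder name)
sbatch ../job                 # submit the job script by hand and check that it runs
squeue -u $USER
cd ..
snb-run -v                    # once the script works, let snb-run automate the submissions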

Hope that helps!

kavanase commented 8 months ago

Hi @BAN-2! Is your issue now fixed? If so, we'll close this issue as completed (and you can always reopen if needs be) 😄

BAN-2 commented 8 months ago

"Could you try again using vasp_gam instead of vasp_ncl?"

Here is the list of VASP versions on our HPC: VASP/5.4.4-intel2021a, VASP/5.4.4-vtst198-intel2021a, VASP/6.2.0-intel2021a, VASP/6.4.0-intel2022a, VASP/6.4.2-intel2022a.

Could you explain how I can run the vasp_gam version?

ireaml commented 8 months ago

After loading the VASP module available on your HPC, you just call vasp_gam rather than vasp_ncl in your job script. For example, adapting the run_relax_sh.txt that you originally attached to something like:

#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --partition=short

source /usr/local/sbin/modules.sh

module unload GCCcore
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR

srun -n 10 vasp_gam > log

where we load the VASP/6.4.2-intel2022a module (it is generally better to use newer versions of VASP) and call the vasp_gam executable. Note that in your script you're using 10 tasks per node; typically you can set the number of tasks to the number of cores per node to increase parallelisation.
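
For example, assuming (purely for illustration) a node with 40 cores, the relevant lines would become:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40              # replace 40 with the actual cores per node on your cluster

srun -n ${SLURM_NTASKS} vasp_gam > log    # SLURM_NTASKS = nodes * ntasks-per-node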

BAN-2 commented 8 months ago

I ran the script with the changes in interactive mode and got the following reply:

Unloading GCCcore/10.3.0
  ERROR: GCCcore/10.3.0 cannot be unloaded due to a prereq.
    HINT: Might try "module unload Python/3.9.5-GCCcore-10.3.0" first.
Loading GCCcore/11.3.0
  ERROR: GCCcore/11.3.0 cannot be loaded due to a conflict.
    HINT: Might try "module unload GCCcore" first.
Loading GCCcore/11.3.0
  ERROR: GCCcore/11.3.0 cannot be loaded due to a conflict.
    HINT: Might try "module unload GCCcore" first.

Loading intel/2022a
  ERROR: Requirement 'GCCcore/11.3.0' is not loaded

Loading VASP/6.4.2-intel2022a
  ERROR: Load of requirement 'GCCcore/11.3.0' failed
  ERROR: Load of requirement 'intel/2022a' failed
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

The text of the script is below:

#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=short

source /usr/local/sbin/modules.sh

module unload GCCcore
module load GCCcore/11.3.0
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR

srun -n 16 vasp_gam > log

BAN-2 commented 8 months ago

I got the same message as at the previous stage:

Unloading GCCcore/10.3.0
  ERROR: GCCcore/10.3.0 cannot be unloaded due to a prereq.
    HINT: Might try "module unload Python/3.9.5-GCCcore-10.3.0" first.
Loading GCCcore/11.3.0
  ERROR: GCCcore/11.3.0 cannot be loaded due to a conflict.
    HINT: Might try "module unload GCCcore" first.

Loading intel/2022a
  ERROR: Requirement 'GCCcore/11.3.0' is not loaded

Loading VASP/6.4.2-intel2022a
  ERROR: Load of requirement 'GCCcore/11.3.0' failed
  ERROR: Load of requirement 'intel/2022a' failed
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

It was produced by the following script, which follows your advice:

#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=short

source /usr/local/sbin/modules.sh

module unload GCCcore
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR

srun -n 16 vasp_gam > log

ireaml commented 8 months ago

That's an issue specific to your HPC: you need to load the correct modules before loading VASP. I'd suggest first finding the right combination of modules in an interactive session, and then copying the loading commands into the job script. From the errors you showed above, you could try:

module purge # clean environment
module load GCCcore/11.3.0
module load intel/2022a
module load VASP/6.4.2-intel2022a
module list  # list loaded modules
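
If useful, a rough sketch of doing that test in an interactive session is below (the salloc options are only an example; adapt them to your cluster):

salloc --nodes=1 --ntasks=4 --time=00:30:00 --partition=short   # request a short interactive allocation
source /usr/local/sbin/modules.sh
module purge
module load GCCcore/11.3.0 intel/2022a VASP/6.4.2-intel2022a
module list
which vasp_gam    # confirm the executable is found before copying these lines into the job script
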
BAN-2 commented 8 months ago

I changed the script:

#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=normal
#SBATCH --mem-per-cpu=3GB

source /usr/local/sbin/modules.sh

module purge # clean environment
module load GCCcore/11.3.0
module load intel/2022a
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR

srun -n 16 vasp_gam > log

And got

 VASP/6.4.2-intel2022a unload complete.
Loading intel/2022a
  Loading requirement: zlib/1.2.12-GCCcore-11.3.0 binutils/2.38-GCCcore-11.3.0
    intel-compilers/2022.1.0 numactl/2.0.14-GCCcore-11.3.0
    UCX/1.12.1-GCCcore-11.3.0 impi/2021.6.0-intel-compilers-2022.1.0
    imkl/2022.1.0 iimpi/2022a imkl-FFTW/2022.1.0-iimpi-2022a
 VASP/6.4.2-intel2022a load complete.

Loading VASP/6.4.2-intel2022a
  Loading requirement: Wannier90/3.1.0-intel-2022a Szip/2.1.1-GCCcore-11.3.0
    HDF5/1.13.1-iimpi-2022a
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

ireaml commented 8 months ago

That's strange, since in the script it seems like you're only setting the memory per CPU. You can try deleting that line to test:

#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=normal

source /usr/local/sbin/modules.sh

module purge # clean environment
module load GCCcore/11.3.0
module load intel/2022a
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR

srun -n 16 vasp_gam > log
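
If the error still appears without any memory line in the script, one thing worth checking (a guess, not a confirmed diagnosis) is whether conflicting SLURM_MEM_* variables are being inherited from the environment that srun is launched from, e.g. when testing inside an interactive job:

env | grep SLURM_MEM                           # see which SLURM_MEM_PER_* variables are already set
unset SLURM_MEM_PER_NODE SLURM_MEM_PER_GPU     # if more than one is set, clear the extras before calling srun
srun -n 16 vasp_gam > log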

But the managers of your HPC cluster will probably be able to provide better help, since this is not a ShakeNBreak-related issue but rather an HPC-related one.

BAN-2 commented 8 months ago

Thank you so much for your help. I now have more clarity about the operation of our server.

As you recommended, I will send the question to the HPC staff.

As soon as I get an answer from them, I will post the solution and close the discussion.

BAN-2 commented 8 months ago

The answer from the HPC staff: the problem was "not being able to run the srun command in interactive tasks."
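
Interpreting that reply (an assumption, not part of the staff's message): on this cluster, scripts containing srun apparently have to be submitted as batch jobs rather than executed inside an interactive task, e.g.

sbatch run_relax.sh    # submit the job script from a login node; the srun line then executes on the allocated nodes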