BAN-2 (closed 8 months ago)
I have just discovered that running jobs on our HPC through SLURM is not possible from a virtual environment.
How can I enable the SNB CLI commands in this case?
Hi!
Is the file run_relax_sh.txt the job script that you're using to submit the ShakeNBreak calculations? (i.e. the one present in the directory from which you call snb-run)
Looking at that file, it seems like it uses the vasp_ncl executable. ShakeNBreak uses Gamma-point-only calculations to perform structure searching efficiently, so we recommend using the vasp_gam executable (more efficient for Gamma-point-only calculations). Could you try again using vasp_gam instead of vasp_ncl?
Also, what job script did you use when running the calculation manually? From the errors in slurm-2500867_out.txt, it seems like the issue is not related to ShakeNBreak or your python virtual environment, but to the job script configuration. For instance, the error
ERROR: GCCcore/10.3.0 cannot be loaded due to a conflict.
HINT: Might try "module unload GCCcore" first.
(...)
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.
suggests that you might need to update your job script by adding module unload GCCcore and removing the memory specification line (e.g. #SBATCH --mem-per-cpu=3GB):
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --partition=short
source /usr/local/sbin/modules.sh
module unload GCCcore
module load VASP/5.4.4-intel2021a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR
srun -n 10 vasp_gam > log
It might be easier to first submit manually until you find a job script configuration that works, and then use the snb-run command to automatise the process.
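For example, one way to iterate on the configuration interactively (the partition name, time limit and module versions below are taken from your script and may need adapting to your cluster) is:

```shell
# Sketch: request an interactive allocation, then test the module setup
# and the VASP launch by hand before putting the same lines in a job script
salloc --nodes=1 --ntasks-per-node=10 --partition=short --time=01:00:00
module unload GCCcore
module load VASP/5.4.4-intel2021a
srun -n 10 vasp_gam > log
```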
Hope that helps!
Hi @BAN-2! Is your issue now fixed? If so, we'll close this issue as completed (and you can always reopen if needs be) 😄
"Could you try again using vasp_gam instead of vasp_ncl?"
Here is the list of VASP versions from our HPC: VASP/5.4.4-intel2021a, VASP/5.4.4-vtst198-intel2021a, VASP/6.2.0-intel2021a, VASP/6.4.0-intel2022a, VASP/6.4.2-intel2022a.
Could you explain how I can run the vasp_gam version?
After loading the VASP module available on your HPC, you just call vasp_gam rather than vasp_ncl in your job script. For example, adapting the run_relax_sh.txt that you originally attached to something like:
#!/bin/bash
#SBATCH --time=04:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=10
#SBATCH --partition=short
source /usr/local/sbin/modules.sh
module unload GCCcore
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR
srun -n 10 vasp_gam > log
where we load VASP/6.4.2-intel2022a (it is generally better to use newer versions of VASP) and call the vasp_gam executable. Note that in your script you're using 10 tasks per node. Typically you can set the number of tasks to the number of cores per node to increase parallelisation.
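If you're unsure how many cores each node has, the standard SLURM query commands can report it (the node name below is a placeholder):

```shell
sinfo -o "%P %N %c"                            # partition, node list, CPUs per node
scontrol show node <nodename> | grep CPUTot    # CPU count of a specific node
```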
I ran the script with the changes in interactive mode and got the following reply:
Unloading GCCcore/10.3.0
ERROR: GCCcore/10.3.0 cannot be unloaded due to a prereq.
HINT: Might try "module unload Python/3.9.5-GCCcore-10.3.0" first.
Loading GCCcore/11.3.0
ERROR: GCCcore/11.3.0 cannot be loaded due to a conflict.
HINT: Might try "module unload GCCcore" first.
Loading GCCcore/11.3.0
ERROR: GCCcore/11.3.0 cannot be loaded due to a conflict.
HINT: Might try "module unload GCCcore" first.
Loading intel/2022a
ERROR: Requirement 'GCCcore/11.3.0' is not loaded
Loading VASP/6.4.2-intel2022a
ERROR: Load of requirement 'GCCcore/11.3.0' failed
ERROR: Load of requirement 'intel/2022a' failed
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.
The text of the script is below:
#!/bin/bash
source /usr/local/sbin/modules.sh
module unload GCCcore
module load GCCcore/11.3.0
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR
srun -n 16 vasp_gam > log
I got the same message at the previous stage:
Unloading GCCcore/10.3.0
ERROR: GCCcore/10.3.0 cannot be unloaded due to a prereq.
HINT: Might try "module unload Python/3.9.5-GCCcore-10.3.0" first.
Loading GCCcore/11.3.0
ERROR: GCCcore/11.3.0 cannot be loaded due to a conflict.
HINT: Might try "module unload GCCcore" first.
Loading intel/2022a
ERROR: Requirement 'GCCcore/11.3.0' is not loaded
Loading VASP/6.4.2-intel2022a
ERROR: Load of requirement 'GCCcore/11.3.0' failed
ERROR: Load of requirement 'intel/2022a' failed
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.
The script below was run according to your advice:
#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=short
source /usr/local/sbin/modules.sh
module unload GCCcore
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR
srun -n 16 vasp_gam > log
That's an issue specific to your HPC: you need to load the correct modules before loading VASP. I'd suggest first finding the right combination of modules in an interactive session, and then copying the loading commands into the job script. From the errors you showed above, you could try:
module purge # clean environment
module load GCCcore/11.3.0
module load intel/2022a
module load VASP/6.4.2-intel2022a
module list # list loaded modules
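If those exact versions still conflict, you can trace what each module actually pulls in (these are standard Environment Modules commands, but their availability on your cluster is an assumption):

```shell
module purge                         # start from a clean environment
module avail VASP                    # list the VASP builds installed on the cluster
module show VASP/6.4.2-intel2022a    # print the modules and paths this build requires
```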
I changed the script:
#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=normal
#SBATCH --mem-per-cpu=3GB
source /usr/local/sbin/modules.sh
module purge # clean environment
module load GCCcore/11.3.0
module load intel/2022a
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR
srun -n 16 vasp_gam > log
And got:
VASP/6.4.2-intel2022a unload complete.
Loading intel/2022a
Loading requirement: zlib/1.2.12-GCCcore-11.3.0 binutils/2.38-GCCcore-11.3.0
intel-compilers/2022.1.0 numactl/2.0.14-GCCcore-11.3.0
UCX/1.12.1-GCCcore-11.3.0 impi/2021.6.0-intel-compilers-2022.1.0
imkl/2022.1.0 iimpi/2022a imkl-FFTW/2022.1.0-iimpi-2022a
VASP/6.4.2-intel2022a load complete.
Loading VASP/6.4.2-intel2022a
Loading requirement: Wannier90/3.1.0-intel-2022a Szip/2.1.1-GCCcore-11.3.0
HDF5/1.13.1-iimpi-2022a
srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.
That's strange, since in the script you only seem to be setting the memory per CPU. Can you try deleting that line to test:
#!/bin/bash
#SBATCH --time=72:00:00
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=4
#SBATCH --partition=normal
source /usr/local/sbin/modules.sh
module purge # clean environment
module load GCCcore/11.3.0
module load intel/2022a
module load VASP/6.4.2-intel2022a
export I_MPI_PMI_LIBRARY=/opt/slurm/current/lib64/libpmi.so
cd $SLURM_SUBMIT_DIR
srun -n 16 vasp_gam > log
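If the error persists even without the memory line, one possibility (an assumption, not something confirmed by your logs) is that a SLURM_MEM_PER_* environment variable is being inherited from the shell you submit from; a sketch of a workaround would be to clear them just before srun:

```shell
# Hypothetical workaround: clear any inherited SLURM memory settings
# so srun only sees the allocation's own memory specification
unset SLURM_MEM_PER_NODE SLURM_MEM_PER_GPU
srun -n 16 vasp_gam > log
```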
But the managers of your HPC cluster will probably be able to provide better help, since this is not a ShakeNBreak issue but rather an HPC one.
Thank you so much for your help. I now have more clarity about the operation of our server.
According to your recommendations, I will send the question to the HPC staff.
As soon as I get an answer from them, I will post the solution and close the discussion
The answer from the HPC staff: "not being able to run the srun command in interactive tasks."
In my case, the snb-run command sends jobs to the HPC, but the server doesn't execute them (they are assigned the PD or CF state), and after some time an error message appears in the defect directory (see attached).
At the same time, when I submit the job manually with the sbatch command, it runs fine.
Could you please tell me where I am using this command incorrectly?
snb-run1_log.txt slurm-2500867_out.txt run_relax_sh.txt
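For reference, the reason a job sits in the PD or CF state can usually be inspected with standard SLURM commands (the job ID below is a placeholder):

```shell
squeue -u $USER                           # list your jobs with their state (PD = pending, CF = configuring)
squeue -u $USER -o "%.10i %.9P %.8T %R"   # %R shows the scheduler's reason for that state
scontrol show job <jobid>                 # full job details, including the Reason field
```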