Wang-Lin-boop / AutoMD

Easy to get started with molecular dynamics simulation.
GNU General Public License v3.0
38 stars 5 forks

Slurm operations #2

Open nikishe opened 1 month ago

nikishe commented 1 month ago

Hey AutoMD team, we have it working now. It runs through stages 1-10. From stage 3 to 10 it uses a GPU, so it creates a Slurm job for every stage. Is there a way to tell it to run all 7 GPU stages in one Slurm job? I am getting penalised by the scheduler's fair-use policy, and a lot of time is lost waiting for resources. Any advice?

nikishe commented 1 month ago

AutoMD -i "desmond_setup_2-out.cms" -S OUC -t 100 -H "cpu" -G "gpu"

Wang-Lin-boop commented 1 month ago

I think you could try submitting a Slurm job to the compute node to run AutoMD with localhost.

nikishe commented 1 month ago

" to run AutoMD with localhost."

Sorry, I find this a bit vague (probably due to my own understanding). Can you show me an example? I am reading it as submitting a job that kicks off a process outside the scheduler.

Wang-Lin-boop commented 1 month ago

You just need to run a local AutoMD job in your slurm script.

AutoMD -i "desmond_setup_2-out.cms" -S OUC -t 100 -H "localhost" -G "localhost"

Meanwhile, don't forget to modify your hosts file to add the GPUs to the localhost entry. For example:

name: localhost
gpgpu: 0, Tesla V100
gpgpu: 1, Tesla V100
gpgpu: 2, Tesla V100
gpgpu: 3, Tesla V100
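
For reference, a complete localhost entry with GPUs added might look like the sketch below; the Schrodinger installation path and GPU model are placeholders, so adjust them to your compute node (the keys are the same ones used elsewhere in this thread):

name:        localhost
schrodinger: /opt/schrodinger2023-4
tmpdir:      /tmp
gpgpu:       0, Tesla V100
gpgpu:       1, Tesla V100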

Wang-Lin-boop commented 1 month ago

You can refer to the previous issue: [Installation](https://github.com/Wang-Lin-boop/AutoMD/issues/1#issuecomment-1983836184)

nikishe commented 1 month ago

Won't this run the job outside the scheduler? I will give it a go, but I worry this will run outside the scheduler, risking other people's jobs on that node.

Wang-Lin-boop commented 1 month ago

No, you need to submit this slurm job to the scheduler via sbatch.

Wang-Lin-boop commented 1 month ago

Have you submitted a Slurm job script before? It seems like you don't use sbatch to submit jobs very often. Please refer to the Slurm documentation.

Wang-Lin-boop commented 1 month ago

The Slurm job script looks like:

cat<<EOF > AutoMD.slurm
#!/bin/bash
#SBATCH --job-name=AutoMD
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu
#SBATCH --ntasks=6
#SBATCH --time=120:00:00
#SBATCH --output=AutoMD.out
#SBATCH --error=AutoMD.out
AutoMD -i "desmond_setup_2-out.cms" -S OUC -t 100 -H "localhost" -G "localhost"
EOF
sbatch --gpus=1 AutoMD.slurm
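
For completeness (these are standard Slurm/shell commands, nothing AutoMD-specific): once sbatch accepts the script, you can confirm the job is running under the scheduler and follow its progress:

squeue -u $USER      # the AutoMD job should appear here, i.e. inside the scheduler
tail -f AutoMD.out   # all stages write to the output file defined in the script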

nikishe commented 1 month ago

Thank you for your suggestions, it's all working now. The issue was:

-H "localhost" -G "localhost"

I was under the impression -H and -G had to point to something that creates a job on Slurm. In my case they were pointing to:

# 1 hour wall time, 40 tasks with default 1 cpu/task
name:        batch-small
host:        localhost
schrodinger: ${SCHRODINGER} 
queue:       SLURM2.1
qargs:       --export=ALL --cpus-per-task=1  --mem-per-cpu=10GB --time=00:20:00  --partition=gpu-h100 --qos=gpu --gres=gpu:h100:1
tmpdir:      /tmp

# 1 hour wall time, 40 tasks with default 1 cpu/task
name:        batch-a100
host:        localhost
schrodinger: ${SCHRODINGER}
queue:       SLURM2.1
qargs:       --export=ALL --cpus-per-task=1  --mem-per-cpu=10GB --time=00:20:00  --partition=gpu --qos=gpu --gres=gpu:1 
tmpdir:      /tmp

This, and the use of both interactive and batch jobs, led to my issues. I think a few scenarios in the README might help future newbies. I am happy to open a pull request if you think it would be a good idea.

Wang-Lin-boop commented 1 month ago

The localhost here means localhost on the compute node, not the login node. This is equivalent to using the queue information in the hosts file to have Desmond submit each stage to the compute node individually. Both are allowed by the scheduler.
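
To summarise the two submission modes discussed in this thread (a rough sketch; batch-a100 is simply the queue name from the hosts file quoted above):

# Mode 1: -H/-G point at a queue entry such as batch-a100, so Desmond submits
# each GPU stage to Slurm as its own job (what caused the fair-use penalties).
AutoMD -i "desmond_setup_2-out.cms" -S OUC -t 100 -H "batch-a100" -G "batch-a100"

# Mode 2: wrap a single AutoMD run in one sbatch script and point -H/-G at
# localhost, so every stage runs inside that one allocation on the compute node.
sbatch AutoMD.slurm   # the script above calls AutoMD ... -H "localhost" -G "localhost"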

Wang-Lin-boop commented 1 month ago

Glad you solved the problem. It's a good idea to write up some of the issues that newbies might run into.