franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data
https://franciscozorrilla.github.io/metaGEM/
MIT License

How can I check jobs are being submitted to our cluster #58

Closed slambrechts closed 3 years ago

slambrechts commented 3 years ago

Hi,

I'm also trying to run metaGEM on our HPC, and I was wondering how I can check that jobs are actually being submitted or running. I ran bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3 in a tmux window, and now it seems to be stuck at:

nohup: appending output to 'nohup.out'

Is this normal?

Our HPC recently moved from Torque PBS to Slurm for the resource management and job scheduling software, and they rewrote the job wrappers and everything. But normal procedure on the HPC is still to submit jobs using qsub and then a job script you wrote, such as:

qsub metagem.pbs

Any idea whether metaGEM would work in cluster mode on our HPC?

I also tried running it using --local like I do on our local machine, but that doesn't work on the HPC because metaGEM is interactive (the y/n questions).

franciscozorrilla commented 3 years ago

Hey Sam,

It is normal to see that message after submitting jobs with the metaGEM.sh script, I believe you can simply press enter or control+C or command+C to exit out of that message. You need to look at the nohup.out file (e.g. less nohup.out) to see if the jobs are being submitted, and you will also be able to see if they are failing in this file (you will see error messages letting you know if certain jobs have failed). Additionally, you can look inside the logs/ folder, where individual log files for each job will be generated when the jobs start (e.g. ll logs/ assuming you are in the metaGEM folder).

If your nohup.out file is empty then I would suspect that your jobs are not being properly submitted. If you now have a SLURM cluster then you can check the status of any active/pending jobs using the squeue command. See some examples here or here.
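
For example, a handful of commands along these lines (run from the metaGEM folder) should tell you whether anything is happening:

# follow the submission output as it is written
tail -f nohup.out

# list per-job log files, newest first (these appear once jobs start)
ls -lht logs/

# on a SLURM cluster, show your active and pending jobs
squeue -u $USER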

But normal procedure on the HPC is still to submit jobs using qsub and then a job script you wrote, such as:

qsub metagem.pbs

Not sure what the metagem.pbs file refers to, I do not think I wrote any such file. Perhaps it is your equivalent of the cluster_config.json file?

I also tried running it using --local like I do on our local machine, but that doesn't work on the HPC because metaGEM is interactive (the y/n questions).

A few thoughts regarding the comment above:

  1. You should most probably not be using the --local flag when running on the cluster, as this would result in your jobs running on a login node instead of being submitted to a production node. This is undesirable because the login nodes generally have limited resources, as they are not meant for running jobs, and you will probably incur the wrath of your cluster admins if you do this.
  2. Having said that, you should have no problem running metaGEM with the --local flag on the cluster. The y/n interactive questions are only in the env_setup.sh script, which you can run on the login node e.g. bash env_setup.sh.

Hope this helps and let me know if you have any other problems!

Best, Francisco

slambrechts commented 3 years ago

Hi Francisco,

Thank you for your answers. The nohup.out file only contains

Error parsing number of cores (--cores, -c): must be integer, empty, or 'all'.

and the logs folder appears to be empty.

The metagem.pbs file is a batch job script that I wrote; it contains the commands that need to be executed on the compute node, together with the compute resource requirements. For each batch job we want to submit we need to write such a script. It looks like this:

#!/bin/bash
#PBS -N metagem
#PBS -o /user/gent/423/vsc42339/metagem.out
#PBS -e /user/gent/423/vsc42339/metagem.err
#PBS -m abe
#PBS -l nodes=1:ppn=48,walltime=71:59:59

cd $VSC_SCRATCH_VO/vsc42339/MICROBIAN/metaGEM

conda activate $VSC_SCRATCH_VO/vsc42339/MICROBIAN/metaGEM/envs/metagem

bash metaGEM.sh -t megahit -c 48 --local

but that is from when I tried running metaGEM in local mode with a batch job script on the cluster, which didn't work because metaGEM.sh (or Snakemake?) asks a few y/n questions at the beginning (do you want to continue with these config.yaml settings, etc.).

I found a workaround by submitting an interactive job, which gives me an interactive session on one of the compute nodes and lets me answer the y/n questions.

I would still like to be able to run metaGEM in cluster mode, though, but I'm not sure why that didn't work.

franciscozorrilla commented 3 years ago

I think I see the issue here Sam: it looks like you are trying to submit jobs that would themselves try to submit jobs! You need to run metaGEM.sh only once on the login node (e.g. bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3), and this command itself does the job submitting, e.g. lines 424 and 443:

https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/metaGEM.sh#L367-L450

What you are doing now is submitting batches of metaGEM.sh, each of which would submit its own batch of jobs, because of the way the script is written. This also explains why you were running into issues with those y/n interactive questions.

If you are on a non-slurm cluster then I suspect you would have to modify those lines (424,443) and the cluster_config.json file. However, you mentioned that you are now on a SLURM cluster, so maybe you can try without any modifications?
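
To make the distinction concrete, here is a rough sketch of the two patterns, using the fastp call from your earlier message:

# intended usage: run the wrapper once, directly on the login node,
# and let it submit the individual Snakemake jobs to the scheduler
bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3

# to be avoided: wrapping metaGEM.sh in its own batch job, which asks a
# compute node to submit further jobs (and trips over the y/n prompts)
# qsub metagem.pbs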

slambrechts commented 3 years ago

I think my previous post was a bit confusing, since I tried to explain both what I did to run metaGEM in cluster mode and, when that didn't work, how I then tried to run it in local mode using a batch job script. For cluster mode I did not use such a job script; I did the following:

1) Open a tmux window on the login node
2) Activate the metagem conda env
3) bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3
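
Concretely, something like this (the tmux session name is arbitrary):

# 1) open a persistent tmux session on the login node
tmux new -s metagem

# 2) activate the metagem conda environment (path specific to our setup)
conda activate $VSC_SCRATCH_VO/vsc42339/MICROBIAN/metaGEM/envs/metagem

# 3) launch metaGEM, which should submit the jobs itself
bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3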

So I don't think I tried to submit jobs that would themselves try to submit jobs, but let me know if that is still the case

franciscozorrilla commented 3 years ago

Regarding cluster mode

I see, indeed I confused your two separate attempts. I am curious though, why do you need to open a tmux window on the login node? I am not familiar with tmux, so I do not know if this may be part of the problem or not.

Error parsing number of cores (--cores, -c): must be integer, empty, or 'all'.

Based on this error message, it appears that the number of cores is not being recognized or passed to Snakemake. Have you modified your metaGEM.sh script in any way to tailor the job submission to your cluster (in particular lines 424 or 443)? If so, can you share that with me and/or verify that you are providing the number of cores? Could you also try running an original copy of the metaGEM.sh script, since you said that your cluster is now SLURM based?
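
If it helps, a quick way to verify that your copy matches the repository version (assuming the login node has internet access) would be something like:

# compare the local script against the copy at the commit linked above;
# no output means the script is unmodified
curl -s https://raw.githubusercontent.com/franciscozorrilla/metaGEM/d81186a0700f974b4f57db587b71b960a951db83/metaGEM.sh | diff - metaGEM.sh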

Regarding local mode

If you really wanted to use the --local flag on the cluster you could modify the following lines in the metaGEM.sh script: https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/metaGEM.sh#L339-L364

You can comment out or simply remove the unnecessary y/n checks so that your submitLocal() function looks like this:

 submitLocal() { 

     # Parse Snakefile rule all (line 22 of Snakefile) input to match output of desired target rule stored in "$string". Note: Hardcoded line number. 
     echo "Parsing Snakefile to target rule: $task ... " 
     sed  -i "22s~^.*$~        $string~" Snakefile 

     echo "Unlocking snakemake ... " 
     snakemake --unlock -j 1 

     echo -e "\nDry-running snakemake jobs ... " 
     snakemake all -n 

     snakemake all -j 1 -k
     }

Now you should be able to run your script as shown below, although the -c 48 flag is not used/necessary when the --local flag is enabled. I would, however, monitor the job to make sure that it is using the number of cores it is assigned in the config.yaml file. I would also be very careful when submitting multiple of these "local" cluster jobs, since jobs may start multiple times for the same samples.

#!/bin/bash
#PBS -N metagem
#PBS -o /user/gent/423/vsc42339/metagem.out
#PBS -e /user/gent/423/vsc42339/metagem.err
#PBS -m abe
#PBS -l nodes=1:ppn=48,walltime=71:59:59

cd $VSC_SCRATCH_VO/vsc42339/MICROBIAN/metaGEM

conda activate $VSC_SCRATCH_VO/vsc42339/MICROBIAN/metaGEM/envs/metagem

bash metaGEM.sh -t megahit -c 48 --local

slambrechts commented 3 years ago

Thank you Francisco! I did not modify the metaGEM.sh script, so I'm not sure why the number of cores is not being recognized or passed to Snakemake.

I use tmux because on our university's HPC you are constantly kicked out of the login nodes and have to log in again; if I'm not mistaken, that is standard procedure after a fixed amount of time or inactivity. So if we want something to keep running, we need to use something like a tmux window. Maybe tmux is not necessary if all jobs are submitted at once when running bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3, because once they are in the queue it would be fine.
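
For reference, my tmux workflow is roughly:

# start a named session on the login node and run metaGEM.sh inside it
tmux new -s metagem

# detach with Ctrl-b d; the session keeps running if the SSH connection drops,
# and can be reattached later with:
tmux attach -t metagem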

How are jobs being submitted behind the scenes in cluster mode? Using qsub?

I'm not entirely sure our HPC is SLURM based. What I do know atm is this:

When a job starts or ends I get an e-mail containing:

Slurm Job_id=50322561 Name=metagem Began, Queued time 00:00:00

And recently they told us this:

we have switched to new job command wrappers (qsub, qstat, qdel, etc.) on all HPC-UGent Tier-2 clusters.

As some of you are no doubt aware, we moved from Torque PBS [2] to Slurm [3] for the resource management and job scheduling software a while ago.

To make this switch as transparent as possible, wrapper commands were put in place to limit impact as much as possible: the same job commands can still be used, and no changes were required to job scripts.

franciscozorrilla commented 3 years ago

Hi Sam, please see lines 424 and 443 of the metaGEM.sh script, also shown in this post https://github.com/franciscozorrilla/metaGEM/issues/58#issuecomment-867537647, to understand how jobs are submitted. Just FYI, metaGEM.sh launches jobs with line 424 when the -m parameter is not provided in the bash metaGEM.sh call. When the -m flag is provided, jobs will be launched with line 443.

For example, in line 424, the following code is used to submit jobs:

nohup snakemake all -j $njobs -k --cluster-config cluster_config.json -c 'sbatch -A {cluster.account} -t {cluster.time} -n {cluster.n} --ntasks {cluster.tasks} --cpus-per-task {cluster.n} --output {cluster.output}' &

As you can see, it uses values from the cluster_config.json, which should look like this:

https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/cluster_config.json#L1-L11

More specifically, -n {cluster.n} is used to specify the number of cores. I suspect that you may be missing this parameter in your cluster_config.json file?
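
To illustrate with placeholder values (not your actual settings), the command Snakemake ends up executing for a single job, after filling in the cluster_config.json fields, would look something like:

# placeholder example of the resolved sbatch call for one Snakemake job;
# the jobscript path at the end is generated by Snakemake automatically
sbatch -A my_account -t 0-4:00:00 -n 4 --ntasks 1 --cpus-per-task 4 --output logs/sample.node.fastp.out.log .snakemake/tmp.abc123/snakejob.fastp.0.sh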

slambrechts commented 3 years ago

Hi francisco,

Ok, thank you for the information. Currently, my cluster_config.json file looks like this:

{
"__default__" : {
        "account" : "vsc42339",
        "time" : "0-4:00:00",
        "n" : 4,
        "tasks" : 1,
        "mem" : "20G",
        "name"      : "DL.{rule}",
        "output"    : "logs/{wildcards}.%N.{rule}.out.log"
        }
}

I will try cluster mode again later and report back, for now I am running metaGEM in local mode

franciscozorrilla commented 3 years ago

closing due to inactivity, please reopen if issues arise

kunaljaani commented 1 year ago

Hi Francisco,

I have a question regarding job submission on the cluster. Our university cluster uses Torque/Moab as a scheduler. Is it possible to modify the scheduler in the pipeline?

Kindly let me know. Thank you. Kunal

franciscozorrilla commented 1 year ago

Hi Kunal,

Indeed it is possible; the pipeline was designed in Snakemake so that it could be deployed on any cluster. Have a look at this tutorial to see how you can submit jobs to a Torque/Moab cluster using Snakemake in particular. Hope this helps!
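
As a very rough sketch of the idea (not tested on a Torque/Moab system, and the exact qsub flags will depend on your site), the sbatch template in the Snakemake call could be swapped for a qsub one along these lines:

# hypothetical qsub-based submission template for a Torque/Moab cluster;
# note that the walltime/resource syntax may need adjusting for your site
nohup snakemake all -j $njobs -k --cluster-config cluster_config.json --cluster 'qsub -A {cluster.account} -l nodes=1:ppn={cluster.n},walltime={cluster.time} -N {cluster.name} -o {cluster.output} -j oe' &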

Best, Francisco

kunaljaani commented 1 year ago

Thank you for your reply and the details. I will try it out.

Best, Kunal