Closed slambrechts closed 3 years ago
Hey Sam,

It is normal to see that message after submitting jobs with the `metaGEM.sh` script; I believe you can simply press Enter or Ctrl+C (Cmd+C on macOS) to exit out of that message. You need to look at the `nohup.out` file (e.g. `less nohup.out`) to see if the jobs are being submitted, and you will also be able to see in this file if they are failing (you will see error messages letting you know if certain jobs have failed). Additionally, you can look inside the `logs/` folder, where individual log files for each job will be generated when the jobs start (e.g. `ll logs/`, assuming you are in the `metaGEM` folder).
If your `nohup.out` file is empty, then I would suspect that your jobs are not being properly submitted. If you now have a SLURM cluster, then you can check the status of any active/pending jobs using the `squeue` command. See some examples here or here.
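To make those checks concrete, here is a small status-check sketch (an illustration, not part of metaGEM itself; it assumes you run it from the `metaGEM` folder on a SLURM cluster, and it guards the `squeue` call in case your site only exposes the qsub wrappers):

```shell
# Check whether metaGEM.sh actually submitted anything.
if [ -s nohup.out ]; then
    status="has-output"
    tail -n 20 nohup.out                      # last lines of Snakemake's log
else
    status="no-output"
    echo "nohup.out is empty or missing - jobs were likely not submitted"
fi

ls -lt logs/ 2>/dev/null | head -n 5          # newest per-job log files, if any

# Show active/pending jobs if squeue exists (SLURM); on qsub-wrapped
# clusters, 'qstat -u $USER' is the rough equivalent.
command -v squeue >/dev/null 2>&1 && squeue -u "$USER" || true
```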
> But normal procedure on the HPC is still to submit jobs using qsub and then a job script you wrote, such as: `qsub metagem.pbs`
Not sure what the `metagem.pbs` file refers to; I do not think I wrote any such file. Perhaps it is your equivalent of the `cluster_config.json` file?
> I also tried running it using `--local` like I do on our local machine, but that doesn't work on the hpc because metagem is interactive (the y/n questions)
A few thoughts regarding the comment above:

- Do not use the `--local` flag when running on the cluster, as this would result in your jobs running on a login node instead of submitting them to a production node. This is undesirable because the login nodes generally have few resources, as they are not meant for running jobs, and you will probably incur the wrath of your cluster admins if you do this.
- You should not need to run `metaGEM` with the `--local` flag on the cluster. The y/n interactive questions are only in the `env_setup.sh` script, which you can run on the login node, e.g. `bash env_setup.sh`.

Hope this helps and let me know if you have any other problems!
Best, Francisco
Hi Francisco,
Thank you for your answers. The `nohup.out` file only contains

```
Error parsing number of cores (--cores, -c): must be integer, empty, or 'all'.
```

and the `logs` folder appears to be empty.
The `metagem.pbs` file is a batch job script that I wrote; it contains the commands that need to be executed on the compute node, and the compute resource requirements. For each batch job we want to submit, we need to write such a script. It looks like this:

```shell
#!/bin/bash
#PBS -N metagem
#PBS -o /user/gent/423/vsc42339/metagem.out
#PBS -e /user/gent/423/vsc42339/metagem.err
#PBS -m abe
#PBS -l nodes=1:ppn=48,walltime=71:59:59

cd $VSC_SCRATCH_VO/vsc42339/MICROBIAN/metaGEM
conda activate $VSC_SCRATCH_VO/vsc42339/MICROBIAN/metaGEM/envs/metagem
bash metaGEM.sh -t megahit -c 48 --local
```
but that is from when I tried running metaGEM in local mode with a batch job script on the cluster. That didn't work, because metaGEM.sh (or Snakemake?) asks a few y/n questions in the beginning (do you want to continue with these config.yaml settings, etc.).

I found a workaround by submitting an interactive job, which gives me an interactive session on one of the compute nodes and lets me answer the y/n questions.

But I would still like to be able to run metaGEM in cluster mode; I am not sure why that didn't work.
I think I see the issue here, Sam: it looks like you are trying to submit jobs that would themselves try to submit jobs!

You need to run `metaGEM.sh` only once on the login node (e.g. `bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3`), and this command itself does the job submitting, e.g. lines 424 and 443:

What you are doing now is submitting batches of `metaGEM.sh`, each of which would submit its own batch of jobs, because of the way the script is written. This also explains why you were running into issues with those y/n interactive questions.

If you are on a non-SLURM cluster, then I suspect you would have to modify those lines (424, 443) and the `cluster_config.json` file. However, you mentioned that you are now on a SLURM cluster, so maybe you can try without any modifications?
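For a non-SLURM (Torque/PBS) cluster, the `sbatch` call in those lines could be swapped for a `qsub` equivalent along these lines. This is only a sketch: the exact resource flags are assumptions that should be checked against your site's documentation, and `CLUSTER_CMD` is just an illustrative variable name.

```shell
# Hypothetical Torque/PBS submission string to use in place of the sbatch
# command in metaGEM.sh; Snakemake fills in the {cluster.*} fields from
# cluster_config.json at submission time.
CLUSTER_CMD='qsub -A {cluster.account} -l walltime={cluster.time},nodes=1:ppn={cluster.n} -N {cluster.name} -o {cluster.output} -j oe'

# The submission line in metaGEM.sh would then become:
# nohup snakemake all -j $njobs -k --cluster-config cluster_config.json -c "$CLUSTER_CMD" &
echo "$CLUSTER_CMD"
```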
I think my previous post was a bit confusing, since it tried to explain both what I did for running metaGEM in cluster mode and, after that didn't work, how I tried to run it in local mode using a batch job script. For metaGEM in cluster mode I did not use such a job script; I did the following:

1. Open a tmux window on the login node
2. Activate the metagem conda env
3. `bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3`

So I don't think I tried to submit jobs that would themselves try to submit jobs, but let me know if that is still the case.
I see, indeed I confused your two separate attempts. I am curious though, why do you need to open a tmux window on the login node? I am not familiar with tmux, so I do not know if this may be part of the problem or not.
> Error parsing number of cores (--cores, -c): must be integer, empty, or 'all'.
Based on this error message, it appears that the number of cores is not being recognized or provided to Snakemake. Have you modified your `metaGEM.sh` script in any way to tailor the job submission to your cluster (in particular lines 424 or 443)? If yes, can you share that with me and/or verify that you are providing the number of cores? Also, could you try running an original copy of the `metaGEM.sh` script, since you said that your cluster is now SLURM based?
If you really wanted to use the `--local` flag on the cluster, you could modify the following lines in the `metaGEM.sh` script:

https://github.com/franciscozorrilla/metaGEM/blob/d81186a0700f974b4f57db587b71b960a951db83/metaGEM.sh#L339-L364

You can comment out or simply remove the unnecessary y/n checks so that your `submitLocal()` function looks like this:
```shell
submitLocal() {
    # Parse Snakefile rule all (line 22 of Snakefile) input to match output of
    # desired target rule stored in "$string". Note: hardcoded line number.
    echo "Parsing Snakefile to target rule: $task ... "
    sed -i "22s~^.*$~ $string~" Snakefile
    echo "Unlocking snakemake ... "
    snakemake --unlock -j 1
    echo -e "\nDry-running snakemake jobs ... "
    snakemake all -n
    snakemake all -j 1 -k
}
```
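As an aside, the `sed` line in that function rewrites line 22 of the Snakefile in place, pointing rule `all` at the chosen target. A minimal sketch of the same pattern, run against a throwaway file (`Snakefile.demo` and the target strings are made up for illustration):

```shell
# Demonstrate the in-place line replacement used by submitLocal():
# overwrite line 2 of a toy file with a new target string.
printf 'rule all:\n    input: OLD_TARGET\n' > Snakefile.demo
string='    input: NEW_TARGET'
sed -i "2s~^.*$~$string~" Snakefile.demo
cat Snakefile.demo
```

Note the `~` delimiters: they avoid clashing with the `/` characters that file paths in `$string` would contain.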
Now you should be able to run your script as shown below, although the `-c 48` flag is not used/necessary when the `--local` flag is enabled. I would, however, monitor the job to make sure that it is using the number of cores assigned to it in the `config.yaml` file. I would also be very careful when submitting multiple of these "local" cluster jobs, since jobs may start multiple times for the same samples.

```shell
#!/bin/bash
#PBS -N metagem
#PBS -o /user/gent/423/vsc42339/metagem.out
#PBS -e /user/gent/423/vsc42339/metagem.err
#PBS -m abe
#PBS -l nodes=1:ppn=48,walltime=71:59:59

cd $VSC_SCRATCH_VO/vsc42339/MICROBIAN/metaGEM
conda activate $VSC_SCRATCH_VO/vsc42339/MICROBIAN/metaGEM/envs/metagem
bash metaGEM.sh -t megahit -c 48 --local
```
Thank you Francisco! I did not modify the metaGEM.sh script, so I am not sure why the number of cores is not being recognized or provided to Snakemake.

I use tmux because on our university's HPC you are constantly kicked off the login nodes and have to log in again; that is standard procedure, if I'm not mistaken, after a fixed amount of time or inactivity. So if we want something to keep running, we need to use something like a tmux window. Maybe tmux is not necessary if all jobs are submitted at once when running `bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3`, because once they are in the queue it would be fine.
How are jobs being submitted behind the scenes in cluster mode? Using qsub?
I'm not entirely sure our HPC is SLURM based. What I do know at the moment is this:

When a job starts or ends I get an e-mail containing:

> Slurm Job_id=50322561 Name=metagem Began, Queued time 00:00:00

And recently they told us this:

> we have switched to new job command wrappers (qsub, qstat, qdel, etc.) on all HPC-UGent Tier-2 clusters.
>
> As some of you are no doubt aware, we moved from Torque PBS to Slurm for the resource management and job scheduling software a while ago.
>
> To make this switch as transparent as possible, wrapper commands were put in place to limit impact as much as possible: the same job commands can still be used, and no changes were required to job scripts.
Hi Sam, please see lines 424 and 443 of the `metaGEM.sh` script, also shown in this post https://github.com/franciscozorrilla/metaGEM/issues/58#issuecomment-867537647, to understand how jobs are submitted. Just FYI, `metaGEM.sh` launches jobs with line 424 when the `-m` parameter is not provided in the `bash metaGEM.sh` call. When the `-m` flag is provided, jobs will be launched with line 443.
For example, in line 424, the following code is used to submit jobs:

```shell
nohup snakemake all -j $njobs -k --cluster-config cluster_config.json -c 'sbatch -A {cluster.account} -t {cluster.time} -n {cluster.n} --ntasks {cluster.tasks} --cpus-per-task {cluster.n} --output {cluster.output}' &
```

As you can see, it uses values from the `cluster_config.json`, which should look like this:

More specifically, `-n {cluster.n}` is used to specify the number of cores. I suspect that you may be missing this parameter in your `cluster_config.json` file?
Hi Francisco,

Ok, thank you for the information. Currently, my `cluster_config.json` file looks like this:
```json
{
    "__default__" : {
        "account" : "vsc42339",
        "time" : "0-4:00:00",
        "n" : 4,
        "tasks" : 1,
        "mem" : 20G,
        "name" : "DL.{rule}",
        "output" : "logs/{wildcards}.%N.{rule}.out.log",
    },
}
```
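As an aside, strict JSON parsers reject unquoted values like `20G` and trailing commas, so it can be worth sanity-checking the file. A quick sketch (assumes `python3` is on your PATH; it is written against a throwaway demo file, `cluster_config.demo.json`, rather than your real config):

```shell
# Write a minimal, syntactically valid example config and validate it.
cat > cluster_config.demo.json <<'EOF'
{
    "__default__" : {
        "account" : "myaccount",
        "time" : "0-4:00:00",
        "n" : 4,
        "tasks" : 1,
        "mem" : "20G",
        "name" : "DL.{rule}",
        "output" : "logs/{wildcards}.%N.{rule}.out.log"
    }
}
EOF
# json.tool exits non-zero and prints a parse error if the JSON is malformed.
python3 -m json.tool cluster_config.demo.json >/dev/null && echo "valid JSON"
```

Running `python3 -m json.tool` directly on your real `cluster_config.json` works the same way.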
I will try cluster mode again later and report back; for now I am running metaGEM in local mode.
closing due to inactivity, please reopen if issues arise
Hi Francisco,
I have a question regarding job submission on the cluster. Our university cluster uses Torque/Moab as a scheduler. Is it possible to change the scheduler used by the pipeline?
Kindly let me know. Thank you. Kunal
Hi Kunal,
Indeed it is possible: the pipeline was designed in Snakemake so that it could be deployed on any cluster. Have a look at this tutorial to see how you can submit jobs to a Torque/Moab cluster using Snakemake in particular. Hope this helps!
Best, Francisco
Thank you for your reply and the details. I will try it out.
Best, Kunal
Hi,
I'm also trying to run metaGEM on our HPC, and I was wondering how I can check that jobs are actually being submitted or running? I ran `bash metaGEM.sh -t fastp -j 43 -c 4 -m 20 -h 3` in a tmux window, and now it seems to be stuck in:

Is this normal?
Our HPC recently moved from Torque PBS to Slurm for the resource management and job scheduling software, and they rewrote the job wrappers and everything. But normal procedure on the HPC is still to submit jobs using qsub and a job script you wrote, such as: `qsub metagem.pbs`
Any idea whether metagem would work in cluster mode on our HPC?
I also tried running it using `--local` like I do on our local machine, but that doesn't work on the HPC because metaGEM is interactive (the y/n questions).