Closed kenneditodd closed 1 year ago
Since I am submitting each command as a slurm job, would each chunk require its own submission?
In general, one chunk per cluster submission.
Speed wise, for example, is 10 chunks with 10 threads much faster than submitting 1 job with 100 threads?
CCS scales linearly with the number of threads. We even run it with 256 threads on dual 64 physical core CPUs. But this only works if IO is not rate limiting. If you see that not all 100 threads are saturated, then you have an IO bottleneck and are better off splitting it into smaller chunks.
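The one-job-per-chunk approach above can be sketched as a simple loop that builds one ccs command per chunk; in practice each command would be wrapped in its own sbatch submission. The input/output file names here are hypothetical placeholders:

```shell
#!/usr/bin/env bash
# Build one ccs invocation per chunk (10 chunks x 10 threads).
# Each generated command would normally be submitted as a separate
# cluster job; file names are placeholders.
nchunks=10
threads=10
cmds=()
for i in $(seq 1 "$nchunks"); do
  cmds+=("ccs --num-threads $threads --chunk $i/$nchunks subreads.bam chunk$i.ccs.bam")
done
printf '%s\n' "${cmds[@]}"
```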
@armintoepfer
I keep getting an error that says ccs ERROR: Wrong format for --chunk, please provide two integers separated by a slash like 2/10. First number must be less than the second number. Both must be positive and greater than 0.
I created a script that passes 3 variables ($chunk, $file, $sample) to the script shown below. Note the $chunk variable just iterates from 1 to 20.
# Set variables
out=../chunks/"$sample"_chunk"$chunk"_ccs.bam
num=$(($chunk)) # convert to integer
# check if variable is integer
re='^[0-9]+$'
if ! [[ $num =~ $re ]] ; then
echo "error: Not a number" >&2; exit 1
fi
# Generate circular consensus sequencing (CCS) reads from subreads
# 20 chunks, each chunk has 50 threads
ccs --num-threads 50 --chunk $num/20 $file $out
In my code I even test whether the variable being passed is an integer, and it is - why isn't this working? Also, I noticed the error message says the first number must be less than the second number. However, it should say less than OR EQUAL TO the second number, based on the documentation shown for parallelization.
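The I/N semantics being debated here can be checked in plain bash. This is a hypothetical helper mirroring what the documentation describes (two positive integers, first less than or equal to the second); it is not part of ccs itself:

```shell
#!/usr/bin/env bash
# valid_chunk I/N: true if I and N are positive integers and I <= N,
# mirroring the documented --chunk format (hypothetical helper).
valid_chunk() {
  [[ $1 =~ ^([0-9]+)/([0-9]+)$ ]] || return 1
  local i=${BASH_REMATCH[1]} n=${BASH_REMATCH[2]}
  (( i >= 1 && i <= n ))
}
valid_chunk 2/10  && echo "2/10 ok"    # first < second
valid_chunk 10/10 && echo "10/10 ok"   # equal is also valid
valid_chunk 11/10 || echo "11/10 bad"  # first > second is rejected
```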
The less than is a small typo, correct.
Maybe try with --chunk ${num}/20. This is not a ccs issue, but your bash script.
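One common bash-side cause of this error is invisible whitespace (a trailing carriage return from Windows line endings, or a stray space) riding along in the variable passed between scripts, which the integer regex above would catch only if the pollution were non-numeric in the right place. A minimal sketch of sanitizing the variable before building the argument (the polluted value is simulated):

```shell
#!/usr/bin/env bash
# Simulate a chunk variable polluted by a carriage return, as can
# happen when values are passed between scripts or files with CRLF
# line endings (hypothetical example).
chunk=$'3\r'
# Strip spaces, tabs, carriage returns, and newlines before use.
num=${chunk//[$'\t\r\n ']/}
arg="${num}/20"
echo "$arg"   # prints 3/20
```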
@armintoepfer Still no luck.
I got this to work by submitting an array job to slurm!
#!/bin/bash
#SBATCH --job-name ccs
#SBATCH --ntasks 50
#SBATCH --mem 50GB
#SBATCH --time 20:00:00
#SBATCH --output logs/%x.%N.%j.stdout
#SBATCH --error logs/%x.%N.%j.stderr
#SBATCH --array 1-10
#SBATCH --partition cpu-short
# Source settings
source $HOME/.bash_profile
# View array info
echo Array task = $SLURM_ARRAY_TASK_ID
echo Array count = $SLURM_ARRAY_TASK_COUNT
# Set variables
in=sample_subreads.bam
out=sample_chunk"$SLURM_ARRAY_TASK_ID".ccs.bam
# Run CCS
ccs --num-threads 50 --chunk $SLURM_ARRAY_TASK_ID/$SLURM_ARRAY_TASK_COUNT $in $out
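Once every array task finishes, the per-chunk outputs typically need to be merged into a single BAM; pbmerge (from PacBio's pbbam tooling) or samtools merge are the usual choices, though which applies here is an assumption. This sketch only assembles the merge command from the chunk names used in the script above:

```shell
#!/usr/bin/env bash
# Assemble a merge command for the 10 chunk outputs produced by the
# array job above. pbmerge is assumed to be available; samtools merge
# would also work for plain BAM concatenation.
files=""
for i in $(seq 1 10); do
  files+="sample_chunk${i}.ccs.bam "
done
merge_cmd="pbmerge -o sample.ccs.bam ${files% }"
echo "$merge_cmd"
```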
Hello,
I am trying to parallelize the ccs command. I am still confused after reading the ccs.how. I have access to a cluster and want to improve speed. I see that there is a chunk option and a thread option. Since I am submitting each command as a slurm job, would each chunk require its own submission? So, if I were to do 10 chunks, then I need to submit 10 jobs? Speed wise, for example, is 10 chunks with 10 threads much faster than submitting 1 job with 100 threads?