PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License

parallelizing ccs command #598

Closed kenneditodd closed 1 year ago

kenneditodd commented 1 year ago

Hello,

I am trying to parallelize the ccs command, and I am still confused after reading ccs.how. I have access to a cluster and want to improve speed. I see that there is a chunk option and a thread option. Since I am submitting each command as a Slurm job, would each chunk require its own submission? So if I were to do 10 chunks, would I need to submit 10 jobs? Speed-wise, for example, is 10 chunks with 10 threads each much faster than submitting 1 job with 100 threads?

armintoepfer commented 1 year ago

Since I am submitting each command as a Slurm job, would each chunk require its own submission?

In general, one chunk per cluster submission.

Speed-wise, for example, is 10 chunks with 10 threads each much faster than submitting 1 job with 100 threads?

CCS scales linearly with the number of threads. We even run it with 256 threads on dual CPUs with 64 physical cores each. But this only works if I/O is not rate-limiting. If you see that not all 100 threads are saturated, then you have an I/O bottleneck and are better off splitting the job into smaller chunks.

kenneditodd commented 1 year ago

@armintoepfer

I keep getting this error:

ccs ERROR: Wrong format for --chunk, please provide two integers separated by a slash like 2/10. First number must be less than the second number. Both must be positive and greater than 0.

I created a script that passes 3 variables ($chunk, $file, $sample) to the script shown below. Note the $chunk variable just iterates from 1 to 20.

# Set variables
out=../chunks/"$sample"_chunk"$chunk"_ccs.bam
num=$(($chunk)) # convert to integer

# check if variable is integer
re='^[0-9]+$'
if ! [[ $num =~ $re ]] ; then
   echo "error: Not a number" >&2; exit 1
fi

# Generate circular consensus sequencing (CCS) reads from subreads
# 20 chunks, each chunk has 50 threads
ccs --num-threads 50 --chunk $num/20 $file $out

In my code I even test whether the variable being passed is an integer, and it is. Why isn't this working? Also, I noticed the error message says the first number must be less than the second number. However, it should say less than OR EQUAL TO the second number, based on the documentation shown for parallelization.

armintoepfer commented 1 year ago

Correct, the "less than" in the error message is a small typo.

Maybe try with --chunk ${num}/20. This is not a ccs issue but a problem in your bash script.
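One possible explanation, which this thread never confirms: if $chunk arrives empty or unset inside the job, the arithmetic expansion $(($chunk)) evaluates to 0. The integer check in the script above then passes, and ccs receives 0/20, which it rejects because both numbers must be greater than 0. The sketch below uses a hypothetical helper, chunk_spec (not part of ccs), that validates the raw values before building the --chunk argument.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of ccs): validate a chunk index and total.
# Prints "<index>/<total>" on success; prints an error to stderr and
# returns 1 if either value is empty, non-numeric, zero, or out of range.
chunk_spec() {
    local index="$1" total="$2"
    local re='^[1-9][0-9]*$'   # positive integer, no leading zero, not empty

    if ! [[ $index =~ $re && $total =~ $re ]]; then
        echo "error: chunk index '$index' and total '$total' must be positive integers" >&2
        return 1
    fi
    if (( index > total )); then
        echo "error: chunk index $index exceeds total $total" >&2
        return 1
    fi
    printf '%s/%s\n' "$index" "$total"
}

# The pitfall: arithmetic expansion of an empty variable yields 0,
# and 0 still matches the integer pattern '^[0-9]+$'.
chunk=""
num=$(($chunk))
echo "num=$num"
```

In the job script this could be used as `spec=$(chunk_spec "$chunk" 20) || exit 1`, followed by `ccs --num-threads 50 --chunk "$spec" "$file" "$out"`, so that an empty or zero value stops the job with a clear message before ccs is called.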

kenneditodd commented 1 year ago

@armintoepfer Still no luck.

kenneditodd commented 1 year ago

I got this to work by submitting an array job to slurm!

#!/bin/bash
#SBATCH --job-name ccs
#SBATCH --ntasks 50
#SBATCH --mem 50GB
#SBATCH --time 20:00:00
#SBATCH --output logs/%x.%N.%j.stdout
#SBATCH --error logs/%x.%N.%j.stderr
#SBATCH --array 1-10
#SBATCH --partition cpu-short

# Source settings
source $HOME/.bash_profile

# View array info
echo Array task = $SLURM_ARRAY_TASK_ID
echo Array count = $SLURM_ARRAY_TASK_COUNT

# Set variables
in=sample_subreads.bam
out=sample_chunk"$SLURM_ARRAY_TASK_ID".ccs.bam

# Run CCS
ccs --num-threads 50 --chunk $SLURM_ARRAY_TASK_ID/$SLURM_ARRAY_TASK_COUNT $in $out
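A possible variant of the array script above, offered as a sketch rather than a tested recipe. It assumes that ccs runs as a single multithreaded process, so it requests one task with --cpus-per-task instead of 50 tasks, because on many Slurm setups --ntasks asks for separate tasks that the scheduler may place on different nodes. The thread count is read from SLURM_CPUS_PER_TASK, and the fallback values of 1 are an assumption added here so the script can be dry-run outside Slurm.

```shell
#!/bin/bash
#SBATCH --job-name ccs
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 50
#SBATCH --mem 50GB
#SBATCH --time 20:00:00
#SBATCH --array 1-10
#SBATCH --partition cpu-short

# Values provided by Slurm; the ":-1" fallbacks are only for dry runs outside Slurm.
threads="${SLURM_CPUS_PER_TASK:-1}"
task="${SLURM_ARRAY_TASK_ID:-1}"
count="${SLURM_ARRAY_TASK_COUNT:-1}"

# Same file naming as the script above.
in=sample_subreads.bam
out="sample_chunk${task}.ccs.bam"

# Build the command as an array so every argument stays quoted.
cmd=(ccs --num-threads "$threads" --chunk "${task}/${count}" "$in" "$out")
echo "${cmd[*]}"

# Run only if ccs is available on PATH.
if command -v ccs >/dev/null 2>&1; then
    "${cmd[@]}"
else
    echo "ccs not found on PATH; command not run" >&2
fi
```

With --array 1-10, each array task would process chunk N of 10 and write its own output file, matching the one-chunk-per-submission advice earlier in the thread.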