split_qscore.py ignoring reads

VandaLovejoy commented 7 months ago

Hello, I'm having issues with the split_qscore.py script. I wanted to filtered the output from basecalling done by Dorado using this: I completed with no error message and I got a stats.txt file like this:



pass_reads: 20080458
fail_reads: 1350286
total_reads: 21430744
pass fraction: 93.7%

#!/bin/bash
#SBATCH --job-name=split
#SBATCH --output=split_%j.out
#SBATCH --error=split_%j.err
#SBATCH --ntasks=1
#SBATCH --time=05:00:00
#SBATCH --mem=2G

module load StdEnv/2023 scipy-stack/2023b
module load python
module load samtools
samtools view -h dorado_basecalling_merge.bam | python3 split_qscore.py -q 10 -f fastq - ./ > stats.txt

Issue is, the bam file contains 4 times the total read number in the split file (real number of reads 85722924) and only a fraction of the reads were actually filtered.

Psy-Fer commented 7 months ago

Hey,

Switch the -f fastq argument to -f Sam, as the output of samtools view on a bam file is Sam format.

The difference in reads is because the fastq format is 1 read per 4 lines and Sam is 1 read per line.

Hope that helps.

James

VandaLovejoy commented 7 months ago

yes! output is now correct!

Psy-Fer / buttery-eel

split_qscore.py ignoring reads #35