MikeAxtell / ShortStack

ShortStack: Comprehensive annotation and quantification of small RNA genes
MIT License
88 stars 29 forks source link

how to reserve the U&P reads of alignment bam file? #154

Closed Yuzhang0102 closed 1 week ago

Yuzhang0102 commented 2 months ago

Hi Mike, I used ShortStack4 aligned my sRNA reads to genome, and I got the alignment_details.tsv and many bam files. And I have annotation file predicted by other softwares, so I want to call counts using bam and annotation file by featureCounts.

But I need to filter reads that were labelled "R","H","N" in alignment_details.tsv, how can i filter them from bam files?

Thanks a lot.

MikeAxtell commented 2 months ago

BAM files made with ShortStack aligners contain the custom tag 'XY:Z:[x]', where [x] is the mapping type. x will be one of:

U: Uniquely mapped (not a multimapper). P: Multimapper placed using the method set by option --mmap. R: Multimapper placed at random. H: Very highly multi-mapped read (>=50 hits). N: Unmapped reads.

To filter the bam file to retain only P and U reads, you can use samtools view like so:

samtools view -b -d XY:P -d XY:U merged_alignments.bam > filtered.bam

Be careful though. Throwing out a large subset of your reads may not be a good idea. R reads and H reads still exist, even if their genomic origin is not 100% clear. Note that ShortStack selects a single genomic location for such reads from all possible ones, so it's not like there are 50 alignment lines for a single multimapped read.