alexdobin / STAR

RNA-seq aligner
MIT License

Make the "genome BAM to transcriptome BAM" a separate run mode #2020

Open zxl124 opened 8 months ago

zxl124 commented 8 months ago

STAR already provides an awesome function for generating a transcriptome alignment for downstream tools such as RSEM and Salmon, via --quantMode TranscriptomeSAM. As far as I understand, the alignment is still performed against the genome, and the genome BAM is then "translated" into a transcriptome BAM file by finding overlaps between alignments and transcripts. Is it possible to make this "translation" function available as a run mode? In other words: the input is a genome BAM plus a GTF file, and the output is a transcriptome BAM.
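To make the "translation" step concrete, here is a minimal sketch of mapping a single genomic position onto one transcript's coordinates. It is only an illustration of the idea, not STAR's actual implementation: the function name `genome_to_transcript`, the 0-based half-open exon intervals, and the example coordinates are all assumptions made for this example. A real converter would also have to check that every aligned block and splice junction of a read is compatible with the transcript.

```python
def genome_to_transcript(pos, exons, strand="+"):
    """Map a 0-based genomic position to a 0-based transcript offset.

    exons:  list of (start, end) genomic intervals, 0-based, half-open
            (hypothetical representation of one transcript from a GTF).
    strand: "+" or "-"; on "-" the offset is counted from the transcript's
            5' end, i.e. from the rightmost genomic base.
    Returns None if the position falls in an intron or outside the transcript.
    """
    exons = sorted(exons)
    transcript_length = sum(end - start for start, end in exons)
    offset = 0  # number of exonic bases to the left of the current exon
    for start, end in exons:
        if start <= pos < end:
            t = offset + (pos - start)
            return t if strand == "+" else transcript_length - 1 - t
        offset += end - start
    return None


# Hypothetical two-exon transcript: exons of 100 bp each.
example_exons = [(100, 200), (300, 400)]
print(genome_to_transcript(150, example_exons))        # inside exon 1
print(genome_to_transcript(310, example_exons))        # inside exon 2
print(genome_to_transcript(250, example_exons))        # intronic position
print(genome_to_transcript(150, example_exons, "-"))   # minus-strand transcript
```

Because one genomic position can lie inside exons of several transcripts, the same read is emitted once per compatible transcript, which is what leads to the deduplication issue described next.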

The reason I ask is that I have total RNA-seq data with UMIs (not single-cell data). My older protocol (STAR, then umi_tools, then featureCounts) works well because deduplication is based on genomic alignment coordinates. However, I want to switch from featureCounts to Salmon for transcript-level quantification. I ran STAR with --quantMode TranscriptomeSAM. When I feed the transcriptome BAM to umi_tools, removal of all PCR duplicates is not guaranteed: deduplication is done per transcript, each read can align to multiple transcripts, and umi_tools picks the kept read at random when mapping qualities are equal. For example, if I have 3 reads that are PCR duplicates with the same UMI, aligned to the same exon at the same position, only one read will remain after deduplication on the genome BAM. However, if 3 different transcripts use that exon, deduplication will happen 3 times, each time keeping a random read, so it is possible for all 3 reads to survive.
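The 3-reads, 3-transcripts example can be quantified with a small enumeration. This is a sketch under a simplifying assumption that is not taken from the umi_tools documentation: on each transcript, one of the duplicate reads is kept uniformly at random, independently of the other transcripts. The function name `survivor_distribution` is hypothetical. A read counts as surviving if it is kept on at least one transcript.

```python
from fractions import Fraction
from itertools import product


def survivor_distribution(n_reads, n_transcripts):
    """Exact distribution of the number of distinct reads that survive.

    Assumption (for illustration only): deduplication runs once per
    transcript and keeps one of the n_reads duplicates uniformly at random,
    independently across transcripts.
    Returns {number of distinct surviving reads: probability}.
    """
    # Each outcome is a tuple: which read was kept on transcript 0, 1, ...
    outcomes = list(product(range(n_reads), repeat=n_transcripts))
    weight = Fraction(1, len(outcomes))
    dist = {}
    for kept in outcomes:
        survivors = len(set(kept))
        dist[survivors] = dist.get(survivors, 0) + weight
    return dist


dist = survivor_distribution(3, 3)
for survivors in sorted(dist):
    print(f"{survivors} distinct read(s) survive with probability {dist[survivors]}")

expected = sum(k * p for k, p in dist.items())
print(f"expected number of surviving reads: {expected}")
```

Under this assumption, exactly one read survives in only 1 out of 9 cases, whereas deduplication on the genome BAM always leaves exactly one.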

If STAR provided a function to translate a genome BAM into a transcriptome BAM, I could run STAR on the genome, deduplicate, translate to a transcriptome BAM, and the problem would be solved. Since the code for this function already exists, I am hoping that making it a run mode won't be too hard.

If you have other suggestions that would solve this problem, they would be welcome as well. I briefly looked at STARsolo, but it won't work because my data have no cell barcodes, only UMIs.

Thank you in advance.

alexdobin commented 8 months ago

Hi Zhenfeng,

This cannot be done with STAR at present, but I have a vague recollection that there are tools that can do it. You could try asking about it on RNA-seq analysis forums.