alexdobin / STAR

RNA-seq aligner
MIT License
1.82k stars 502 forks source link

STAR compared to BWA performance #1275

Open RitaBogorad opened 3 years ago

RitaBogorad commented 3 years ago

Hello, everyone!

I am working on the analysis of the metatranscriptome of unicellular algae - non-model organism. I have a reference genome sequenced and published by my colleague. Due to the fact that this is my first metatranscriptome analysis, I started the analysis with mappers such as BWA and bowtie2 to map RNA reads to DNA (genome). This is considered a mistake when working with Eukaryote species because of the presence of introns. An obvious correction of this mistake was to use the splice-aware tool, such as STAR. However, I was curious to compare the results of BWA, bowtie to STAR. Just like many many other people here I faced the number of "too short" unmapped being super high ~60% and the STAR mapping percentage only 28% unique mapped reads. The mapping percentage with BWA was much higher - 64% mapped. I visualized the alignments in IGV and was quite impressed by the similarity of the results between the two mappers. The majority of the mapped reads produced both by BWA and STAR were at the same region on the genome and even the introns were detected in both cases. Also, comparing the results of featureCounts for both alignments, 42% of BWA reads are annotated, while only 12% are from the output of STAR. I will note that the % of annotated reads in my reference genome is also ~42%.

While I am being busy now trying to optimize STAR mapping parameters to improve the number of mapped reads, I cannot find a good explanation why BWA overperforms so much STAR in my case. Do you possibly have an idea why this could happen or what parameter can I play with to obtain better results with STAR?

Thank you in advance!

Rita

Some screenshots:

IGV: image

Output of STAR:

image

alexdobin commented 3 years ago

Hi Rita,

the % of mapped reads depends strongly on the filtering parameters. Since the mismatch rate is quite high (divergent genome?) and the reads are quite long (~300b), my first suggestion would be to increase the allowed mismatches: --outFilterMismatchNoverLmax 0.1 --outFilterMismatchNmax 30

Note that STAR will only output concordant paired-end alignments by default, but I am not sure about BWA. This may explain some of the differences.

Cheers Alex

RitaBogorad commented 3 years ago

Hello, Alex!

Thank you very much for your reply! In my case those settings (that I found in similar threads) helped to recover more reads --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 --outFilterMatchNmin 20

alexdobin commented 3 years ago

Hi Rita,

--outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 control another alignment filter which sets the minimum aligned length. Note that such settings will allow short alignments (including single-end alignments).

Cheers Alex