alexdobin / STAR

RNA-seq aligner
MIT License
1.87k stars 506 forks source link

Stop search for multimappers #1268

Open Roleren opened 3 years ago

Roleren commented 3 years ago

Hi, thanks for all the great work you are doing.

A quick question I could not find the answer to in the manual or google:

Does there exist a similar way in STAR to achive what bowtie does with the -k argument ?

Explained here:

I am filtering out a large list of contaminants in a single fastq file: Silva Ribosomal fastq (2GB), phix, tRNAs, ncRNAs, and I want to keep only all reads that does not align (unmapped file with contaminant reads removed). Problem is now that when it multimaps (usually variants of the rRNA 5S, there are hundreds of them in Silva), it is kept (in the unmapped file), since if it multimaps too much it goes into the "unmapped" file, and I do not want to multimap at all, I only want to see if it hits once, and if it does it should stop checking for multimappers.

So is this possible to achive in STAR ?

alexdobin commented 3 years ago

Hi Håkon

you would need to increase the --outFilterMultimapNmax to a large value, to make it mappable. Also, if you expect more than 50 loci per read, you would also need to increase --winAnchorMultimapNmax to the expected number of loci - however, it will slow down the mapping if it's too big, so I would not go over 200-500. If you do not need the contaminant alignments, you can switch off the SAM output with --outSAMtype None. If you need to keep just one (or a few) contaminant alignments, you can use --outSAMmultNmax 1.

Cheers Alex

Roleren commented 3 years ago

Hm, so STAR has no equivalent to -k ? Setting --outFilterMultimapNmax would only make it find even more multimappers, I need it to stop after first hit, and directly ignore that read. Since I do not output aligned reads.

alexdobin commented 3 years ago

Hi Håkon

no, there is no such option for STAR, owing to a difference in the multimapper treatment. For Bowtie and other DNA aligners, multimappers are a nuisance as they cannot be used for variant calling. As soon as they find that a read maps to 2 loci, it can be discarded. For RNA-seq, multimappers represent the expression of paralogous loci, so they are useful, and STAR tries to find all loci that a read maps to.

Also, I am not exactly sure how the -k option in Bowtie2 works. Once you find 2 loci with equal mapping quality, how do you know that a 3rd locus will not give a better alignment? I guess there is some heuristics there enforcing that the first loci are of sufficient quality. Considering all loci is safer, but of course, it carries some computational overhead.

Cheers Alex

Roleren commented 3 years ago

Ok, I will do some tests and let you know how much mapping is slowed down, when I increase the allowed multimappers, then I will close the issue.