alexdobin / STAR

RNA-seq aligner
MIT License

What does the "% of reads unmapped: other" mean in Log.final.out #506

Closed SeadonXing closed 4 years ago

SeadonXing commented 5 years ago

Dear Alex: My question is the one in the title. Someone may have asked this before, but the only explanation I could find is here: https://pipe-star.readthedocs.io/en/latest/explain_star.html However, I think that explanation may be wrong in places, and I cannot follow it. Can you give some examples of reads that would be classified as "unmapped: other"? By the way, have you ever written an official manual explaining the meaning of each item in the Log.final.out file?

Best Seadon

alexdobin commented 5 years ago

Hi Seadon,

no, unfortunately, I do not have a detailed write-up for these. The explanations they have are not bad, just maybe not entirely clear. The unmapped-other category indeed means unmapped for reasons other than "too short" or "too many mismatches". Most commonly in this category, STAR cannot find good seeds, i.e. all seeds map too many times (the limit is 50 by default), which can also be thought of as all seeds being too short. This may be caused by reads coming from highly repetitive regions of the genome, but it can also be caused by contamination with unrelated species.

Cheers Alex
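Editorial sketch: the categories discussed above can be pulled out of a Log.final.out file with a few lines of Python. This is only an illustration, assuming the usual layout in which each statistic is written as a label, a `|` separator, and a value. The helper names `parse_log_final` and `percent` are hypothetical, and the sample text reuses the percentages quoted later in this thread rather than output from a real run.

```python
def parse_log_final(text):
    """Map each 'label | value' line of a Log.final.out-style report to a dict entry."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, _, value = line.partition("|")
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Return a percentage entry such as '5.42%' as a float."""
    return float(stats[key].rstrip("%"))


# Illustrative sample only (values taken from the discussion in this thread).
sample = """\
                      MULTI-MAPPING READS:
        % of reads mapped to multiple loci |\t12.00%
        % of reads mapped to too many loci |\t0.00%
                           UNMAPPED READS:
                % of reads unmapped: other |\t5.42%
"""

stats = parse_log_final(sample)
for label in stats:
    print(f"{label}: {percent(stats, label)}")
```

Comparing these values between runs is one way to check the effect of a parameter change on the "unmapped: other" fraction.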

samiraghazali commented 3 years ago

Dear Alex,

Because I am interested in repetitive elements, I increased the parameter --outFilterMultimapNmax to 1000. As you recommended for TEtranscripts, I set the parameter --outAnchorMultimapNmax to 2000 (twice the value set for --outFilterMultimapNmax). I thought this would allow STAR to search for all possible seeds (maximum 2000), and that if the entire read then mapped to more than 1000 locations, it would be considered unmapped and added to the category "% of reads mapped to too many loci". However, with these parameters, 30% of my reads are unmapped for reasons other than "too many mismatches" or "too short".

Keeping --outAnchorMultimapNmax at its default, I obtained: 12% of reads mapped to multiple loci, 0% of reads mapped to too many loci, and 5.42% of reads unmapped: other.

Would you please explain how --outAnchorMultimapNmax affects the mapping? Is there a maximum value for it, in the same way as there is a minimum (50)?

Thank you in advance for your help

Samira

alexdobin commented 3 years ago

Hi Samira,

--winAnchorMultimapNmax (not --outAnchorMultimapNmax) determines the number of loci a seed (i.e. an exactly matching part of the read) can map to. If a read is supposed to map to N locations, then --winAnchorMultimapNmax has to be > N; 2*N is a good starting point.

If you make --winAnchorMultimapNmax too large, you might hit other limitations that prevent reads from being mapped. It will also slow down the mapping. This is why I recommend increasing it gradually and checking the results. To avoid these limitations (if you really want to go far with --winAnchorMultimapNmax), you can try to increase these parameters:
--alignWindowsPerReadNmax: from default 10000
--alignTranscriptsPerWindowNmax: from default 100
--seedPerWindowNmax: from default 50
--seedNoneLociPerWindow: from default 10

You can also try --alignIntronMax 1, which will prohibit splicing. This may be needed for mapping repeated elements if they tend to appear within the standard splice window of ~600kb. If you are interested in spliced alignment, you will have to do a 2-step alignment: first, map with splicing allowed and a small value of --winAnchorMultimapNmax, and then remap the unmapped reads with a large --winAnchorMultimapNmax and --alignIntronMax 1.

Cheers Alex
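Editorial sketch: the 2-step alignment described above could be assembled roughly as follows. This is not a procedure taken from the thread but one possible reading of it, under several assumptions: single-end reads, unmapped reads carried from the first pass to the second with `--outReadsUnmapped Fastx` (which writes `Unmapped.out.mate1` under the output prefix), and the N and 2*N rule applied to --outFilterMultimapNmax and --winAnchorMultimapNmax. The function name, file names, and prefixes are hypothetical, and the code only builds the command lines without running STAR.

```python
import shlex


def two_step_commands(genome_dir, fastq, n_loci,
                      prefix1="pass1_", prefix2="pass2_"):
    """Build (but do not run) two STAR command lines for a 2-step alignment.

    Pass 1: splicing allowed, default anchor limit, unmapped reads saved.
    Pass 2: unmapped reads only, no splicing, anchor limit raised to 2 * n_loci.
    """
    common = ["STAR", "--runMode", "alignReads", "--genomeDir", genome_dir]
    pass1 = common + [
        "--readFilesIn", fastq,
        "--outFileNamePrefix", prefix1,
        # Assumption: unmapped reads are written out for the second pass.
        "--outReadsUnmapped", "Fastx",
    ]
    pass2 = common + [
        "--readFilesIn", prefix1 + "Unmapped.out.mate1",
        "--outFileNamePrefix", prefix2,
        "--outFilterMultimapNmax", str(n_loci),
        # Seeds must be allowed to map to more than N loci; 2*N as a start.
        "--winAnchorMultimapNmax", str(2 * n_loci),
        # Prohibit splicing in the repeat-focused pass.
        "--alignIntronMax", "1",
    ]
    return pass1, pass2


# Hypothetical inputs, for illustration only.
pass1, pass2 = two_step_commands("genome_index", "reads.fastq", n_loci=1000)
print(shlex.join(pass1))
print(shlex.join(pass2))
```

Starting with a moderate `n_loci` and raising it between runs follows the advice above to increase the anchor limit gradually while checking the mapping statistics.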

samiraghazali commented 3 years ago

Dear Alex,

My problem is solved. Thank you for your help.

Best regards,

Samira

wanisajad commented 3 weeks ago

Hi Alex, I am interested in transposable elements and use STAR to align both RNA-seq and ChIP-seq data, and I am confused about "% of reads mapped to multiple loci" versus "% of reads unmapped: other". I have observed a difference in mapping statistics between my RNA-seq and ChIP-seq (H3K27ac and H3K9me3) data when retaining multimapped reads. Specifically, in RNA-seq, I see a high percentage of "Reads Mapped to Multiple Loci" (around 32-36%) and a negligible percentage of "Reads Unmapped: Other". In contrast, for ChIP-seq, "Reads Mapped to Multiple Loci" is much lower (around 2-6%), while "Reads Unmapped: Other" is significantly higher (around 7-20%). Could you explain why this difference occurs? Also, H3K9me3 has a higher "% of reads unmapped: other" than H3K27ac or input. Is this difference due to the sequencing techniques or to the analysis parameters? And which category contains the reads coming from highly repetitive regions of the genome: "% of reads mapped to multiple loci" or "% of reads unmapped: other"?

Parameters for RNA-seq:
--runMode alignReads \
--outFilterMultimapNmax 5000 \
--outFilterMismatchNmax 3 \
--outMultimapperOrder Random \
--winAnchorMultimapNmax 5000 \
--alignEndsType EndToEnd \
--alignIntronMax 1 \
--alignMatesGapMax 350 \
--seedSearchStartLmax 30 \
--alignTranscriptsPerReadNmax 30000 \
--alignWindowsPerReadNmax 30000 \
--alignTranscriptsPerWindowNmax 300 \
--seedPerReadNmax 3000 \
--seedPerWindowNmax 300 \
--seedNoneLociPerWindow 1000

Parameters for ChIP-seq:
--runMode alignReads \
--outFilterMultimapNmax 5000 \
--outMultimapperOrder Random \
--outFilterMismatchNmax 2 \
--winAnchorMultimapNmax 5000 \
--alignEndsType EndToEnd \
--alignIntronMax 1 \
--alignMatesGapMax 500 \
--peOverlapNbasesMin 20

Thanks, Sajad @alexdobin