alexdobin / STAR

RNA-seq aligner
MIT License
1.83k stars 503 forks source link

Low mapping rate in RNA-seq #844

Open BhaktiDwivedi opened 4 years ago

BhaktiDwivedi commented 4 years ago

Hi, I have paired-end (2X100) RNA-seq data of variable post-trimmed length (2X36-100nt). For a good fraction of samples, I am getting very low uniquely mapped reads % and very high% of reads to multiple loci. For example, here is a log final output for one of the sample:

The main parameters I have changed in the STAR aligner v2.7.0e command are:

outSAMtype                        BAM   Unsorted   
outSAMmapqUnique                  60
outFilterMultimapNmax             20
chimSegmentMin                    30
sjdbOverhang                      99

Log.Final.out

                                Started job on |    Feb 21 20:23:16
                             Started mapping on |   Feb 21 20:23:41
                                    Finished on |   Feb 21 20:40:18
       Mapping speed, Million of reads per hour |   181.39

                          Number of input reads |   50234193
                      Average input read length |   160
                                    UNIQUE READS:
                   Uniquely mapped reads number |   11798222
                        Uniquely mapped reads % |   23.49%
                          Average mapped length |   169.54
                       Number of splices: Total |   4951770
            Number of splices: Annotated (sjdb) |   4882258
                       Number of splices: GT/AG |   4897563
                       Number of splices: GC/AG |   38185
                       Number of splices: AT/AC |   4354
               Number of splices: Non-canonical |   11668
                      Mismatch rate per base, % |   0.24%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.52
                        Insertion rate per base |   0.00%
                       Insertion average length |   1.63
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   30878245
             % of reads mapped to multiple loci |   61.47%
        Number of reads mapped to too many loci |   18843
             % of reads mapped to too many loci |   0.04%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   14.94%
                     % of reads unmapped: other |   0.07%
                                  CHIMERIC READS:
                       Number of chimeric reads |   13999
                            % of chimeric reads |   0.03%

Is there any other possible solution that I can use in STAR aligner to improve the mapping rate?

I understand this could be indicative of insufficient depletion of ribosomal RNA. So I used featureCounts with rRNA repeats annotation from RepeatMasker track to roughly estimate the rRNA levels in these libraries. For the same above sample , looks like, the rRNA mapping is almost 90%

when counting multi-mapping reads:

Total alignments : 126691323 
Successfully assigned alignments : 114589457 (90.4%)

when not counting multi-mapping reads:

Total alignments : 126691323 
Successfully assigned alignments : 2308221 (1.8%)

Is this a good way to explain why I have such high% of multi mapping reads? I would like to use these samples for expression quantification and differential expression analysis. But given such low%, I am not sure if I will get anything significant. Another thought process is to remove or mask rRNA reads from the alignment when quantifying data. Is this reasonably acceptable or what could be potential pitfalls? Appreciate any help and feedback.

Thank you!

alexdobin commented 4 years ago

Hi @BhaktiDwivedi

you found the right explanation - it looks like the rRNAs are the culprit, accounting for the most of the multimappers. It's probably best to discard them before doing gene quantification and DE, to avoid normalization artifacts. The remaining ~12M unique mappers might be enough for DE of medium to highly expressed genes.

Cheers Alex

BhaktiDwivedi commented 4 years ago

Thank you! @alexdobin

capricy commented 2 years ago

Hi @BhaktiDwivedi

you found the right explanation - it looks like the rRNAs are the culprit, accounting for the most of the multimappers. It's probably best to discard them before doing gene quantification and DE, to avoid normalization artifacts. The remaining ~12M unique mappers might be enough for DE of medium to highly expressed genes.

Cheers Alex

I recently got similar issue. Could you elaborate more about how low mapping rate would lead to "normalization artifacts"? My understanding is only uniquely mapped reads are counted...

Thanks a lot.

C.

alexdobin commented 2 years ago

Hi C, some of rRNA reads can map as unique mappers, so if a sample contains a large % of rRNA, and your annotations have rRNA genes, the normalization may get skewed, and it's best to exclude them. Cheers Alex