High % of reads unmapped too short in STAR results

ConcettaDe4 commented 3 years ago

Hi! I am using STAR to map my reads to Mouse genome. The RNA-seq libraries are not size-selected and they have been sequenced in 151 PE. When I built the genome index I used the option --sjdbOverhang 150. After the mapping of reads I obtained the following statistics mapping:

         Started job on |       Dec 31 12:57:17
                             Started mapping on |       Dec 31 12:57:38
                                    Finished on |       Dec 31 13:15:39
       Mapping speed, Million of reads per hour |       90.88

                          Number of input reads |       27290661
                      Average input read length |       279
                                    UNIQUE READS:
                   Uniquely mapped reads number |       21943162
                        Uniquely mapped reads % |       80.41%
                          Average mapped length |       281.18
                       Number of splices: Total |       20746559
            Number of splices: Annotated (sjdb) |       20546835
                       Number of splices: GT/AG |       20503681
                       Number of splices: GC/AG |       178304
                       Number of splices: AT/AC |       26543
               Number of splices: Non-canonical |       38031
                      Mismatch rate per base, % |       0.45%
                         Deletion rate per base |       0.02%
                        Deletion average length |       2.50
                        Insertion rate per base |       0.01%
                       Insertion average length |       2.16
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       1147755
             % of reads mapped to multiple loci |       4.21%
        Number of reads mapped to too many loci |       27195
             % of reads mapped to too many loci |       0.10%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       4157861
                 % of reads unmapped: too short |       15.24%
                Number of reads unmapped: other |       14688
                     % of reads unmapped: other |       0.05%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

After I got this result, I checked why I got the 15.24% of unmapped reads. Based on other github post, I checked the presence of contamination into my samples, the quality of the reads and I mapped the R1 and R2 separately. When I mapped the reads R1 and R2 separately I got an high percentage of reads mapping. By contrary, I noticed that the R2 of my sample have low quality at the beginning of the sequence (FastQC plot below).

Read R1 Read R2

Considering these plots, I decided to trim the first 10 bases of read R2 and subsequently I ran again reads mapping with same parameters and same genome index.

After reads mapping I obtained the following mapping results:

Number of input reads |       27399325
                      Average input read length |       270
                                    UNIQUE READS:
                   Uniquely mapped reads number |       19490266
                        Uniquely mapped reads % |       71.13%
                          Average mapped length |       273.95
                       Number of splices: Total |       18225887
            Number of splices: Annotated (sjdb) |       18045447
                       Number of splices: GT/AG |       18008414
                       Number of splices: GC/AG |       156666
                       Number of splices: AT/AC |       23652
               Number of splices: Non-canonical |       37155
                      Mismatch rate per base, % |       0.46%
                         Deletion rate per base |       0.02%
                        Deletion average length |       2.48
                        Insertion rate per base |       0.01%
                       Insertion average length |       2.14
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       964650
             % of reads mapped to multiple loci |       3.52%
        Number of reads mapped to too many loci |       23413
             % of reads mapped to too many loci |       0.09%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       6904371
                 % of reads unmapped: too short |       25.20%
                Number of reads unmapped: other |       16625
                     % of reads unmapped: other |       0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

I was wondering why the mapping rate decrease about 10% when I trim first 10 bases of reads R2. Trimming the low quality bases of R2 I was expecting to increase the quality mapping. Thank you for your help.

Concetta

alexdobin commented 3 years ago

Hi Concetta,

your libraries are 2x150, but Average input read length = 279, which means you are doing some (adapter?) trimming the reads before mapping in the first case? So when you are doing the 2nd trimming of 10b, is it done after the 1st trimming? You are right - trimming bases should not result in an increase of the unmapped reads %.

Cheers Alex

ConcettaDe4 commented 3 years ago

Hi! Thank you for your answer. For both alignment, before reads mapping I performed reads trimming. In particular in the second alignment I trimmed at 5'end of R2 read the first 10 bases. Subsequently I performed reads trimming to remove the adapter. After adapter trimming, I mapped the trimmed reads to the genome.

Looking at bam files I observed that PE reads with overlapping mates (short insert size) that have been aligned in the first mapping are unmapped in the second run of alignment (mapping with R2 trimmed at 5'end) .

I deduced that the low reads mapping rate in the second alignment could be due to the short insert size. So to improve mapping rates I performed a third alignment including in the command the option --peOverlapNbasesMin 5. Looking at the bam file overlapping reads mapped in the first mapping now (in the third alignment) are mapped again. Below there is a screenshot from genome browser of the alignment of overlapping reads.

In this case it is correct to use the option --peOverlapNbasesMin? In your opinion is it correct to put this option equal to 5 (--peOverlapNbasesMin 5)?

Thank you.

Concetta

alexdobin commented 3 years ago

Hi Concetta,

you found the correct explanation:

Looking at bam files I observed that PE reads with overlapping mates (short insert size) that have been aligned in the first mapping are unmapped in the second run of alignment (mapping with R2 trimmed at 5'end) .

I deduced that the low reads mapping rate in the second alignment could be due to the short insert size.

The problem with 5' trimming is that for mates with insert size < read length, one of the mates will be contained inside the other mate, which is not allowed with default parameters. --peOverlapNbasesMin 5 is definitely a good solution, though it slows down the mapping as the reads need to be mapped twice - as paired-end and merged single-end. A simpler solution could be to specify --alignEndsProtrude 10 (or even more if you trim adapter on 5').

Cheers Alex

alexdobin / STAR

High % of reads unmapped too short in STAR results #1131