alexdobin / STAR

RNA-seq aligner
MIT License
1.78k stars 497 forks source link

"Uniquely mapped reads %" low, "% of reads unmapped: too short" high #1823

Open mholtz2 opened 1 year ago

mholtz2 commented 1 year ago

I'm concerned about the quality of my aligned reads, and additionally the method I found in other issues for possibly fixing it. I would like confirmation that I made the right choice based on the statistics shown for my data, and if not, any suggestions for other changes I should make.

I ran STAR alignment for my RNA-seq reads after building an index with the GRCm39 ensembl genome and corresponding GTF file:

STAR --runMode genomeGenerate \
--genomeDir STAR_index \
--genomeFastaFiles $GENOME \
--runThreadN 28 \
--sjdbGTFfile $GTF

Then I mapped reads with this code (in a loop):

STAR --runMode alignReads \
--genomeDir STAR_index \
--readFilesIn IN/${sample}_L1_1${SUFFIX} IN/${sample}_L1_2${SUFFIX} \
--readFilesCommand zcat \
--runThreadN 28 \
--quantMode GeneCounts \
--outFileNamePrefix STAR_mapped/${sample} \
--outSAMtype BAM Unsorted

Which gave me output log info as such (similar stats across most of my samples):

          Started job on |  Apr 11 07:28:47
                             Started mapping on |   Apr 11 07:28:58
                                    Finished on |   Apr 11 07:45:40
       Mapping speed, Million of reads per hour |   355.31

                          Number of input reads |   98894688
                      Average input read length |   300
                                    UNIQUE READS:
                   Uniquely mapped reads number |   56273697
                        Uniquely mapped reads % |   56.90%
                          Average mapped length |   282.46
                       Number of splices: Total |   39595060
            Number of splices: Annotated (sjdb) |   39164698
                       Number of splices: GT/AG |   39124238
                       Number of splices: GC/AG |   331760
                       Number of splices: AT/AC |   48253
               Number of splices: Non-canonical |   90809
                      Mismatch rate per base, % |   0.43%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.88
                        Insertion rate per base |   0.05%
                       Insertion average length |   4.64
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   5614311
             % of reads mapped to multiple loci |   5.68%
        Number of reads mapped to too many loci |   96047
             % of reads mapped to too many loci |   0.10%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   36845780
                 % of reads unmapped: too short |   37.26%
                Number of reads unmapped: other |   64853
                     % of reads unmapped: other |   0.07%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

I saw on other issues where people also had low % of Uniquely mapped reads and most of the unmapped reads being "too short", adding the parameters --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 helped map more reads. I added these to the code above and got stats like this:

                                  Started job on |  Apr 11 09:48:36
                             Started mapping on |   Apr 11 09:49:08
                                    Finished on |   Apr 11 10:05:28
       Mapping speed, Million of reads per hour |   264.05

                          Number of input reads |   71880119
                      Average input read length |   300
                                    UNIQUE READS:
                   Uniquely mapped reads number |   53870251
                        Uniquely mapped reads % |   74.94%
                          Average mapped length |   221.85
                       Number of splices: Total |   14701638
            Number of splices: Annotated (sjdb) |   14468539
                       Number of splices: GT/AG |   14455875
                       Number of splices: GC/AG |   129474
                       Number of splices: AT/AC |   11103
               Number of splices: Non-canonical |   105186
                      Mismatch rate per base, % |   1.39%
                         Deletion rate per base |   0.04%
                        Deletion average length |   3.28
                        Insertion rate per base |   0.19%
                       Insertion average length |   4.94
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   10297567
             % of reads mapped to multiple loci |   14.33%
        Number of reads mapped to too many loci |   52875
             % of reads mapped to too many loci |   0.07%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   7639225
                 % of reads unmapped: too short |   10.63%
                Number of reads unmapped: other |   20201
                     % of reads unmapped: other |   0.03%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

Although I successfully increased the number of mapped genes, I see that the % of reads mapped to multiple loci also increased. I am fairly new to bioinformatics so any help in interpreting my original results and understanding if the change in parameters I made was the right decision. Thanks in advance!

alexdobin commented 1 year ago

Hi @mholtz2

Reduction of mapping filter stringency indeed results in an increase in multi-mapping as well as an increase in false alignments. This is the usual trade-off between sensitivity and precision. I generally recommend against it. The best approach is to figure out the cause(s) for reads not mapping. One possibility for relatively long reads that you have is that they contain long adapter sequences - then the solution will be to trim the adapters.

mholtz2 commented 1 year ago

Hi Alex,

Thanks for your reply. I tried trimming adaptors with Trim Galore! and here are the stats now (same sample):

                                 Started job on |   Apr 13 10:48:48
                             Started mapping on |   Apr 13 10:49:05
                                    Finished on |   Apr 13 11:03:26
       Mapping speed, Million of reads per hour |   408.16

                          Number of input reads |   97618581
                      Average input read length |   271
                                    UNIQUE READS:
                   Uniquely mapped reads number |   74186546
                        Uniquely mapped reads % |   76.00%
                          Average mapped length |   268.00
                       Number of splices: Total |   48105182
            Number of splices: Annotated (sjdb) |   47572212
                       Number of splices: GT/AG |   47528183
                       Number of splices: GC/AG |   404568
                       Number of splices: AT/AC |   57554
               Number of splices: Non-canonical |   114877
                      Mismatch rate per base, % |   0.41%
                         Deletion rate per base |   0.01%
                        Deletion average length |   2.01
                        Insertion rate per base |   0.06%
                       Insertion average length |   4.76
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   8440337
             % of reads mapped to multiple loci |   8.65%
        Number of reads mapped to too many loci |   129707
             % of reads mapped to too many loci |   0.13%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   14769194
                 % of reads unmapped: too short |   15.13%
                Number of reads unmapped: other |   92797
                     % of reads unmapped: other |   0.10%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

Overall, the Uniquely mapped reads % went up by 10-20% across all my samples. Some are still around 60%. Do you think this is sufficient?

alexdobin commented 1 year ago

Hi @mholtz2

This looks quite good now. The best samples have <5% unmapped reads, so 15% is still a bit high. You can try explore the reads that did not map - remap them with BLAST.

mholtz2 commented 1 year ago

Hi Alex, I'm not familiar with using blast on the command line and it wasn't taking the unmapped reads file very well, so I just looked at some on the website. Most of the reads don't map to any gene in mouse (which my samples are from), and if they do it's with a fairly low score. Should I assume that most of the reads are contaminant genes?

alexdobin commented 1 year ago

Hi @mholtz2

I did not mean for you to use BLAST from command line, it's actually quite cumbersome. But you can map a few reads on the BLAST website, to the "nt" database to see what kind of contamination you may be getting.