alexdobin / STAR

RNA-seq aligner
MIT License

How to understand “unmapped: other” reads #1493

Open cherrie-g opened 2 years ago

cherrie-g commented 2 years ago

Hi, my question is as in the title. I used STAR to map single-end RNA-seq reads to a reference, and my final.out looks like this:

                      Number of input reads |       314991228
                  Average input read length |       100
                                UNIQUE READS:
               Uniquely mapped reads number |       89149963
                    Uniquely mapped reads % |       28.30%
                      Average mapped length |       97.16
                   Number of splices: Total |       6928745
        Number of splices: Annotated (sjdb) |       6422837
                   Number of splices: GT/AG |       6566863
                   Number of splices: GC/AG |       72028
                   Number of splices: AT/AC |       14874
           Number of splices: Non-canonical |       274980
                  Mismatch rate per base, % |       0.40%
                     Deletion rate per base |       0.09%
                    Deletion average length |       1.81
                    Insertion rate per base |       0.01%
                   Insertion average length |       1.22
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       7547966
         % of reads mapped to multiple loci |       2.40%
    Number of reads mapped to too many loci |       2506331
         % of reads mapped to too many loci |       0.80%
                               UNMAPPED READS:
Number of reads unmapped: too many mismatches |       0
   % of reads unmapped: too many mismatches |       0.00%
        Number of reads unmapped: too short |       88096690
             % of reads unmapped: too short |       27.97%
            Number of reads unmapped: other |       127690278
                 % of reads unmapped: other |       40.54%
                               CHIMERIC READS:
                   Number of chimeric reads |       0
                        % of chimeric reads |       0.00%

I noticed that the proportion of "unmapped: other" reads is high. However, after mapping the reads to the NT database, I am confident that the data are not contaminated. So I'm curious: what types of reads are classified as "unmapped: other"?

Bests.
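
As a quick sanity check on the summary above, the read counts can be added up to confirm that every input read falls into exactly one of the reported categories. The following is a minimal Python sketch, not part of STAR; the function names `parse_counts` and `category_fractions` are made up for illustration, and the counts are copied from the post (the chimeric line is left out because it is 0 here).

```python
# Counts copied from the final.out summary posted above.
SUMMARY = """\
Number of input reads | 314991228
Uniquely mapped reads number | 89149963
Number of reads mapped to multiple loci | 7547966
Number of reads mapped to too many loci | 2506331
Number of reads unmapped: too many mismatches | 0
Number of reads unmapped: too short | 88096690
Number of reads unmapped: other | 127690278
"""

TOTAL_LABEL = "Number of input reads"


def parse_counts(text):
    """Return {label: count} for every 'label | integer' line."""
    counts = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        label, value = (part.strip() for part in line.split("|", 1))
        if value.isdigit():
            counts[label] = int(value)
    return counts


def category_fractions(counts):
    """Return each category's share of the input reads."""
    total = counts[TOTAL_LABEL]
    return {label: n / total for label, n in counts.items() if label != TOTAL_LABEL}


counts = parse_counts(SUMMARY)
fractions = category_fractions(counts)
accounted = sum(n for label, n in counts.items() if label != TOTAL_LABEL)

for label, fraction in fractions.items():
    print(f"{label}: {fraction:.2%}")
print("all input reads accounted for:", accounted == counts[TOTAL_LABEL])
```

With these numbers the categories sum exactly to the 314991228 input reads, so the 40.54% "unmapped: other" and 27.97% "too short" reads together explain most of the gap to the 28.30% uniquely mapped rate.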

alexdobin commented 2 years ago

Hi @cherrie-g

unmapped-other are the reads for which the anchor seeds are not found. The two main reasons for that are (i) poor read quality and (ii) "contamination" with sequences not present in the reference genome. If you take a few of these unmapped-other reads and BLAST them, where do they map?

Best, Alex
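
One possible way to pull out a few such reads for BLAST is sketched below in Python. It rests on assumptions that are not stated in this thread: that STAR was run with `--outSAMunmapped Within` so unmapped reads appear in the SAM output, and that those records carry the `uT:A:0` attribute, which the STAR manual describes as marking the "unmapped other" reason. The helper names are hypothetical, and the toy SAM lines are invented for the demonstration.

```python
from itertools import islice


def unmapped_other_records(sam_lines):
    """Yield (name, seq, qual) for unmapped SAM records tagged uT:A:0.

    Assumption: the uT attribute is present and 0 means "unmapped other".
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4 and "uT:A:0" in fields[11:]:
            yield fields[0], fields[9], fields[10]


def mean_phred(qual, offset=33):
    """Mean base quality, assuming Phred+33 encoding; None if quality is absent."""
    if not qual or qual == "*":
        return None
    return sum(ord(char) - offset for char in qual) / len(qual)


def first_reads_as_fasta(sam_lines, k=5):
    """Format the first k unmapped-other reads as FASTA, ready to paste into BLAST."""
    entries = []
    for name, seq, qual in islice(unmapped_other_records(sam_lines), k):
        quality = mean_phred(qual)
        note = "" if quality is None else f" mean_q={quality:.1f}"
        entries.append(f">{name}{note}\n{seq}\n")
    return "".join(entries)


# Toy records, invented for illustration only.
toy_sam = [
    "@HD\tVN:1.4",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tNH:i:0\tuT:A:0",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\t####\tNH:i:0\tuT:A:1",
    "r3\t0\tchr1\t100\t255\t4M\t*\t0\t0\tTTTT\tIIII\tNH:i:1",
]
fasta = first_reads_as_fasta(toy_sam)
print(fasta, end="")
```

Printing the mean quality next to each read gives a rough first look at reason (i), poor read quality, before BLAST is used to check reason (ii).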

cherrie-g commented 2 years ago

Thanks Alex. I ran BLAST on the unmapped-other reads and they do map to my species. One confusing thing: I tried mapping my reads to the reference with several tools, such as HISAT2 and bwa. HISAT2 gave a mapping rate of about 44%, which I consider comparable to STAR, but with bwa the mapping rate for the same read set reached 91%. The bwa result indicates that there is no contamination, but I am confused by such a large difference in mapping rate. I used all three tools with default parameters. Do you know what could cause this difference?

Bests.

alexdobin commented 2 years ago

Hi @cherrie-g

Please post the SAM lines for a few reads that are mapped by BWA but not by HISAT2/STAR. This may give us a hint about the issue.

Best, Alex
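
A rough Python sketch of how such reads could be collected is given below. It assumes both aligners produced plain-text SAM for the same single-end read set, treats a read as mapped when any of its records has FLAG bit 0x4 unset, and treats a read that is missing from the STAR output as unmapped (STAR omits unmapped reads from the SAM by default). The function names and the toy records are hypothetical.

```python
def mapped_read_names(sam_lines):
    """Return the names of reads with at least one mapped record (FLAG 0x4 unset)."""
    names = set()
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.split("\t")
        if not int(fields[1]) & 0x4:
            names.add(fields[0])
    return names


def bwa_only_lines(bwa_lines, star_lines, limit=5):
    """Return up to `limit` BWA SAM lines for reads mapped by BWA but not by STAR."""
    star_mapped = mapped_read_names(star_lines)
    picked = []
    for line in bwa_lines:
        if line.startswith("@") or not line.strip():
            continue
        fields = line.split("\t")
        if not int(fields[1]) & 0x4 and fields[0] not in star_mapped:
            picked.append(line.rstrip("\n"))
            if len(picked) == limit:
                break
    return picked


# Toy records, invented for illustration only.
bwa_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t50\t60\t2S2M\t*\t0\t0\tGGCC\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
star_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII",
]
for sam_line in bwa_only_lines(bwa_sam, star_sam):
    print(sam_line)
```

Holding every STAR-mapped read name in a set is memory-hungry for a data set of this size, so in practice it may be easier to run this on a subsample of the reads. The CIGAR column of the reported lines is worth a look, since it shows how much of each read BWA actually aligned.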