alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 505 forks source link

Too many reads unmapped by STAR ("too short") but they seem to map to rRNA genes #1617

Open vkkodali opened 2 years ago

vkkodali commented 2 years ago

I am trying to align the reads from SRR9640925 to this tick genome. I generated a genome index as follows:

$ STAR --version
2.7.8a
$ STAR \
  --runThreadN 8 --runMode genomeGenerate \
  --genomeDir GCF_013339745.2.index --genomeFastaFiles GCF_013339745.2.fa

Then, I aligned the reads using STAR as follows:

$ STAR \
  --runThreadN 16 --genomeDir GCF_013339745.2.index \
  --readFilesIn SRR9640925_1.fastq SRR9640925_2.fastq \
  --alignIntronMax 600000 --outFilterMismatchNoverLmax 0.15 \
  --alignSJoverhangMin 8 --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 50 --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes NH HI AS nM NM MD jM jI XS MC \
  --outSAMunmapped Within --outFileNamePrefix SRR9640925.

My final log indicates that >35% of the reads were unmapped:

                          Number of input reads |       41351845
                      Average input read length |       202
                                    UNIQUE READS:
                   Uniquely mapped reads number |       9597692
                        Uniquely mapped reads % |       23.21%
                          Average mapped length |       193.19
                       Number of splices: Total |       2389862
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       2357687
                       Number of splices: GC/AG |       27230
                       Number of splices: AT/AC |       421
               Number of splices: Non-canonical |       4524
                      Mismatch rate per base, % |       5.49%
                         Deletion rate per base |       0.14%
                        Deletion average length |       2.58
                        Insertion rate per base |       0.08%
                       Insertion average length |       2.34
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       2036822
             % of reads mapped to multiple loci |       4.93%
        Number of reads mapped to too many loci |       11735844
             % of reads mapped to too many loci |       28.38%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       5819
       % of reads unmapped: too many mismatches |       0.01%
            Number of reads unmapped: too short |       14627517
                 % of reads unmapped: too short |       35.37%
                Number of reads unmapped: other |       3348151
                     % of reads unmapped: other |       8.10%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

I aligned the paired-end reads separately using the same parameters and the results are not that much different. I still see ~30% of the reads being unmapped due to "too short" and ~17% of the reads unmapped due to "other" reason. I then extracted the reads that were not mapped by STAR and aligned ~60k paired-end reads using MagicBlast and found a large number of them to be aligning to rRNA genes without any mismatches. I am wondering why STAR failed to align these reads. Is there a specific parameter that needs to be changed such that these reads can be aligned?

vkkodali commented 2 years ago

Interestingly, if I take those ~60k paired-end reads that my first STAR run returned as unmapped and align them as if they were single-end using STAR, I am able to align at least a portion of them.

                          Number of input reads |       1250000
                      Average input read length |       101
                                    UNIQUE READS:
                   Uniquely mapped reads number |       20355
                        Uniquely mapped reads % |       1.63%
                          Average mapped length |       98.53
                       Number of splices: Total |       28
            Number of splices: Annotated (sjdb) |       0
                       Number of splices: GT/AG |       28
                       Number of splices: GC/AG |       0
                       Number of splices: AT/AC |       0
               Number of splices: Non-canonical |       0
                      Mismatch rate per base, % |       3.61%
                         Deletion rate per base |       0.01%
                        Deletion average length |       1.37
                        Insertion rate per base |       0.00%
                       Insertion average length |       1.29
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       1834
             % of reads mapped to multiple loci |       0.15%
        Number of reads mapped to too many loci |       466090
             % of reads mapped to too many loci |       37.29%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       0
       % of reads unmapped: too many mismatches |       0.00%
            Number of reads unmapped: too short |       165376
                 % of reads unmapped: too short |       13.23%
                Number of reads unmapped: other |       596345
                     % of reads unmapped: other |       47.71%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Over half of the reads are still unmapped... but all of them were unmapped before. So... what's going on?

alexdobin commented 2 years ago

Hi @vkkodali

a large portion of you reads map to "too many" loci: % of reads mapped to too many loci | 28.38% These could be the rRNA reads that often have many sequence-similar loci in the genome. You can move them into the multi-mapping category by increasing --outFilterMultimapNmax from the default 10.

vkkodali commented 2 years ago

Thank you @alexdobin I already have the parameter --outFilterMultimapNmax set at 20. Increasing it further to 50 does not make much difference. My final command line was:

STAR   \
  --runThreadN 16 --genomeDir GCF_013339745.2.index \
  --readFilesIn SRR9640925_1.fastq   SRR9640925_2.fastq \
  --outFileNamePrefix SRR9640925. --outSAMtype BAM   SortedByCoordinate \
  --outSAMattributes NH   HI   AS   nM   NM   MD   jM   jI   XS   MC --outSAMunmapped Within \
  --outFilterMultimapNmax 50 --outFilterMismatchNmax 50 --outFilterMismatchNoverLmax 0.15 \
  --alignIntronMax 600000 --alignSJoverhangMin 8

And the Log.final.out has the following:

                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       13669162
             % of reads mapped to multiple loci |       33.06%
        Number of reads mapped to too many loci |       103504
             % of reads mapped to too many loci |       0.25%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |       5819
       % of reads unmapped: too many mismatches |       0.01%
            Number of reads unmapped: too short |       14627517
                 % of reads unmapped: too short |       35.37%
                Number of reads unmapped: other |       3348151
                     % of reads unmapped: other |       8.10%
vkkodali commented 2 years ago

I am seeing more runs where STAR fails to align a significant number of reads. For example, SRR11216076 is a Illumina paired-end dataset with 250bp reads and when I try to align these reads to Arabidopsis genome, over 90% of the reads end up in the "unmapped: too short" category. Similarly, SRR7459873 is a BGI-SEQ dataset with reads ranging from 300-562 bp (avg 468 bp); every single one of the ~19000 reads were "unmapped: too short". How is this possible? In both of these cases, I was using the following parameters:

STAR --runThreadN 8 --genomeDir GCA_023115395.1.index --readFilesIn fastq_files/${sra_acc}.fastq \
  --alignIntronMax 100000 --outFilterMismatchNoverLmax 0.1 --alignSJoverhangMin 8 \
  --outFilterMultimapNmax 20 --outFilterMismatchNmax 50 --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes NH HI AS nM NM MD jM jI XS MC --outFileNamePrefix star_aligns/${sra_acc}.

If I align the BGI-SEQ reads using minimap2, I see all of them align. Almost all of the alignments cluster in a handful of regions suggesting that these reads are coming from only a few genes. Still, it is surprising to find all of them being skipped by STAR.

alexdobin commented 2 years ago

Hi @vkkodali

for reads longer than 300b you will need to use STARlong. I would also recommend to check the reads that were mapped by minimap2 but not STAR, to see if they have high error rate or short mapping length.

Cheers Alex