Open vkkodali opened 2 years ago
Interestingly, if I take those ~60k paired-end reads that my first STAR run returned as unmapped and align them as if they were single-end using STAR, I am able to align at least a portion of them.
Number of input reads | 1250000
Average input read length | 101
UNIQUE READS:
Uniquely mapped reads number | 20355
Uniquely mapped reads % | 1.63%
Average mapped length | 98.53
Number of splices: Total | 28
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 28
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 3.61%
Deletion rate per base | 0.01%
Deletion average length | 1.37
Insertion rate per base | 0.00%
Insertion average length | 1.29
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1834
% of reads mapped to multiple loci | 0.15%
Number of reads mapped to too many loci | 466090
% of reads mapped to too many loci | 37.29%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 165376
% of reads unmapped: too short | 13.23%
Number of reads unmapped: other | 596345
% of reads unmapped: other | 47.71%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Over half of the reads are still unmapped... but all of them were unmapped before. So... what's going on?
Hi @vkkodali
a large portion of you reads map to "too many" loci:
% of reads mapped to too many loci | 28.38%
These could be the rRNA reads that often have many sequence-similar loci in the genome.
You can move them into the multi-mapping category by increasing --outFilterMultimapNmax
from the default 10.
Thank you @alexdobin
I already have the parameter --outFilterMultimapNmax
set at 20. Increasing it further to 50 does not make much difference. My final command line was:
STAR \
--runThreadN 16 --genomeDir GCF_013339745.2.index \
--readFilesIn SRR9640925_1.fastq SRR9640925_2.fastq \
--outFileNamePrefix SRR9640925. --outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM NM MD jM jI XS MC --outSAMunmapped Within \
--outFilterMultimapNmax 50 --outFilterMismatchNmax 50 --outFilterMismatchNoverLmax 0.15 \
--alignIntronMax 600000 --alignSJoverhangMin 8
And the Log.final.out
has the following:
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 13669162
% of reads mapped to multiple loci | 33.06%
Number of reads mapped to too many loci | 103504
% of reads mapped to too many loci | 0.25%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 5819
% of reads unmapped: too many mismatches | 0.01%
Number of reads unmapped: too short | 14627517
% of reads unmapped: too short | 35.37%
Number of reads unmapped: other | 3348151
% of reads unmapped: other | 8.10%
I am seeing more runs where STAR fails to align a significant number of reads. For example, SRR11216076 is a Illumina paired-end dataset with 250bp reads and when I try to align these reads to Arabidopsis genome, over 90% of the reads end up in the "unmapped: too short" category. Similarly, SRR7459873 is a BGI-SEQ dataset with reads ranging from 300-562 bp (avg 468 bp); every single one of the ~19000 reads were "unmapped: too short". How is this possible? In both of these cases, I was using the following parameters:
STAR --runThreadN 8 --genomeDir GCA_023115395.1.index --readFilesIn fastq_files/${sra_acc}.fastq \
--alignIntronMax 100000 --outFilterMismatchNoverLmax 0.1 --alignSJoverhangMin 8 \
--outFilterMultimapNmax 20 --outFilterMismatchNmax 50 --outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM NM MD jM jI XS MC --outFileNamePrefix star_aligns/${sra_acc}.
If I align the BGI-SEQ reads using minimap2
, I see all of them align. Almost all of the alignments cluster in a handful of regions suggesting that these reads are coming from only a few genes. Still, it is surprising to find all of them being skipped by STAR.
Hi @vkkodali
for reads longer than 300b you will need to use STARlong. I would also recommend to check the reads that were mapped by minimap2 but not STAR, to see if they have high error rate or short mapping length.
Cheers Alex
I am trying to align the reads from SRR9640925 to this tick genome. I generated a genome index as follows:
Then, I aligned the reads using STAR as follows:
My final log indicates that >35% of the reads were unmapped:
I aligned the paired-end reads separately using the same parameters and the results are not that much different. I still see ~30% of the reads being unmapped due to "too short" and ~17% of the reads unmapped due to "other" reason. I then extracted the reads that were not mapped by STAR and aligned ~60k paired-end reads using MagicBlast and found a large number of them to be aligning to rRNA genes without any mismatches. I am wondering why STAR failed to align these reads. Is there a specific parameter that needs to be changed such that these reads can be aligned?