Open mholtz2 opened 1 year ago
Hi @mholtz2
Reduction of mapping filter stringency indeed results in an increase in multi-mapping as well as an increase in false alignments. This is the usual trade-off between sensitivity and precision. I generally recommend against it. The best approach is to figure out the cause(s) for reads not mapping. One possibility for relatively long reads that you have is that they contain long adapter sequences - then the solution will be to trim the adapters.
Hi Alex,
Thanks for your reply. I tried trimming adaptors with Trim Galore! and here are the stats now (same sample):
Started job on | Apr 13 10:48:48
Started mapping on | Apr 13 10:49:05
Finished on | Apr 13 11:03:26
Mapping speed, Million of reads per hour | 408.16
Number of input reads | 97618581
Average input read length | 271
UNIQUE READS:
Uniquely mapped reads number | 74186546
Uniquely mapped reads % | 76.00%
Average mapped length | 268.00
Number of splices: Total | 48105182
Number of splices: Annotated (sjdb) | 47572212
Number of splices: GT/AG | 47528183
Number of splices: GC/AG | 404568
Number of splices: AT/AC | 57554
Number of splices: Non-canonical | 114877
Mismatch rate per base, % | 0.41%
Deletion rate per base | 0.01%
Deletion average length | 2.01
Insertion rate per base | 0.06%
Insertion average length | 4.76
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 8440337
% of reads mapped to multiple loci | 8.65%
Number of reads mapped to too many loci | 129707
% of reads mapped to too many loci | 0.13%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 14769194
% of reads unmapped: too short | 15.13%
Number of reads unmapped: other | 92797
% of reads unmapped: other | 0.10%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Overall, the Uniquely mapped reads % went up by 10-20% across all my samples. Some are still around 60%. Do you think this is sufficient?
Hi @mholtz2
This looks quite good now. The best samples have <5% unmapped reads, so 15% is still a bit high. You can try explore the reads that did not map - remap them with BLAST.
Hi Alex, I'm not familiar with using blast on the command line and it wasn't taking the unmapped reads file very well, so I just looked at some on the website. Most of the reads don't map to any gene in mouse (which my samples are from), and if they do it's with a fairly low score. Should I assume that most of the reads are contaminant genes?
Hi @mholtz2
I did not mean for you to use BLAST from command line, it's actually quite cumbersome. But you can map a few reads on the BLAST website, to the "nt" database to see what kind of contamination you may be getting.
I'm concerned about the quality of my aligned reads, and additionally the method I found in other issues for possibly fixing it. I would like confirmation that I made the right choice based on the statistics shown for my data, and if not, any suggestions for other changes I should make.
I ran STAR alignment for my RNA-seq reads after building an index with the GRCm39 ensembl genome and corresponding GTF file:
Then I mapped reads with this code (in a loop):
Which gave me output log info as such (similar stats across most of my samples):
I saw on other issues where people also had low % of Uniquely mapped reads and most of the unmapped reads being "too short", adding the parameters --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 helped map more reads. I added these to the code above and got stats like this:
Although I successfully increased the number of mapped genes, I see that the % of reads mapped to multiple loci also increased. I am fairly new to bioinformatics so any help in interpreting my original results and understanding if the change in parameters I made was the right decision. Thanks in advance!