High % of reads unmapped: too short in subset of samples- recommendations for troubleshooting?

jpropa25 commented 5 years ago

Alex,

Thanks again for your work on developing such a great tool and for your continuous support for those of us using it. I have an issue that seems like others have fairly regularly and I was wondering if you had any recommendations for it.

I did PE RNA-seq on 20 samples (each sample split to 2 lanes, so all documents you see here will refer to 40 samples or 80 samples for the PE fastq files). These are low-input libraries, so some difficulties are to be expected. These fastq files were trimmed for quality, to remove NovaSeq sequencing artifacts, and polyA tails. My sequences still have fairly long reads (average ~75). Here's where I encounter the problem. In about 22 of these 80 samples, I have a very large percentage of reads unmapped: too short, despite the high quality and length of the sequences. So question 1: do you have any guesses as to what might lead to this? Do you think it is possible that a contamination from a different species would lead to this?

Secondly, I have a question about how exactly the --outFilterScoreMinOverLread and --outFilterMatchNminOverLread work. For instance, if you are running these with a value of 0.6, does that mean that if your paired read length is 150, then it is dumping everything that is less than 150*0.6=90 length in alignment? I noticed in issue #169 you mentioned that you can reduce this number to 0.3- would that then make the minimum aligned length in my example 45? If I'm understanding this correctly, would you have a potential recommendation on what you would consider the shortest allowable read (i.e. what you would feel comfortable with in your analyses) and what would be your interpretation of >50% of the reads not aligning at least 60% of their read length to the reference genome?

I have attached here the log files for my genome generation, a multiQC file showing my FastQC reports for the trimmed fastq files that were used for alignment, a multiQC file showing my STAR alignment results, and an example log from my STAR alignment. Any input would be very much appreciated.

Best, Jim

Screen Shot 2019-09-06 at 9 26 19 AM Screen Shot 2019-09-06 at 9 25 50 AM Screen Shot 2019-09-06 at 9 28 11 AM

STAR_genome_generate.log 12_Lepr_plus_MPP_062119_S34_L003Log.final.out.log

alexdobin commented 5 years ago

Hi Jim,

your understanding of --outFilterScoreMinOverLread and --outFilterMatchNminOverLread parameters is correct. Note that the two filters are combined with AND logic, so the strictest filter wins. The recommended values are the default ones, 0.66, i.e. 2/3 of the read length are required to be mapped. Reducing this value will allow you to map reads that are presently "unmapped short", but it will come at the cost of an increase in false alignments, and also many short alignments will be multimappers.

Before you go that path, I would recommend a few more tests:

Map read1 and read2 separately?
Map without trimming, letting STAR trim as necessary. If 1 and/or 2 does not increase mapping rate significantly,
BLAST a few of unmapped reads to check for contamination.

Cheers Alex

jpropa25 commented 5 years ago

Alex,

Thanks for the quick response. I will run as you suggest and get back to you on the results.

pawelqs commented 5 years ago

I have similar problem as jpropa25. I have about 70-80% of samples uniquely mapped and 12-20% unmapped as to short. I have performed trimming using TrimGalore but usually only adapters has been trimmed. Quality trimming was rare and average input length seems to be ok. Without trimming results are significantly worse and very similar when only one file prom the pair was mapped.

Are these values ok? Or should look for way to improve it?

pawelqs commented 5 years ago

Unmapped: to short means that input read was too short, or its length aligned to the genome was too short?

alexdobin commented 5 years ago

Hi Paweł,

Unmapped: too short means the mapped length is too short, i.e. does not pass the mapping filter. 12% unmapped is pretty good, I would not worry too much about it - but I would BLAST a few unmapped reads to check for contaminations and other problems.

Cheers Alex

pawelqs commented 5 years ago

I did it. Here are the results from blastn. blastn results I used primary_assembly fasta files from Ensembl- it should contain all genome, so why most of the checked sequences seem to match ribosomal proteins?

alexdobin commented 5 years ago

Hi Paweł,

this is indeed unexpected. I guess a possible explanation that ribosomal protein genes have a lot of paralogs and pseudogenes, and some of those did not get into the assembly. A couple of more things to try for the same reads - BLASTn against just the mouth genome+transcriptome, and also BLAT against the mouse genome on UCSC browser. This will tell us how mappable these reads are against the genome assembly.

Cheers Alex

pawelqs commented 5 years ago

Hi Alex, Thank you for all advices. I will try to do that when I will find some time. Best, Paweł

alexdobin / STAR

High % of reads unmapped: too short in subset of samples- recommendations for troubleshooting? #731