alexdobin / STAR

RNA-seq aligner
MIT License
1.84k stars 505 forks source link

unmapped reads % high; varies when partially mapped #1822

Open Emmashipman opened 1 year ago

Emmashipman commented 1 year ago

I’m working on using STAR to map my transcript reads to the tetraploid wheat genome. I’m getting some errors I don’t understand.

I am currently using the usegalaxy server as my university server is down for maintenance: not sure if that would be relevant.

Here’s what happens: I do a test run of STAR using one paired-end data set: two files that I put through fastp. I use the entire A and B genomes of wheat from ensembl, which is 15 files, and the GTF for wheat. I leave things at the defaults.

STAR maps each chromosome scaffold separately (not sure if this will be the case if I was not using galaxy and indexed the genome differently, or if I should put it all onto one big fasta file. I suspect it does not matter for my true problem) and gives mapped.bam, log, splice junction, etc. files for each.

When I look at the logs I can see that for each chromosome scaffold, I get this: “% of reads unmapped: too short | 99.63%” or some number close to it.

I also had several errors, that suggested my pre-indexing suffix array was too high at 14, and should be set to 13. I don’t understand this value to be honest, but I obeyed the error message.

I redid, with only 1 chromosome selected this time to save time in a test, and set this value to 13.

I still had a really high % of unmapped reads, about 85%. However, this made a little bit more sense, since I was only using one chromosome: I could expect that a large # of the reads in my sample would be mapped to other chromosomes: in fact, 13% mapped, and 7x13 = 91, about 100%, accounting for differences in chromosome size, errors, it seemed to make sense. Great, I thought I fixed my problem.

I run again the paired-end sample with the entire genome, and the pre-indexing suffix array at 13. This time, I get no errors, but on all the log files, the same problem, of having 99% reads unmapped due to being too short.

I don’t think it’s an issue with my pre-processing software, since the base quality was pretty good and the reads are a normal length.

Here’s an example output: Started job on | Apr 07 21:45:50 Started mapping on | Apr 07 21:45:52 Finished on | Apr 08 00:25:17 Mapping speed, Million of reads per hour | 14.14 Number of input reads | 37559112 Average input read length | 297 UNIQUE READS: Uniquely mapped reads number | 6315 Uniquely mapped reads % | 0.02% Average mapped length | 259.20 Number of splices: Total | 557 Number of splices: Annotated (sjdb) | 221 Number of splices: GT/AG | 338 Number of splices: GC/AG | 13 Number of splices: AT/AC | 0 Number of splices: Non-canonical | 206 Mismatch rate per base, % | 3.81% Deletion rate per base | 0.03% Deletion average length | 2.11 Insertion rate per base | 0.09% Insertion average length | 2.71 MULTI-MAPPING READS: Number of reads mapped to multiple loci | 129638 % of reads mapped to multiple loci | 0.35% Number of reads mapped to too many loci | 892 % of reads mapped to too many loci | 0.00% UNMAPPED READS: Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 37421763 % of reads unmapped: too short | 99.63% Number of reads unmapped: other | 504 % of reads unmapped: other | 0.00% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%

You can see the uniquely mapped reads are at a good length (I assume: I am a novice).

Is this a problem with my genome index? Should I combine the reference genome files into one large file and try again? Should I run each chromosome independently (!!!) since the genome is so large, it could be causing some problem? Or maybe it is my pre-processing software? I read some answers from similar issues, about the different reads not being paired right, and will run Read1 alone in another test, but I don't see why there should be such variation when I run the entire genome, vs. one chromosome.

Thanks for any help.

update: I did finish the mapping of just one read, with the whole genome, instead of the paired end, and got results similar to when I did the paired end read to one chromosome. Still unsure why I would get such a different result when only running STAR to map the paired-end with one chromosome, however.

alexdobin commented 1 year ago

Hi @Emmashipman

The best approach is to map to the whole genome at once. If you map to each chromosome separately, you would need to combine the resulting BAMs and resolve reads that map to different chromosomes. As long as the whole genome index fits in RAM, the mapping time speed will be much faster than when mapping to each chromosome separately.

If Read1 and Read2 mapped separately produce a good mapping rate to the whole genome, then the issue is likely with pre-processing software, which might have screwed the order of reads in Read1 and Read2. I would recommend mapping unprocessed reads to check that.

Emmashipman commented 1 year ago

Thank you so much @alexdobin . I will definitely combine to one genome file. I am working on wheat so I may try with the A and B genomes separately, in case a really high amount of reads map to both versions of a chromosome. This is very helpful and I will also check with the unprocessed reads. Thanks so much for your time.