Too many unmapped short reads - how to map raw reads instead?

deekshamisri commented 5 months ago

Hi,

I am trying to use Sargasso to separate some mouse-human co-cultured cells. There are a large % of unmapped (short) reads in the STAR output log files. Since I do not control the STAR call, I am not sure how to map the untrimmed or less trimmed reads in Sargasso?

lweasel commented 5 months ago

Hi,

Many thanks for your interest in Sargasso, and I hope we can get this working properly for you!

We don't actually do any trimming of the raw reads before passing them to STAR. However, when separating a pair of species which are relatively distantly related (such as mouse and human), I think that it might not be unusual to see a large number of reads in the STAR log files in the category "% of reads unmapped: too short" – for example, these might be human sequences for which only a very short subsequence is found to match the mouse genome, and so are completely rejected by STAR (or vice versa, for mouse sequences which fail to map to the human genome).

To help us work out if this is indeed the case, there are a couple of files (or their contents) that would be really helpful to look at, if you are willing to share:

for a single sample, the STAR log file for mapping to the mouse - this should be in the mapped_reads subdirectory of the Sargasso results directory, e.g. something like sample.mouse.log.out
for the same sample, the STAR log file for mapping to human, something like sample.human.log.out
in the filtered_reads sub-directory of the Sargasso results directory, the file overall_filtering_summary.txt

It would also be very helpful to see the Sargasso command line that you are running, if you are willing to share.

Many thanks!

deekshamisri commented 5 months ago

Hi,

Thank you for responding. Sargasso does not trim reads, but when after it invokes STAR, does STAR trim reads before alignment and subsequent species assignment? When I looked up STAR log files format, it said the the % or number of reads unmapped due to being to short was due to the length of the reads and not that the length of alignment was too short to be mapped. I am asking this because I am getting very high % of unmapped reads (50-90%, especially to human genome). Attaching the files that you requested.

[overall_filtering_summary.txt] (https://github.com/biomedicalinformaticsgroup/Sargasso/files/14773611/overall_filtering_summary.txt)

For some reason I am not able to attach the .log.out files here so I am copying the contents of the human.log.out and mouse.log.out files here.

human.log.out Started job on | Feb 17 17:29:24 Started mapping on | Feb 17 17:29:58 Finished on | Feb 17 17:47:27 Mapping speed, Million of reads per hour | 102.71

                      Number of input reads |   29929372
                  Average input read length |   100
                                UNIQUE READS:
               Uniquely mapped reads number |   4770298
                    Uniquely mapped reads % |   15.94%
                      Average mapped length |   97.56
                   Number of splices: Total |   1577074
        Number of splices: Annotated (sjdb) |   1569318
                   Number of splices: GT/AG |   1561523
                   Number of splices: GC/AG |   12609
                   Number of splices: AT/AC |   1643
           Number of splices: Non-canonical |   1299
                  Mismatch rate per base, % |   3.56%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.71
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.47
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   581514
         % of reads mapped to multiple loci |   1.94%
    Number of reads mapped to too many loci |   0
         % of reads mapped to too many loci |   0.00%
                              UNMAPPED READS:

mouse.log.out Started job on | Feb 17 03:32:25 Started mapping on | Feb 17 03:32:54 Finished on | Feb 17 03:42:08 Mapping speed, Million of reads per hour | 194.49

                      Number of input reads |   29929372
                  Average input read length |   100
                                UNIQUE READS:
               Uniquely mapped reads number |   21838035
                    Uniquely mapped reads % |   72.97%
                      Average mapped length |   99.72
                   Number of splices: Total |   8100507
        Number of splices: Annotated (sjdb) |   8072346
                   Number of splices: GT/AG |   8033879
                   Number of splices: GC/AG |   57864
                   Number of splices: AT/AC |   6020
           Number of splices: Non-canonical |   2744
                  Mismatch rate per base, % |   0.17%
                     Deletion rate per base |   0.00%
                    Deletion average length |   1.40
                    Insertion rate per base |   0.00%
                   Insertion average length |   1.22
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   1369194
         % of reads mapped to multiple loci |   4.57%
    Number of reads mapped to too many loci |   0
         % of reads mapped to too many loci |   0.00%
                              UNMAPPED READS:

sargasso call species_separator rnaseq --reads-base-dir=/path/coculture/bulkRNAseq/Sargasso/ --best --num-threads=4 --run-separation manifest_mixed.tsv sargasso_res_mixed mouse /path/coculture/STAR/mouse/genome human /path/coculture/STAR/human/genome

lweasel commented 5 months ago

Hi,

Many thanks for your quick reply and the helpful info. I feel that some of the STAR documentation is a little bit ambiguous about the meaning of reads unmapped due to being too short - my belief (though I could certainly be wrong!) is that reads which fail to meet the threshold of "--outFilterMatchNminOverLread" (default 0.66 - i.e. at least 2/3rds of bases must "match") will be counted as unmapped due to being short. I would expect this to be the case for a lot of, e.g., human reads when mapped to mouse.

As a comparison, I've been having a look at some of our own data which is a human/mouse mix. In the particular sample I'm looking at (according to the contents of overall_filtering_summary.txt, where I am looking at the columns Assigned-Reads-mouse and Assigned-Reads-human), the ratio of mouse to human reads is about 4:1. In that same sample, examining the STAR log output, we are getting 71.6% reads uniquely mapped to mouse (Uniquely mapped reads %) with 24.7% reads unmapped to mouse due to being too short (% of reads unmapped: too short), with 29.3% reads uniquely mapped to human and 60.0% reads unmapped to human due to being too short.

I think that these percentages are not too different from yours, if the ratio of mouse to human reads in your data is a little higher (i.e. essentially the balance between mouse and human tissue/cells in your input material) - and it looks like this might be the case from your overall_filtering_summary.txt.

So, I think that although the "% of reads unmapped: too short" figures do initially look worrying in your files, these are actually not too dissimilar from the numbers that we have previously seen when separating between mouse and human. So I think that your results may actually be fine – but please do let us know if you wish us to investigate further or have follow-up questions?

deekshamisri commented 5 months ago

Hi Owen,

I apologize for the delay in my response. Thank you for clarifying. Is there any reason why a larger number of reads align to the mouse genome compared to the human genome?

lweasel commented 5 months ago

Hi - I think that in the case of separation between mouse and human genomes, where the "quality" of each reference genome is roughly similar (i.e. they're both very good), and where the species are reasonably far apart evolutionarily, I would expect that the numbers of reads aligning to each genomes would mostly reflect the amount of material from each species that has been sequenced.

i.e. here there is a roughly 4.5:1 ratio of reads mapping to mouse compared to human, and so I might expect that there would have been a somewhat similar ratio of RNA sequenced, and the most likely explanation for this would simply be a (very roughly) similar imbalance in the number of mouse and human cells in your co-cultures. Does this sound plausible, or is that contrary to your observations of the co-cultures?

deekshamisri commented 5 months ago

Hi Owwn, unfortunately we don't have cell proportion numbers in the co-cultures, which is why I was asking you if this kind of ratio is acceptable. Thank you for your help!

lweasel commented 5 months ago

Yes, this kind of ratio is at the level that we've frequently observed in our own data – in fact I would say that we more often see an imbalance like this than we see more equal numbers aligning to the two species.

If it's ok then, I'll close this issue? But please feel free to reopen if you have further questions etc.

biomedicalinformaticsgroup / Sargasso

Too many unmapped short reads - how to map raw reads instead? #111