Closed deekshamisri closed 5 months ago
Hi,
Many thanks for your interest in Sargasso, and I hope we can get this working properly for you!
We don't actually do any trimming of the raw reads before passing them to STAR. However, when separating a pair of species which are relatively distantly related (such as mouse and human), I think that it might not be unusual to see a large number of reads in the STAR log files in the category "% of reads unmapped: too short" – for example, these might be human sequences for which only a very short subsequence is found to match the mouse genome, and so are completely rejected by STAR (or vice versa, for mouse sequences which fail to map to the human genome).
To help us work out if this is indeed the case, there are a couple of files (or their contents) that would be really helpful to look at, if you are willing to share:
mapped_reads
subdirectory of the Sargasso results directory, e.g. something like sample.mouse.log.out
sample.human.log.out
filtered_reads
sub-directory of the Sargasso results directory, the file overall_filtering_summary.txt
It would also be very helpful to see the Sargasso command line that you are running, if you are willing to share.
Many thanks!
Hi,
Thank you for responding. Sargasso does not trim reads, but when after it invokes STAR, does STAR trim reads before alignment and subsequent species assignment? When I looked up STAR log files format, it said the the % or number of reads unmapped due to being to short was due to the length of the reads and not that the length of alignment was too short to be mapped. I am asking this because I am getting very high % of unmapped reads (50-90%, especially to human genome). Attaching the files that you requested.
[overall_filtering_summary.txt] (https://github.com/biomedicalinformaticsgroup/Sargasso/files/14773611/overall_filtering_summary.txt)
For some reason I am not able to attach the .log.out files here so I am copying the contents of the human.log.out and mouse.log.out files here.
human.log.out
Started job on | Feb 17 17:29:24
Started mapping on | Feb 17 17:29:58
Finished on | Feb 17 17:47:27
Mapping speed, Million of reads per hour | 102.71
Number of input reads | 29929372
Average input read length | 100
UNIQUE READS:
Uniquely mapped reads number | 4770298
Uniquely mapped reads % | 15.94%
Average mapped length | 97.56
Number of splices: Total | 1577074
Number of splices: Annotated (sjdb) | 1569318
Number of splices: GT/AG | 1561523
Number of splices: GC/AG | 12609
Number of splices: AT/AC | 1643
Number of splices: Non-canonical | 1299
Mismatch rate per base, % | 3.56%
Deletion rate per base | 0.01%
Deletion average length | 1.71
Insertion rate per base | 0.01%
Insertion average length | 1.47
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 581514
% of reads mapped to multiple loci | 1.94%
Number of reads mapped to too many loci | 0
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 24576294 % of reads unmapped: too short | 82.11% Number of reads unmapped: other | 1266 % of reads unmapped: other | 0.00% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%
mouse.log.out
Started job on | Feb 17 03:32:25
Started mapping on | Feb 17 03:32:54
Finished on | Feb 17 03:42:08
Mapping speed, Million of reads per hour | 194.49
Number of input reads | 29929372
Average input read length | 100
UNIQUE READS:
Uniquely mapped reads number | 21838035
Uniquely mapped reads % | 72.97%
Average mapped length | 99.72
Number of splices: Total | 8100507
Number of splices: Annotated (sjdb) | 8072346
Number of splices: GT/AG | 8033879
Number of splices: GC/AG | 57864
Number of splices: AT/AC | 6020
Number of splices: Non-canonical | 2744
Mismatch rate per base, % | 0.17%
Deletion rate per base | 0.00%
Deletion average length | 1.40
Insertion rate per base | 0.00%
Insertion average length | 1.22
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1369194
% of reads mapped to multiple loci | 4.57%
Number of reads mapped to too many loci | 0
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0 % of reads unmapped: too many mismatches | 0.00% Number of reads unmapped: too short | 6702502 % of reads unmapped: too short | 22.39% Number of reads unmapped: other | 19641 % of reads unmapped: other | 0.07% CHIMERIC READS: Number of chimeric reads | 0 % of chimeric reads | 0.00%
sargasso call
species_separator rnaseq --reads-base-dir=/path/coculture/bulkRNAseq/Sargasso/ --best --num-threads=4 --run-separation manifest_mixed.tsv sargasso_res_mixed mouse /path/coculture/STAR/mouse/genome human /path/coculture/STAR/human/genome
Hi,
Many thanks for your quick reply and the helpful info. I feel that some of the STAR documentation is a little bit ambiguous about the meaning of reads unmapped due to being too short - my belief (though I could certainly be wrong!) is that reads which fail to meet the threshold of "--outFilterMatchNminOverLread" (default 0.66 - i.e. at least 2/3rds of bases must "match") will be counted as unmapped due to being short. I would expect this to be the case for a lot of, e.g., human reads when mapped to mouse.
As a comparison, I've been having a look at some of our own data which is a human/mouse mix. In the particular sample I'm looking at (according to the contents of overall_filtering_summary.txt
, where I am looking at the columns Assigned-Reads-mouse
and Assigned-Reads-human
), the ratio of mouse to human reads is about 4:1. In that same sample, examining the STAR log output, we are getting 71.6% reads uniquely mapped to mouse (Uniquely mapped reads %
) with 24.7% reads unmapped to mouse due to being too short (% of reads unmapped: too short
), with 29.3% reads uniquely mapped to human and 60.0% reads unmapped to human due to being too short.
I think that these percentages are not too different from yours, if the ratio of mouse to human reads in your data is a little higher (i.e. essentially the balance between mouse and human tissue/cells in your input material) - and it looks like this might be the case from your overall_filtering_summary.txt
.
So, I think that although the "% of reads unmapped: too short" figures do initially look worrying in your files, these are actually not too dissimilar from the numbers that we have previously seen when separating between mouse and human. So I think that your results may actually be fine – but please do let us know if you wish us to investigate further or have follow-up questions?
Hi Owen,
I apologize for the delay in my response. Thank you for clarifying. Is there any reason why a larger number of reads align to the mouse genome compared to the human genome?
Hi - I think that in the case of separation between mouse and human genomes, where the "quality" of each reference genome is roughly similar (i.e. they're both very good), and where the species are reasonably far apart evolutionarily, I would expect that the numbers of reads aligning to each genomes would mostly reflect the amount of material from each species that has been sequenced.
i.e. here there is a roughly 4.5:1 ratio of reads mapping to mouse compared to human, and so I might expect that there would have been a somewhat similar ratio of RNA sequenced, and the most likely explanation for this would simply be a (very roughly) similar imbalance in the number of mouse and human cells in your co-cultures. Does this sound plausible, or is that contrary to your observations of the co-cultures?
Hi Owwn, unfortunately we don't have cell proportion numbers in the co-cultures, which is why I was asking you if this kind of ratio is acceptable. Thank you for your help!
Yes, this kind of ratio is at the level that we've frequently observed in our own data – in fact I would say that we more often see an imbalance like this than we see more equal numbers aligning to the two species.
If it's ok then, I'll close this issue? But please feel free to reopen if you have further questions etc.
Hi,
I am trying to use Sargasso to separate some mouse-human co-cultured cells. There are a large % of unmapped (short) reads in the STAR output log files. Since I do not control the STAR call, I am not sure how to map the untrimmed or less trimmed reads in Sargasso?