Closed: ctb closed this issue 3 years ago
> It also does a nice job of illustrating the challenges here b/c there are multiple overlapping species etc. and we can talk about the different numbers that we get out (containment, gather score/unique containment, etc.)
I love that this part would be explicitly addressed! I think that would be a really nice example to showcase that!
OK, I've got a prototype of this all working, here -
https://github.com/ctb/2020-podar-mapping-gather
The workflow does the following:
I'll run it on a bunch of podar genomes soon, but for the paper, I think we need to run it on a bigger database. Should we run it on refseq, or genbank, or ?
Maybe for verisimilitude and comparison we should run it on the same input databases as the mash screen folk did... or maybe not, since we're doing the checking with mapping etc? Dunno.
oh, and we should also consider measuring/evaluating the abundance correlation b/t sourmash gather and mapping.
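For the abundance-correlation check, here's a minimal sketch of what the comparison could look like, with a hand-rolled Spearman rank correlation; all genome names and abundance numbers below are hypothetical placeholders, not real results from gather or mapping.

```python
# Sketch: compare per-genome abundance estimates from sourmash gather
# against read-mapping-based estimates via Spearman rank correlation.
# All names and numbers are made up for illustration.

def ranks(values):
    # 1-based rank of each value; assumes no ties, for simplicity
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# hypothetical per-genome abundances from the two methods
gather_abund = {"genomeA": 120.0, "genomeB": 45.0, "genomeC": 8.0, "genomeD": 2.0}
mapping_cov  = {"genomeA": 110.0, "genomeB": 52.0, "genomeC": 9.0, "genomeD": 1.5}

shared = sorted(set(gather_abund) & set(mapping_cov))
rho = spearman([gather_abund[g] for g in shared],
               [mapping_cov[g] for g in shared])
print(f"Spearman rho across {len(shared)} genomes: {rho:.2f}")
```

Rank correlation (rather than Pearson) sidesteps the question of whether the two methods are on the same abundance scale.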
First run of the above workflow finished - files here, farm:/home/ctbrown/2020-podar
I think fig 4 and fig 5 are key. Curious what others think - are they understandable?
taylorreiter 11:39 AM fig 4 why is the y axis 2?
titus:speech_balloon: 11:39 AM 2e8 :+1: 1
taylorreiter 11:39 AM fig 5 yes — except those big spikes around 30 and 42 are confusion
titus:speech_balloon: 11:40 AM that’s where the genomes are not in the database
11:40 but yes, exactly
taylorreiter 11:40 AM oh cool :slightly_smiling_face:
titus:speech_balloon: 11:40 AM (genomes not in database => k-mer identification not great -> read mapping doesn’t work that well, I think?)
taylorreiter 11:40 AM ya! those are great!
titus:speech_balloon: 11:40 AM I didn’t do this against genbank, just the 64 known genomes :+1: 1
11:41 I’m getting the genbank results. It’s plausible that we’ll identify the closer-to-correct genome sequences now, with the increasing size.
11:41 of genbank, I mean.
whoops, turns out `intersect_mins` is computed against the original query, so it is not monotonically decreasing and does include overlaps with previously matched hashes. Need to recompute.
(however, conveniently, the difference between the two types will tell us when there are one or more strain variant matches)
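A toy illustration of the two counts, using plain Python sets in place of the real sourmash sketch internals; the genome names and hash values are all made up:

```python
# Toy model of the two intersection counts: each match's overlap with
# the ORIGINAL query vs. with the REMAINING (not-yet-assigned) hashes.
# Names and hash values are hypothetical, not real data.

query = {1, 2, 3, 4, 5, 6, 7, 8, 9}
matches = [  # hypothetical ordered gather results
    ("genomeA", {1, 2, 3, 4, 5}),
    ("genomeB", {6, 7, 8}),
    ("genomeC", {1, 2, 3, 4, 9}),  # strain variant overlapping genomeA
]

rows = []
remaining = set(query)
for name, hashes in matches:
    vs_original = len(hashes & query)       # includes overlap with earlier matches
    vs_remaining = len(hashes & remaining)  # unique containment; never increases
    rows.append((name, vs_original, vs_remaining))
    remaining -= hashes

for name, orig_ct, uniq_ct in rows:
    # a nonzero difference flags overlap with a previous match,
    # e.g. a strain variant
    print(name, orig_ct, uniq_ct, orig_ct - uniq_ct)
```

Note that the against-original counts (5, 3, 5) are not monotone, while the against-remaining counts (5, 3, 1) are, and the difference for genomeC exposes its overlap with genomeA - the strain-variant signal mentioned above.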
a big remaining question for this issue / this section of results: how do we show that we are getting the minimum set of genomes necessary for mapping?
one argument is that we are showing that all of our reads map to these genomes, successively and in decreasing numbers - that is, the unique k-mer containment for gather is monotonically decreasing, and the read mapping matches that pretty well. SO, at each stage, there is no other genome with more remaining hash matches than this one, and hence probably no genome to which more reads can be mapped.
there are two approximations involved in our implementation -
a related question/point to consider is whether in this case the greedy approximation is actually optimal, under the conditions that hold with genome databases. No idea! however, in our implementation, we focus on best containment and do not resolve between two matches with equal containment but different genome sizes, so we may not be optimal in that sense of the word.
to argue this from the other direction -
if there were genomes that had better k-mer containment at any stage of gather, gather would have found them first.
SO the practical question becomes, are we finding them?
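The greedy strategy under discussion can be sketched in a few lines. This is a toy model with made-up genome hash sets, not the real implementation, and it deliberately shows the tie-breaking caveat: on equal containment it just takes the first candidate, ignoring genome size.

```python
# Toy greedy min-set-cover in the gather style: at each step, take the
# genome with the best containment of the REMAINING query hashes.
# Genome contents are hypothetical; ties go to iteration order, not to
# the smaller genome -- the non-optimality caveat discussed above.

def greedy_gather(query, genomes):
    remaining = set(query)
    order = []
    while remaining:
        # best remaining containment; max() keeps the first of equals
        name, hashes = max(genomes.items(),
                           key=lambda kv: len(kv[1] & remaining))
        found = hashes & remaining
        if not found:
            break  # nothing left in the database covers the query
        order.append((name, len(found)))
        remaining -= found
    return order, remaining

genomes = {
    "big_genome":   {1, 2, 3, 4, 10, 11, 12},  # same query overlap, larger genome
    "small_genome": {1, 2, 3, 4},
    "other":        {5, 6},
}
order, leftover = greedy_gather({1, 2, 3, 4, 5, 6}, genomes)
print(order, leftover)
```

Here big_genome and small_genome tie on containment (4 hashes each) and big_genome wins only by dict order; the unique-match counts in `order` come out monotonically non-increasing, matching the argument above.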
result notebooks, for now -
Shakya et al. dataset - synthetic/mock community - permalink
probably want an environmental metagenome or two in there? 😬
duh, the other point is: we should consider how to recover the unmapped reads per https://github.com/dib-lab/genome-grist/issues/2, and show that they are not identifiable. (not sure how best to do this, though.)
One thing I think we could do, especially for sample p8808mo11, is map all reads against all R. gnavus isolates in genbank (there are something like 70 of them) and show that the one that gather returns as the first hit is the best-matching one. p8808mo11 is well suited for this because R. gnavus takes up ~15% of the reads.
ooh, nice idea! 👍
another thought: the database prefetch approach here does containment search against every genome in the database, so presents another way to check on gather vs containment of all. (I think it actually dovetails nicely with https://github.com/dib-lab/2020-paper-sourmash-gather/issues/3#issuecomment-717277777 above - it should return all R. gnavus isolates, which we can then map against.)
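That exhaustive containment scan could be sketched like so - plain sets standing in for sketches, a made-up threshold, and only one containment direction, so this is much simpler than what sourmash prefetch actually reports:

```python
# Toy "prefetch": containment search of every database genome against
# the query, keeping everything above a threshold. Genome names, hash
# values, and the threshold are all hypothetical.

def prefetch(query, database, threshold=0.1):
    hits = []
    for name, hashes in database.items():
        # fraction of this genome's hashes present in the metagenome query
        containment = len(hashes & query) / len(hashes)
        if containment >= threshold:
            hits.append((name, containment))
    return sorted(hits, key=lambda h: -h[1])

db = {
    "r_gnavus_1": {1, 2, 3, 4},
    "r_gnavus_2": {1, 2, 3, 9},   # a second, related isolate
    "unrelated":  {20, 21, 22, 23},
}
query_hashes = {1, 2, 3, 4, 5, 6, 7, 8}
hits = prefetch(query_hashes, db)
print(hits)
```

In this toy setup both R. gnavus-like entries survive the threshold and the unrelated genome drops out - the behavior that would let us collect all isolates and then map against them.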
moving to #15 now that we're focusing on min-set-cover.
In the Results section "Metagenome sketches can be accurately decomposed into constituent genomes by a greedy algorithm, 'gather'", we define and discuss the gather algorithm, but don't present results from it.
Results from gather are presented in the following section, on taxonomy, but that muddies the issue because now we're talking about more stuff (how to turn gather into taxonomy).
Specific ideas -
I think the podar data set is probably a good place to start, because we already discuss it above. It also does a nice job of illustrating the challenges here b/c there are multiple overlapping species etc. and we can talk about the different numbers that we get out (containment, gather score/unique containment, etc.)
The results benchmark I'm thinking of putting in would be something like:
Thoughts?