Closed: ctb closed this issue 3 years ago
> It also does a nice job of illustrating the challenges here b/c there are multiple overlapping species etc. and we can talk about the different numbers that we get out (containment, gather score/unique containment, etc.)
I love that this part would be explicitly addressed! I think that would be a really nice example to showcase that!
OK, I've got a prototype of this all working, here -
https://github.com/ctb/2020-podar-mapping-gather
The workflow does the following:
I'll run it on a bunch of podar genomes soon, but for the paper, I think we need to run it on a bigger database. Should we run it on refseq, or genbank, or ?
Maybe for verisimilitude and comparison we should run it on the same input databases as the mash screen folk did... or maybe not, since we're doing the checking with mapping etc? Dunno.
oh, and we should also consider measuring/evaluating the abundance correlation b/t sourmash gather and mapping.
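For the abundance-correlation check, here's a minimal sketch of what the comparison could look like, with a hand-rolled Spearman rank correlation; all genome names and abundance numbers below are hypothetical placeholders, not real results from gather or mapping.

```python
# Sketch: compare per-genome abundance estimates from sourmash gather
# against read-mapping-based estimates via Spearman rank correlation.
# All names and numbers are made up for illustration.

def ranks(values):
    # 1-based rank of each value; assumes no ties, for simplicity
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# hypothetical per-genome abundances from the two methods
gather_abund = {"genomeA": 120.0, "genomeB": 45.0, "genomeC": 8.0, "genomeD": 2.0}
mapping_cov  = {"genomeA": 110.0, "genomeB": 52.0, "genomeC": 9.0, "genomeD": 1.5}

shared = sorted(set(gather_abund) & set(mapping_cov))
rho = spearman([gather_abund[g] for g in shared],
               [mapping_cov[g] for g in shared])
print(f"Spearman rho across {len(shared)} genomes: {rho:.2f}")
```

Rank correlation (rather than Pearson) sidesteps the question of whether the two methods are on the same abundance scale.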
First run of the above workflow finished - files here, farm:/home/ctbrown/2020-podar
I think fig 4 and fig 5 are key. Curious what others think - are they understandable?
taylorreiter 11:39 AM fig 4 why is the y axis 2?
titus:speech_balloon: 11:39 AM 2e8 :+1: 1
taylorreiter 11:39 AM fig 5 yes — except those big spikes around 30 and 42 are confusion
titus:speech_balloon: 11:40 AM that’s where the genomes are not in the database
11:40 but yes, exactly
taylorreiter 11:40 AM oh cool :slightly_smiling_face:
titus:speech_balloon: 11:40 AM (genomes not in database => k-mer identification not great -> read mapping doesn’t work that well, I think?)
taylorreiter 11:40 AM ya! those are great!
titus:speech_balloon: 11:40 AM I didn’t do this against genbank, just the 64 known genomes :+1: 1
11:41 I’m getting the genbank results. It’s plausible that we’ll identify the closer-to-correct genome sequences now, with the increasing size.
11:41 of genbank, I mean.
whoops, turns out `intersect_mins` is computed against the original query, so it is not monotonically decreasing and does include overlaps with previously matched hashes. Need to recompute.
(however, conveniently, the difference between the two types will tell us when there are one or more strain variant matches)
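A toy illustration of the two counts, using plain Python sets in place of the real sourmash sketch internals; the genome names and hash values are all made up:

```python
# Toy model of the two intersection counts: each match's overlap with
# the ORIGINAL query vs. with the REMAINING (not-yet-assigned) hashes.
# Names and hash values are hypothetical, not real data.

query = {1, 2, 3, 4, 5, 6, 7, 8, 9}
matches = [  # hypothetical ordered gather results
    ("genomeA", {1, 2, 3, 4, 5}),
    ("genomeB", {6, 7, 8}),
    ("genomeC", {1, 2, 3, 4, 9}),  # strain variant overlapping genomeA
]

rows = []
remaining = set(query)
for name, hashes in matches:
    vs_original = len(hashes & query)       # includes overlap with earlier matches
    vs_remaining = len(hashes & remaining)  # unique containment; never increases
    rows.append((name, vs_original, vs_remaining))
    remaining -= hashes

for name, orig_ct, uniq_ct in rows:
    # a nonzero difference flags overlap with a previous match,
    # e.g. a strain variant
    print(name, orig_ct, uniq_ct, orig_ct - uniq_ct)
```

Note that the against-original counts (5, 3, 5) are not monotone, while the against-remaining counts (5, 3, 1) are, and the difference for genomeC exposes its overlap with genomeA - the strain-variant signal mentioned above.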
a big remaining question for this issue / this section of results: how do we show that we are getting the minimum set of genomes necessary for mapping?
one argument is that we are showing that all of our reads map to these genomes, successively and in decreasing numbers - that is, the unique k-mer containment for gather is monotonically decreasing, and the read mapping matches that pretty well. SO, at each stage, there is no other genome with more remaining hash matches than this one, and hence probably no genome to which more reads can be mapped.
there are two approximations involved in our implementation -
a related question/point to consider is whether in this case the greedy approximation is actually optimal, under the conditions that hold with genome databases. No idea! however, in our implementation, we focus on best containment and do not resolve between two matches with equal containment but different genome sizes, so we may not be optimal in that sense of the word.
to argue this from the other direction -
if there were genomes that had better k-mer containment at any stage of gather, gather would have found them first.
SO the practical question becomes, are we finding them?
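The greedy strategy under discussion can be sketched in a few lines. This is a toy model with made-up genome hash sets, not the real implementation, and it deliberately shows the tie-breaking caveat: on equal containment it just takes the first candidate, ignoring genome size.

```python
# Toy greedy min-set-cover in the gather style: at each step, take the
# genome with the best containment of the REMAINING query hashes.
# Genome contents are hypothetical; ties go to iteration order, not to
# the smaller genome -- the non-optimality caveat discussed above.

def greedy_gather(query, genomes):
    remaining = set(query)
    order = []
    while remaining:
        # best remaining containment; max() keeps the first of equals
        name, hashes = max(genomes.items(),
                           key=lambda kv: len(kv[1] & remaining))
        found = hashes & remaining
        if not found:
            break  # nothing left in the database covers the query
        order.append((name, len(found)))
        remaining -= found
    return order, remaining

genomes = {
    "big_genome":   {1, 2, 3, 4, 10, 11, 12},  # same query overlap, larger genome
    "small_genome": {1, 2, 3, 4},
    "other":        {5, 6},
}
order, leftover = greedy_gather({1, 2, 3, 4, 5, 6}, genomes)
print(order, leftover)
```

Here big_genome and small_genome tie on containment (4 hashes each) and big_genome wins only by dict order; the unique-match counts in `order` come out monotonically non-increasing, matching the argument above.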
result notebooks, for now -
Shakya et al. dataset - synthetic/mock community - permalink
probably want an environmental metagenome or two in there? 😬
duh, the other point is: we should consider how to recover the unmapped reads per https://github.com/dib-lab/genome-grist/issues/2, and show that they are not identifiable. (not sure how best to do this, though.)
One thing I think we could do, especially for sample p8808mo11, is map all reads against all R. gnavus isolates in genbank (there are something like 70 of them) and show that the one that gather returns as the first hit is the best-matching one. p8808mo11 is well suited for this because R. gnavus takes up ~15% of the reads.
ooh, nice idea! 👍
another thought: the database prefetch approach here does containment search against every genome in the database, so presents another way to check on gather vs containment of all. (I think it actually dovetails nicely with https://github.com/dib-lab/2020-paper-sourmash-gather/issues/3#issuecomment-717277777 above - it should return all R. gnavus isolates, which we can then map against.)
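That exhaustive containment scan could be sketched like so - plain sets standing in for sketches, a made-up threshold, and only one containment direction, so this is much simpler than what sourmash prefetch actually reports:

```python
# Toy "prefetch": containment search of every database genome against
# the query, keeping everything above a threshold. Genome names, hash
# values, and the threshold are all hypothetical.

def prefetch(query, database, threshold=0.1):
    hits = []
    for name, hashes in database.items():
        # fraction of this genome's hashes present in the metagenome query
        containment = len(hashes & query) / len(hashes)
        if containment >= threshold:
            hits.append((name, containment))
    return sorted(hits, key=lambda h: -h[1])

db = {
    "r_gnavus_1": {1, 2, 3, 4},
    "r_gnavus_2": {1, 2, 3, 9},   # a second, related isolate
    "unrelated":  {20, 21, 22, 23},
}
query_hashes = {1, 2, 3, 4, 5, 6, 7, 8}
hits = prefetch(query_hashes, db)
print(hits)
```

In this toy setup both R. gnavus-like entries survive the threshold and the unrelated genome drops out - the behavior that would let us collect all isolates and then map against them.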
moving to #15 now that we're focusing on min-set-cover.
In the Results section "Metagenome sketches can be accurately decomposed into constituent genomes by a greedy algorithm, 'gather'", we define and discuss the gather algorithm, but don't present results from it.
Results from gather are presented in the following section, on taxonomy, but that muddies the issue because now we're talking about more stuff (how to turn gather into taxonomy).
Specific ideas -
I think the podar data set is probably a good place to start, because we already discuss it above. It also does a nice job of illustrating the challenges here b/c there are multiple overlapping species etc. and we can talk about the different numbers that we get out (containment, gather score/unique containment, etc.)
The results benchmark I'm thinking of putting in would be something like:
Thoughts?