dib-lab / 2020-paper-sourmash-gather

Here we describe an extension of MinHash that permits accurate compositional analysis of metagenomes with low memory and disk requirements.
https://dib-lab.github.io/2020-paper-sourmash-gather
Other
8 stars 1 forks source link

how do we show "minimal" or close-to-minimal? how do we measure approximation? #15

Open ctb opened 3 years ago

ctb commented 3 years ago

opening a new issue to focus on minimal wording...

related to #3, and this comment specifically. quotes from that issue:

algorithmic question: what am I missing (if anything) about the greedy solution to the min-set-cov problem? under what conditions would we not be looking at an optimal solution here?

one argument is that we are showing that all of our reads map to these genomes, successively and in decreasing numbers - that is, the unique k-mer containment for gather is monotonically decreasing, and the read mapping matches that pretty well. SO, at each stage, there is no genome that has more hash matches in it than this genome, and hence probably no genome to which more reads can be mapped.

there are two approximations involved in our implementation -

  • scaled minhash means that there may be genomes with more actual k-mers that match, but that is bounded, so it can't be too bad
  • k-mers do not equal mapping, although the observed correlation is pretty strong, and in two real metagenomes, it looks like mapping is more sensitive than hashes.

a related question/point to consider is whether in this case the greedy approximation is actually optimal, under the conditions that hold with genome databases. No idea! however, in our implementation, we focus on best containment and do not resolve between two matches with equal containment but different genome sizes, so we may not be optimal in that sense of the word.

taylor suggests mapping to all r. gnavus isolates.

ctb commented 3 years ago

a related way to frame the question: are there conditions specific to k-mers/genomes that would make the greedy approximation approach actually be optimal?