dib-lab / 2020-paper-sourmash-gather

Here we describe an extension of MinHash that permits accurate compositional analysis of metagenomes with low memory and disk requirements.
https://dib-lab.github.io/2020-paper-sourmash-gather

explore david's concerns #18

Open ctb opened 3 years ago

ctb commented 3 years ago

@dkoslicki's comment on gather, from Luiz's thesis:

"I'm surprised this works, since back in 2015
(MetaPalette days) I found removing elements like this caused the
approach to fall apart when closely-related organisms are in the
metagenome."

dkoslicki commented 3 years ago

Contrived example where this would be the case: a "metagenome" containing two genomes that have high ANI. The hashing gets "unlucky" and the sketches for the two genomes are identical (or nearly so). Min-set-cov then predicts only a single genome.
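
To make the failure mode concrete, here is a toy sketch of a greedy min-set-cover over hash sets. It is not sourmash's gather implementation; the function name `greedy_gather`, the `min_overlap` parameter, and the tiny integer "sketches" are all made up for illustration. It only shows that when two references have identical sketches, the second one has nothing left to cover once the first is picked.

```python
def greedy_gather(query_hashes, reference_sketches, min_overlap=1):
    """Toy greedy min-set-cover (hypothetical helper, not the sourmash API).

    Repeatedly pick the reference that covers the most remaining query
    hashes, then remove the covered hashes from the query.
    """
    remaining = set(query_hashes)
    chosen = []
    while remaining:
        best_name, best_overlap = None, 0
        # Sorted iteration makes tie-breaking deterministic (first name wins).
        for name in sorted(reference_sketches):
            if name in chosen:
                continue
            overlap = len(remaining & reference_sketches[name])
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_name is None or best_overlap < min_overlap:
            break
        chosen.append(best_name)
        remaining -= reference_sketches[best_name]
    return chosen


# Made-up sketches: A and B are the "unlucky" pair with identical hashes.
genome_a = {1, 2, 3, 4, 5}
genome_b = {1, 2, 3, 4, 5}
genome_c = {10, 11, 12}
metagenome = genome_a | genome_b | genome_c

result = greedy_gather(
    metagenome, {"A": genome_a, "B": genome_b, "C": genome_c}
)
print(result)
```

Once A is chosen, B's overlap with the remaining hashes is zero, so B is never reported even though it is present in the "metagenome".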

E.g., two genomes with 99% ANI, each 4.5 Mbp long, are expected (95% confidence interval) to share between 80.8% and 81.1% of their 21-mers.
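
As a rough sanity check on those numbers, the midpoint can be reproduced under a simple assumption that is not stated in the comment: each base mutates independently at rate 1 - ANI, so a k-mer survives unmutated with probability ANI^k. The function name below is hypothetical, and the sketch does not attempt to reproduce the confidence interval, which also depends on genome length.

```python
def expected_shared_kmer_fraction(ani, k):
    """Expected fraction of k-mers left unmutated.

    Assumption (ours, for illustration): independent point mutations at
    rate (1 - ani), so all k bases of a k-mer survive with probability ani**k.
    """
    return ani ** k


frac = expected_shared_kmer_fraction(0.99, 21)
print(f"{frac:.3f}")  # roughly 0.81, inside the quoted 80.8%-81.1% range
```

So even at 99% ANI, about a fifth of the 21-mers are expected to differ, which is why identical sketches for the two genomes would take an "unlucky" draw.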