dib-lab / 2020-paper-sourmash-gather

Here we describe an extension of MinHash that permits accurate compositional analysis of metagenomes with low memory and disk requirements.
https://dib-lab.github.io/2020-paper-sourmash-gather

Suggested edits to the gather paper, and a few inline comments #35

Closed: taylorreiter closed this issue 2 years ago

taylorreiter commented 2 years ago

I'm including suggested edits in this PR. Most are typos, some add a little clarification, and I added one or two comments inline. I'm also going to provide some comments here in this PR.

Overall, I really loved the title and the discussion of this paper. I think the set of results provided is the right set, and I like the organization of the results section. I think the paper is fairly lean, though. In particular, I think the introduction could use more contextualizing information, the results section should open with a summary paragraph (many of those details are already covered in the intro and could be moved here), and the discussion should also open with a summary paragraph. Lastly, I think that while there are enough references to sourmash and its capabilities, we don't reference specific commands (e.g., gather) enough, which obscures for a user how to actually use the tool.

I feel pretty strongly that the discussion needs a baby summary intro paragraph, but could be convinced that the intro and results don't need to change, depending on the intended audience of the paper. But without these things, the audience of the paper does not overlap with the intended users of the tools very much, which is a bit of a missed opportunity to put this tool into the hands of people who could use it most.

Other specific comments:

ctb commented 2 years ago

Thanks, @taylorreiter!

When I walked through the paper in lab, some of the same questions about target audience came up there as well (with more aggressive confusion - "who the bleep do you think can read this paper!?" 😆).

In response, I went and showed everyone the reviews from our F1000 Research paper.

F1000 Research reviews

Brad Solomon said:

Excluding the description of the modulo approach of sketch construction, the manuscript itself is technically sound. ...

The ‘modulo approach’ for sketch construction, despite being one of the main innovations of the method, is particularly unclear in the manuscript. The cited literature (Broder 1997) describes an approach that sub-samples hash values based on a modulo factor to address the inherent weakness of a MinHash in a mixture of several distinct components. However, the description of the sourmash implementation instead describes splitting the hash space into ‘equal bands’ and selecting only the minimum band. As the existing modulo approach has no guarantees on equal-sized (or even equal-fraction, as the manuscript claims elsewhere) sub-sampling, this appears to be a novel and significant contribution to the field. However, there are no details that explain (1) how the hash space is divided, (2) how the minimum band is selected, and (3) how downsampling is performed.

Sourmash 2.0 is motivated by “a particular focus towards enabling efficient containment queries using large databases”. However, the manuscript does not include any true comparison of sourmash’s performance against existing tools or alternative approaches, or benchmarking information for even conventionally sized datasets. This greatly limits the potential impact of sourmash, given that there are many competing sketch strategies and an even larger range of available implementations.

While it is unreasonable to expect a full review of the available methods, the inclusion of even a single ‘large-scale’ dataset in the test set or use cases would go a long way towards demonstrating the scalability of sourmash. Selecting a biologically relevant subset from a public genomic repository such as the NIH SRA, TCGA, or GTEx (to name just a few) would alleviate the need to host such a dataset while allowing large-scale reproducibility and benchmarking.
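
(An aside on Brad's three numbered questions, since they come up again below: here's a minimal, self-contained illustration of the kind of band-based subsampling being described. This is not the sourmash implementation - the hash function and parameter names are placeholders - but it shows the shape of the answers: the hash space is split by the integer threshold H/s, "selecting the minimum band" is just the h < H/s test, and downsampling to a larger s is a subset filter over hashes you already kept.)

```python
# Illustrative only -- not sourmash code. A 64-bit hash space is split into
# s equal bands and only hashes in the lowest band (h < H/s) are retained,
# so an expected 1/s of all distinct k-mers survive.
import hashlib

H = 2**64  # size of the hash space

def hash_kmer(kmer):
    """Placeholder 64-bit hash; sourmash itself uses a different function."""
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")

def scaled_sketch(sequence, ksize=31, s=1000):
    """Retain hashes falling in the lowest of s equal bands of [0, H)."""
    max_hash = H // s
    hashes = set()
    for i in range(len(sequence) - ksize + 1):
        h = hash_kmer(sequence[i:i + ksize])
        if h < max_hash:
            hashes.add(h)
    return hashes

def downsample(hashes, new_s):
    """Downsampling to a larger s is a subset operation on retained hashes."""
    return {h for h in hashes if h < H // new_s}
```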

Rayan Chikhi said:

The authors present sourmash 2, a tool that implements a novel combination of SBTs and MinHashes, which are both fascinating computational concepts; thus, their mix is quite an interesting one. Sourmash 2 enables large-scale sequence-vs-database similarity searches. The article offers a comprehensive guide for many of the software features, with biologically relevant scenarios. This is a useful contribution that is highly relevant to current needs in biology. There are a few technical issues with the current manuscript version that I list below. But otherwise, most of my remarks are for adding some extra perspective. I believe the manuscript can be approved after the technical fixes.

Major remarks:

A quick recap of the state of the art in containment search would be helpful. Here you claim to use ‘a modulo approach’. Mash screen and containment minhash use different approaches (see e.g. the blog post of ‘Mash screen’). It would be nice if, in this paper, the usage of the modulo approach were put into perspective compared to those two aforementioned methods. ... In fact, in the blog post cited as reference 8, Ondov writes that “the modulo approach is problematic for metagenomic applications (e.g. finding a virus in a metagenome).” The problem is indirectly mentioned in the manuscript (“can sacrifice some of the memory and storage benefits of standard MinHash techniques, as the signature size scales with the number of unique k-mers”). It would be neat to get the authors’ comparative perspective here as to why using modulo is the better approach. ... What are roughly the limits of similarity queries? E.g. sequences shorter than X or with identity below Y% have no chance of being reported.

...

The description of the modulo approach used is imprecise. How is the hash space divided into s equal ‘bands’ (undefined term), precisely? Also, I suppose this is somewhat different from the modulo approach proposed by Broder, and clarified in Mash screen’s blog post, but how so?

...

The concept of ‘hash subset retention’ is not well defined. I suppose it is the set of hashes that result from a MinHash computation.
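
(Another aside, on how the band selection differs from the modulo rule in Broder 1997 / the Mash screen blog post: as I understand it, the two retention rules differ only in which ~1/s slice of the hash space they keep, which is easiest to state in code. Again illustrative, not sourmash internals.)

```python
# Two retention rules that each keep an expected 1/s of distinct hashes
# (H is the size of the hash space; values here are illustrative).
def broder_mod_keep(h, s):
    """Broder-style modulo selection: keep h if it is divisible by s."""
    return h % s == 0

def lowest_band_keep(h, s, H=2**64):
    """Band selection: keep h if it falls in the lowest 1/s of [0, H)."""
    return h < H // s
```

One practical difference: a band-selected sketch built at scale s can be downsampled to any larger s' simply by dropping hashes (h < H/s' implies h < H/s), while modulo selection only composes when s' is a multiple of s; and because the band rule keeps the numerically smallest hashes, the retained set behaves like a bottom MinHash sketch whose size grows with the number of distinct k-mers.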
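(And on the "limits of similarity queries" question: containment is estimated from the retained hashes alone, so detection is ultimately limited by how many distinct k-mers the query contributes to the sketch. A rough illustration below - mine, not a bound from the paper.)

```python
def containment(query_hashes, ref_hashes):
    """Estimate |Q ∩ R| / |Q| from two sketches built with the same k-mer
    size and the same scaling factor s (illustrative, not sourmash code)."""
    if not query_hashes:
        return 0.0
    return len(query_hashes & ref_hashes) / len(query_hashes)
```

Back of the envelope: with s = 1000 a query retains about one hash per 1000 distinct k-mers, so a sequence with many fewer distinct k-mers than s will usually retain no hashes at all and effectively cannot be reported, regardless of its identity to the reference.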

back to this paper --

I think I would target this paper for people like Brad and Rayan - researchers who want to understand the underpinnings of what we have implemented with sourmash, why it works / with what bounds, and what the actual computer science points are.

happy to adjust, but that's the perspective I was coming from!

specific responses to taylor's comments above -

Lastly, I think that while there are enough references to sourmash and its capabilities, we don't reference specific commands (e.g., gather) enough, which obscures for a user how to actually use the tool.

I think this will be fixed by providing the commands in the Methods, although we can and should add notes like "implemented as sourmash prefetch" or whatnot in the results section.
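
As a concrete example of the kind of pointer I mean, we could pair the command names with a few lines of the Python API. The snippet below is written from memory, so treat the exact calls and file names as illustrative rather than something to paste into the paper unchecked:

```python
# Illustrative use of the sourmash Python API; file names are placeholders.
import sourmash

# signatures previously built with `sourmash sketch dna`
metagenome = sourmash.load_one_signature("metagenome.sig", ksize=31)
genome = sourmash.load_one_signature("ref-genome.sig", ksize=31)

# the kind of containment that `sourmash prefetch` / `sourmash gather`
# report per reference: how much of the genome's sketch is found in the
# metagenome's sketch
print(genome.minhash.contained_by(metagenome.minhash))
```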

But without these things, the audience of the paper does not overlap with the intended users of the tools very much, which is a bit of a missed opportunity to put this tool into the hands of people who could use it most.

I think that's an OK tradeoff, given that we have the F1000 Research paper (which we can update, or publish something new), the tutorials, many blog posts, and a large-ish user base already.