dib-lab / 2020-paper-sourmash-gather

Here we describe an extension of MinHash that permits accurate compositional analysis of metagenomes with low memory and disk requirements.
https://dib-lab.github.io/2020-paper-sourmash-gather

discussion of taylor's suggestions from #35 #36

Open ctb opened 2 years ago

ctb commented 2 years ago

All of the specific changes in #35 looked great, so I merged it! This issue tackles the larger content and style suggestions made in the PR description:

@taylorreiter suggests

I'm including suggested edits in this PR. Most are typos, some add a little clarification, and I added one or two comments inline. I'm also going to provide some comments here in this PR.

Overall, I really loved the title and the discussion of this paper. I think the set of results provided is the right set, and I like the organization of the results section. I think the paper is fairly lean, though. In particular, I think the introduction could use more contextualizing information, that the results section should have a summary first paragraph (many of the details of which are already covered in the intro, and could be moved here), and the discussion should have a summary first paragraph. Lastly, I think that while there are enough references to sourmash and its capabilities, we don't reference specific commands enough (e.g., gather), which obscures how a user would actually run the tool.

I feel pretty strongly that the discussion needs a baby summary intro paragraph, but could be convinced that the intro and results don't need to change, depending on the intended audience of the paper. But without these things, the audience of the paper does not overlap with the intended users of the tools very much, which is a bit of a missed opportunity to put this tool into the hands of people who could use it most.

Other specific comments:

  • Does Figure 1 need to be smaller? I think a coord flip would help it take up about half the space it currently does. The axis fonts are also very smol.

Note issue https://github.com/dib-lab/2020-paper-sourmash-gather/issues/39

  • For the sentence, "Note that in cases where equivalent matches are available at a particular rank, a match is chosen at random.": Should we mention that it's easy to return all matches (e.g. with search)? Or that the software is not algorithmically limited to doing this, and that it was just a design decision?
  • Is the database consistent through the results? I guess the date of the GenBank database will probably be added in the methods?
  • For the section, "Minimum metagenome covers provide representative genomes for mapping", have we/should we show that gather picks the genomes that map the most reads over all other closely related genomes in GenBank? Maybe not using e.g. Salmonella, but maybe with a genome that has a few hundred other genomes of the same species?
  • The italics on Scaled MinHash and MinHash are inconsistent. I think I fixed all of the Scaled MinHash ones, but I wasn't sure if MinHash was supposed to be italicized or not so I just left it.
  • Figure 5 is titled, "Hash-based decomposition of a metagenome...". Could we change it to "Hash-based k-mer decomposition of a metagenome..."?
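
To make the tie-breaking question above concrete, here is a minimal sketch of a greedy minimum-set-cover loop over plain Python sets of hashes. It is not sourmash's implementation: `greedy_gather`, the toy `database`, and the hash values are all hypothetical, chosen only to show that picking one match at random and reporting every equivalent match at a rank are both straightforward.

```python
import random


def greedy_gather(query, database, rng=None, threshold=1):
    """Toy greedy min-set-cover over sets of hashes (illustrative only).

    Returns one (chosen_name, tied_names, n_hashes_explained) tuple per rank.
    """
    rng = rng or random.Random()
    remaining = set(query)
    results = []
    while remaining:
        # Overlap of every candidate with the hashes not yet explained.
        overlaps = {name: len(remaining & hashes)
                    for name, hashes in database.items()}
        best = max(overlaps.values(), default=0)
        if best < threshold:
            break
        # All candidates that are equivalent at this rank.
        tied = sorted(name for name, n in overlaps.items() if n == best)
        # Design decision: report one of them, chosen at random.
        chosen = rng.choice(tied)
        results.append((chosen, tied, best))
        remaining -= database[chosen]
    return results


# Hypothetical data: genomeA and genomeB cover the same query hashes.
query = {1, 2, 3, 4, 5, 6, 7, 8}
database = {
    "genomeA": {1, 2, 3, 4, 5},
    "genomeB": {1, 2, 3, 4, 5, 99},
    "genomeC": {6, 7, 8},
}
results = greedy_gather(query, database, rng=random.Random(0))
for rank, (chosen, tied, n) in enumerate(results, start=1):
    print(rank, chosen, tied, n)
```

Because the full `tied` list is computed at every rank anyway, returning all equivalent matches would be a reporting choice rather than an algorithmic limitation, at least in this simplified model.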