ctb commented 2 years ago

comments on your first draft!

I really like this first draft! The results are compelling, and (I am probably biased here) the argument is pretty clear!

(The check boxes are for marking an item once you have considered it; it does not need to be resolved the way I suggested :)

big stuff

[x] I think you should evaluate containment/similarity of gene-based content within k-mer neighborhoods. It is not immediately obvious to me whether you can do this with the pangenomes (I think you can - core / shell / cloud?). You might need to do this with the metagenomes, tho, which would necessitate calculating the gene-based numbers on those, too.
[x] is there a way to link content to function with the k-mer based approaches? (I know, may be out of scope for this paper :)
[x] suggest/request that the papers and pipelines be moved into dib-lab/ org at some point :). Also, zenodo enable.
[x] it actually took me a little while to understand the full flow - that orpheum is needed so you don't need to do six-frame translation. Maybe explicitly mention this somewhere - that without orpheum, you'd need to do six-frame translation and this would inflate the number of k-mers. Maybe it belongs in intro? Definitely belongs in the relevant results section.
[x] it might be good (somewhere) to talk about how working with reads is better than working with cDBGs, because in regions of high error / high variation, the cDBG nodes are often shorter than reads.
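
To make the six-frame translation point concrete, here is a minimal sketch of why translating every frame inflates the protein k-mer count compared with keeping only the coding frame. It is an illustration only, not orpheum's or sourmash's implementation; the read sequence, the k-mer size of 5, and all function names are made up for the example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame):
    """Translate one forward frame (0, 1, or 2); stop codons become '*'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frame_translations(dna):
    """Translate three frames on each strand, forward strand first."""
    rc = reverse_complement(dna)
    return [translate(strand, frame) for strand in (dna, rc) for frame in range(3)]


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Hypothetical 30 bp read whose forward frame 0 is the coding frame.
read = "ATGGCTAAAGGTGAACTGCGTAAATTCGAT"
k = 5  # illustrative amino acid k-mer size

coding_kmers = kmers(translate(read, 0), k)
all_kmers = set()
for peptide in six_frame_translations(read):
    all_kmers |= kmers(peptide, k)

print(len(coding_kmers), "k-mers from the coding frame alone")
print(len(all_kmers), "k-mers across all six frames")
```

Only the k-mers from the true coding frame reflect real protein content; the rest come from frames that are never translated in the cell, which is the inflation that predicting the coding frame first avoids.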

small stuff

[x] I like the title "Protein k-mers enable assembly-free microbial metapangenomics"!

Pangenomes comprise all genes found within a group of organisms

[x] Is this really the definition of pangenomes? (I'm not sure. It struck me by surprise, is all.)

Reduced alphabet k-mers accurately estimate microbial pangenomes

[x] Maybe: "accurately estimate the size of", or "characteristics of"? or "size and content of", if you add that?

pangenome size

[x] suggest adding median, in addition to mean

I updated the third-base pair wobble sentence to read:

reduced alphabet k-mer is sufficient to overcome minor variations such as those introduced by codon degeneracy or evolutionary drift

but I am actually thinking that we should remove codon degeneracy because all three of these encodings explicitly ignore codon degeneracy!

[x] think about removing codon degeneracy here

[x] RELATED QUESTION: are you running this on the proteome or the genome? (If the proteome, then codon degeneracy is definitely not an issue :)
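
A small sketch of the distinction being discussed: synonymous codons already disappear at translation, so any amino-acid-level encoding ignores codon degeneracy, while a reduced alphabet additionally collapses conservative amino acid substitutions. The Dayhoff grouping is used here purely as an example of a reduced alphabet (the thread does not list which three encodings are meant), and the sequences and function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Dayhoff six-letter reduced alphabet: amino acids grouped by similar properties.
DAYHOFF_GROUPS = [("C", "a"), ("AGPST", "b"), ("DENQ", "c"),
                  ("HKR", "d"), ("ILMV", "e"), ("FWY", "f")]
DAYHOFF = {aa: code for letters, code in DAYHOFF_GROUPS for aa in letters}


def translate(dna):
    """Translate an in-frame coding sequence into amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def dayhoff_encode(protein):
    """Map each amino acid to its Dayhoff group letter."""
    return "".join(DAYHOFF[aa] for aa in protein)


# Two hypothetical DNA sequences that differ only by synonymous codons.
dna_a = "CTGAAAGAT"
dna_b = "TTAAAGGAC"
print(translate(dna_a), translate(dna_b))  # identical protein strings

# Two hypothetical proteins that differ by conservative substitutions.
print(dayhoff_encode("LKD"), dayhoff_encode("IRE"))  # identical reduced strings
```

In other words, codon degeneracy is absorbed by translation itself, and the reduced alphabet is what absorbs amino-acid-level drift.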

Do I need to discuss scaled at all?

[x] No, the gather paper should be enough.

Should I compare this against nucleotide k-mers at all?

[x] Two thoughts (that conflict :).

First, you could simply address it with citations of some kind. Tessa's paper might be enough! And/or the kraken paper, or other papers that Tessa can suggest.

Second, you could try to drive the point home by doing a nucleotide comparison. The simplest would be to show that these correlations do not hold, I guess? Or are worse? But ... dunno. Might be good to think about and/or explore, and see how clear you can make the results.

[x] Figure 2: suggest changing 'alphabet' to 'encoding' on the legend, and maybe adding 'aa' to 'k-mer size' - e.g. 'k-mer size (aa)' or 'aa k-mer size'
[x] Figure 3: 'exlcuded' in panel C -> excluded
[x] Figure 3: seems like panel A and B could be one plot?
[x] Heaps law: citation?

We next investigated whether other pangenome metrics were well correlated between k-mer-based and gene-based methods

[x] here I would at least mention prokka and roary with a "See Methods for details". Jumping to the methods to make sure people used vaguely correct software is always annoying as a reader :)

Title of results section:

Jaccard containment between reduced alphabet k-mers and k-mers in databases accurately predicts open reading frames in short sequencing reads

[x] Maybe: "k-mer methods accurately predict open reading frames in short sequencing reads"
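
For readers less familiar with the terms in that section title, here is a minimal sketch of Jaccard similarity versus containment on k-mer sets, and of using containment against a reference set to flag a translated read as likely coding. This illustrates the general idea only and is not orpheum's implementation; the sequences, the k-mer size, and the threshold value are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Symmetric overlap: shared k-mers divided by all k-mers in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, reference):
    """Asymmetric overlap: fraction of the query's k-mers found in the reference."""
    return len(query & reference) / len(query) if query else 0.0


K = 5            # illustrative amino acid k-mer size
THRESHOLD = 0.5  # hypothetical cutoff for calling a frame "coding"

# Hypothetical reference database built from one known protein.
reference = kmers("MAKGELRKFD", K)

# Hypothetical translated read frames: one matching the protein, one not.
coding_frame = kmers("KGELRK", K)
noncoding_frame = kmers("WLKVNC", K)

for name, frame in [("coding", coding_frame), ("non-coding", noncoding_frame)]:
    c = containment(frame, reference)
    label = "coding" if c >= THRESHOLD else "non-coding"
    print(name, "frame: containment", c, "->", label)

print("Jaccard of coding frame vs reference:", jaccard(coding_frame, reference))
```

Containment stays high for a short read fully covered by a much larger reference, whereas Jaccard similarity is pulled down by the size difference, which is why containment is the more natural measure when the query is a read and the reference is a database.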

Using default parameters, orpheum accurately separated coding from non-coding reads when reads were simulated from genomes in GTDB

[x] For this paragraph, I feel like some minimal discussion of the numbers in the text itself is needed / would be useful.

[x] Figure 5 - had a really hard time seeing the "Coding" and "Non-coding" headers. Bold? Make bigger? Something?
[x] Figure 6 - I love the content, but the whitespace on the top graph is ...annoying. I wonder if there is a way to combine the top and bottom graphs; maybe large colored marks on the x axis of the bottom graph for when abx were administered?
[x] Figure 7 - can you add guide markers or guide arrows for the points you are making in the text about P. vulgatus, etc?

Discussion:

We show that pangenome metrics like core, cloud, and shell pangenome fractions can be accurately estimated ... with k-mers from other reduced alphabets.

[x] This is an inference, right? Might want to make that clear :)

taylorreiter commented 2 years ago

Merging for now to pull in these changes, but I'll keep working through the rest of your comments!

taylorreiter commented 2 years ago

RELATED QUESTION: are you running this on the proteome or the genome? (If the proteome, then codon degeneracy is definitely not an issue :)

In the first section of the paper, I'm using the proteomes. In the second two sections, I use orpheum to get the "proteomes", and then run it from there. Not sure how/if to make this more clear?

Pangenomes comprise all genes found within a group of organisms

Is this really the definition of pangenomes? (I'm not sure. It struck me by surprise, is all.)

Tis. I have a good reference for it that I'll be sure to add.

Second, you could try to drive the point home by doing a nucleotide comparison. The simplest would be to show that these correlations do not hold, I guess? Or are worse? But ... dunno. Might be good to think about and/or explore, and see how clear you can make the results.

kk. I'll see what I can come up with, and how clear the results are. Tessa has some really beautiful figures that I think conceptually address this in a general way that encompasses all of GTDB, so I'll definitely be citing those, but I think you're right: if the results are clear, it may be nice to present them here.

More thoughts later, thank you again!

ctb commented 2 years ago

Merging for now to pull in these changes, but I'll keep working through the rest of your comments!

Yeah, I was thinking I should have opened an issue with the checklist items, and then referenced the PR. Next time!

ctb commented 2 years ago

RELATED QUESTION: are you running this on the proteome or the genome? (If the proteome, then codon degeneracy is definitely not an issue :)

In the first section of the paper, I'm using the proteomes. In the second two sections, I use orpheum to get the "proteomes", and then run it from there. Not sure how/if to make this more clear?

I can't tell if it's my own brain misleading me or if it's more general, but it took me a while to figure it all out, even though you would think I would get it immediately. I wonder if a process diagram like Figure 1 or 2 in olga's paper might be good? I think you have something similar in the IBD paper, too.

Pangenomes comprise all genes found within a group of organisms

Is this really the definition of pangenomes? (I'm not sure. It struck me by surprise, is all.)

Tis. I have a good reference for it that I'll be sure to add.

Interesting. Just chewing that over, I'm frustrated at "pangenome" taken to mean genes. But OK :).

Second, you could try to drive the point home by doing a nucleotide comparison. The simplest would be to show that these correlations do not hold, I guess? Or are worse? But ... dunno. Might be good to think about and/or explore, and see how clear you can make the results.

kk. I'll see what I can come up with, and how clear the results are. Tessa has some really beautiful figures that I think conceptually address this in a general way that encompasses all of GTDB, so I'll definitely be citing those, but I think you're right: if the results are clear, it may be nice to present them here.

👍

dib-lab / 2021-paper-metapangenomes

[MRG] minor edits for first draft #2
