Closed ctb closed 2 years ago
Merging for now to pull in these changes, but I'll keep working through the rest of your comments!
RELATED QUESTION: are you running this on the proteome or the genome? (If the proteome, then codon degeneracy is definitely not an issue :)
In the first section of the paper, I'm using the proteomes. In the second two sections, I use orpheum to get the "proteomes", and then run it from there. Not sure how/if to make this more clear?
Pangenomes comprise all genes found within a group of organisms
Is this really the definition of pangenomes? (I'm not sure. It struck me by surprise, is all.)
Tis. I have a good reference for it that I'll be sure to add.
Second, you could try to drive the point home by doing a nucleotide comparison. The simplest would be to show that these correlations do not hold, I guess? Or are worse? But ... dunno. Might be good to think about and/or explore, and see how clear you can make the results.
kk. I'll see what I can come up with, and how clear the results are. Tessa has some really beautiful figures that I think conceptually address this in a general way that encompasses all of GTDB, so I'll definitely be citing those, but I think you're right: if the results are clear, it may be nice to present them here.
More thoughts later, thank you again!
Merging for now to pull in these changes, but I'll keep working through the rest of your comments!
Yeah, I was thinking I should have opened an issue with the checklist items, and then referenced the PR. Next time!
RELATED QUESTION: are you running this on the proteome or the genome? (If the proteome, then codon degeneracy is definitely not an issue :)
In the first section of the paper, I'm using the proteomes. In the second two sections, I use orpheum to get the "proteomes", and then run it from there. Not sure how/if to make this more clear?
I can't tell if it's my own brain misleading me or if it's more general, but it took me a while to figure it all out, even though you would think I would get it immediately. I wonder if a process diagram like Figure 1 or 2 in olga's paper might be good? I think you have something similar in the IBD paper, too.
Pangenomes comprise all genes found within a group of organisms
Is this really the definition of pangenomes? (I'm not sure. It struck me by surprise, is all.)
Tis. I have a good reference for it that I'll be sure to add.
Interesting. Just chewing that over, I'm frustrated at "pangenome" taken to mean genes. But OK :).
Second, you could try to drive the point home by doing a nucleotide comparison. The simplest would be to show that these correlations do not hold, I guess? Or are worse? But ... dunno. Might be good to think about and/or explore, and see how clear you can make the results.
kk. I'll see what I can come up with, and how clear the results are. Tessa has some really beautiful figures that I think conceptually address this in a general way that encompasses all of GTDB, so I'll definitely be citing those, but I think you're right: if the results are clear, it may be nice to present them here.
👍
comments on your first draft!
I really like this first draft! The results are compelling, and (I am probably biased here) the argument is pretty clear!
(The check boxes are for when the item is considered by you, and need not be when it is resolved to my suggestion :)
big stuff
[x] I think you should evaluate containment/similarity of gene-based content within k-mer neighborhoods. It is not immediately obvious to me whether you can do this with the pangenomes (I think you can - core / shell / cloud?). You might need to do this with the metagenomes, tho, which would necessitate calculating the gene-based numbers on those, too.
[x] is there a way to link content to function with the k-mer based approaches? (I know, may be out of scope for this paper :)
[x] suggest/request that the papers and pipelines be moved into dib-lab/ org at some point :). Also, zenodo enable.
[x] it actually took me a little while to understand the full flow - that orpheum is needed so you don't need to do six-frame translation. Maybe explicitly mention this somewhere - that without orpheum, you'd need to do six-frame translation and this would inflate the number of k-mers. Maybe it belongs in intro? Definitely belongs in the relevant results section.
[x] it might be good (somewhere) to talk about how working with reads is better than working with cDBGs, because in regions of high error / high variation, the cDBG nodes are often shorter than reads.
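To illustrate the six-frame translation point in the checklist above, here is a minimal sketch (not the paper's pipeline, and not orpheum's method). It assumes the standard genetic code, uppercase ACGT input, and a toy sequence invented for illustration; the function names `translate`, `six_frames`, `aa_kmers`, and `containment` are hypothetical. It compares the amino acid k-mers from the true reading frame against the union across all six frames:

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one frame; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate three forward and three reverse-complement frames."""
    rc = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, rc) for offset in range(3)]


def aa_kmers(protein, k):
    """Set of amino acid k-mers, skipping any that span a stop codon (*)."""
    kmers = (protein[i:i + k] for i in range(len(protein) - k + 1))
    return {kmer for kmer in kmers if "*" not in kmer}


def containment(a, b):
    """Fraction of k-mers in a that are also present in b."""
    return len(a & b) / len(a) if a else 0.0


# Toy sequence (an assumption for illustration) encoding M-A-K-V-L-E-R in frame 0.
dna = "ATGGCTAAGGTTCTGGAACGT"
k = 3
true_kmers = aa_kmers(translate(dna), k)
six_frame_kmers = set().union(*(aa_kmers(p, k) for p in six_frames(dna)))

print(len(true_kmers), len(six_frame_kmers))
print(containment(true_kmers, six_frame_kmers))
```

The true-frame k-mers are always a subset of the six-frame union, so any extra k-mers in the union come from frames that do not encode the protein. That is the inflation that predicting the coding frame first is meant to avoid.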
small stuff
I updated the third-base pair wobble sentence to read:
but I am actually thinking that we should remove codon degeneracy because all three of these encodings explicitly ignore codon degeneracy!
First, you could simply address it with citations of some kind. Tessa's paper might be enough! And/or the kraken paper, or other papers that Tessa can suggest.
Second, you could try to drive the point home by doing a nucleotide comparison. The simplest would be to show that these correlations do not hold, I guess? Or are worse? But ... dunno. Might be good to think about and/or explore, and see how clear you can make the results.
[x] Figure 2: suggest changing 'alphabet' to 'encoding' on the legend, and maybe adding 'aa' to 'k-mer size' - e.g. 'k-mer size (aa)' or 'aa k-mer size'
[x] Figure 3: 'exlcuded' in panel C -> excluded
[x] Figure 3: seems like panel A and B could be one plot?
[x] Heaps law: citation?
Title of results section:
[x] Figure 5 - had a really hard time seeing the "Coding" and "Non-coding" headers. Bold? Make bigger? Something?
[x] Figure 6 - I love the content, but the whitespace on the top graph is ...annoying. I wonder if there is a way to combine the top and bottom graphs; maybe large colored marks on the x axis of the bottom graph for when abx were administered?
[x] Figure 7 - can you add guide markers or guide arrows for the points you are making in the text about P. vulgatus, etc?
Discussion: