dib-lab / 2021-paper-metapangenomes


[MRG] minor edits for first draft #2

Closed ctb closed 2 years ago

ctb commented 2 years ago

comments on your first draft!

I really like this first draft! The results are compelling, and (I am probably biased here) the argument is pretty clear!

(Check a box once you have considered the item; it does not need to be resolved the way I suggested :)


big stuff


small stuff

Pangenomes comprise all genes found within a group of organisms

Reduced alphabet k-mers accurately estimate microbial pangenomes

pangenome size


I updated the third-base pair wobble sentence to read:

reduced alphabet k-mer is sufficient to overcome minor variations such as those introduced by codon degeneracy or evolutionary drift

but I am actually thinking that we should remove codon degeneracy because all three of these encodings explicitly ignore codon degeneracy!
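To make the codon-degeneracy point concrete, here is a minimal Python sketch. It is not code from the paper or from sourmash. It assumes a four-codon slice of the standard codon table and the six-group Dayhoff alphabet as it is commonly defined; the DNA strings and peptides are made up for illustration. Synonymous codons already disappear at translation, so the reduced alphabet only adds tolerance to amino acid substitutions within a group.

```python
# Four codons from the standard genetic code (illustrative slice, not a full table).
CODONS = {"GCT": "A", "GCC": "A", "AAA": "K", "AAG": "K"}

# Six-group Dayhoff alphabet as commonly defined (assumed here).
DAYHOFF = {}
for group, letter in [("C", "a"), ("AGPST", "b"), ("DENQ", "c"),
                      ("HKR", "d"), ("ILMV", "e"), ("FWY", "f")]:
    for aa in group:
        DAYHOFF[aa] = letter


def translate(dna):
    """Translate in-frame DNA using the small codon slice above."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def encode_dayhoff(protein):
    """Map each amino acid to its Dayhoff group letter."""
    return "".join(DAYHOFF[aa] for aa in protein)


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Synonymous codons: different DNA, identical protein, so protein k-mers
# (and anything derived from them) never see codon degeneracy.
same_protein = translate("GCTAAA") == translate("GCCAAG")

# Hypothetical peptides differing by one conservative substitution (I -> L).
pep_a = "MKTAYIAKQR"
pep_b = "MKTAYLAKQR"
k = 5
shared_protein = kmers(pep_a, k) & kmers(pep_b, k)
shared_dayhoff = kmers(encode_dayhoff(pep_a), k) & kmers(encode_dayhoff(pep_b), k)

print(same_protein, len(shared_protein), len(shared_dayhoff))
```

In this toy case the I-to-L change breaks every protein 5-mer that spans it, while the Dayhoff-encoded peptides stay identical because I and L fall in the same group.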



Do I need to discuss scaled at all?

Should I compare this against nucleotide k-mers at all?

First, you could simply address it with citations of some kind. Tessa's paper might be enough! And/or the Kraken paper, or other papers that Tessa can suggest.

Second, you could try to drive the point home by doing a nucleotide comparison. The simplest would be to show that these correlations do not hold, I guess? Or are worse? But ... dunno. Might be good to think about and/or explore, and see how clear you can make the results.



We next investigated whether other pangenome metrics were well correlated between k-mer-based and gene-based methods


Title of results section:

Jaccard containment between reduced alphabet k-mers and k-mers in databases accurately predicts open reading frames in short sequencing reads


Using default parameters, orpheum accurately separated coding from non-coding reads when reads were simulated from genomes in GTDB
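For readers less familiar with containment, a rough Python sketch of the idea follows. This is an illustration, not orpheum's implementation. The `looks_coding` helper, the threshold of 0.5, and the sequences are all hypothetical. In practice both the candidate frame and the database k-mers would be encoded in the same (possibly reduced) alphabet before comparison.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, reference):
    """Fraction of query k-mers that are also in reference: |Q & R| / |Q|."""
    if not query:
        return 0.0
    return len(query & reference) / len(query)


def looks_coding(frame, reference_kmers, k, threshold=0.5):
    """Hypothetical decision rule: call a translated frame coding when its
    containment in the reference k-mer set meets the threshold (assumed value)."""
    return containment(kmers(frame, k), reference_kmers) >= threshold


# Made-up reference peptide standing in for a database of known proteins.
k = 5
reference = kmers("MKTAYIAKQRQISFVKSHFSRQ", k)

frame_in_gene = "TAYIAKQRQI"   # a translated frame that matches the reference
frame_off = "LLPWGGSTNC"       # a translated frame with no matching k-mers

print(containment(kmers(frame_in_gene, k), reference))
print(containment(kmers(frame_off, k), reference))
```

Containment is asymmetric by design: it asks how much of the short read is explained by the database, which is why it suits short queries compared against a very large reference.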



Discussion:

We show that pangenome metrics like core, cloud, and shell pangenome fractions can be accurately estimated ... with k-mers from other reduced alphabets.
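As a reminder of what the core, shell, and cloud fractions measure, here is a small Python sketch. It is not the paper's pipeline. The cutoffs (core at 95% of genomes or more, cloud below 15%) follow a common convention and are assumptions here, and the "k-mers" are placeholder strings rather than real sequence.

```python
from collections import Counter


def pangenome_fractions(genome_kmers, core_min=0.95, cloud_max=0.15):
    """Classify each k-mer by the fraction of genomes that contain it and
    return the share of distinct k-mers in each category.

    genome_kmers maps a genome name to its set of k-mers. The cutoffs are
    assumed conventions, not values taken from the paper.
    """
    n_genomes = len(genome_kmers)
    counts = Counter()
    for kmer_set in genome_kmers.values():
        counts.update(kmer_set)

    labels = Counter()
    for n_present in counts.values():
        freq = n_present / n_genomes
        if freq >= core_min:
            labels["core"] += 1
        elif freq < cloud_max:
            labels["cloud"] += 1
        else:
            labels["shell"] += 1

    total = len(counts)
    if total == 0:
        return {"core": 0.0, "shell": 0.0, "cloud": 0.0}
    return {name: labels[name] / total for name in ("core", "shell", "cloud")}


# Ten toy genomes; strings stand in for hashed k-mers.
genomes = {f"g{i}": {"core1"} for i in range(10)}
for i in range(5):
    genomes[f"g{i}"].add("shell1")   # present in 5 of 10 genomes
genomes["g0"].add("cloud1")          # present in 1 of 10 genomes
genomes["g9"].add("cloud2")          # present in 1 of 10 genomes

fractions = pangenome_fractions(genomes)
print(fractions)
```

The same function works whether the sets hold genes, protein k-mers, or reduced-alphabet k-mers, which is what makes the gene-based and k-mer-based fractions directly comparable.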

taylorreiter commented 2 years ago

Merging for now to pull in these changes, but I'll keep working through the rest of your comments!

taylorreiter commented 2 years ago

RELATED QUESTION: are you running this on the proteome or the genome? (If the proteome, then codon degeneracy is definitely not an issue :)

In the first section of the paper, I'm using the proteomes. In the second two sections, I use orpheum to get the "proteomes", and then run it from there. Not sure how/if to make this more clear?

Pangenomes comprise all genes found within a group of organisms

Is this really the definition of pangenomes? (I'm not sure. It struck me by surprise, is all.)

Tis. I have a good reference for it that I'll be sure to add.

Second, you could try to drive the point home by doing a nucleotide comparison. The simplest would be to show that these correlations do not hold, I guess? Or are worse? But ... dunno. Might be good to think about and/or explore, and see how clear you can make the results.

kk. I'll see what I can come up with, and how clear the results are. Tessa has some really beautiful figures that I think conceptually address this in a general way that encompasses all of GTDB, so I'll definitely be citing those. But I think you're right: if the results are clear, it may be nice to present them here.

More thoughts later, thank you again!

ctb commented 2 years ago

Merging for now to pull in these changes, but I'll keep working through the rest of your comments!

Yeah, I was thinking I should have opened an issue with the checklist items, and then referenced the PR. Next time!

ctb commented 2 years ago

RELATED QUESTION: are you running this on the proteome or the genome? (If the proteome, then codon degeneracy is definitely not an issue :)

In the first section of the paper, I'm using the proteomes. In the second two sections, I use orpheum to get the "proteomes", and then run it from there. Not sure how/if to make this more clear?

I can't tell if it's my own brain misleading me or if it's more general, but it took me a while to figure it all out, even though you would think I would get it immediately. I wonder if a process diagram like Figure 1 or 2 in Olga's paper might be good? I think you have something similar in the IBD paper, too.

Pangenomes comprise all genes found within a group of organisms

Is this really the definition of pangenomes? (I'm not sure. It struck me by surprise, is all.)

Tis. I have a good reference for it that I'll be sure to add.

Interesting. Just chewing that over, I'm frustrated at "pangenome" taken to mean genes. But OK :).

Second, you could try to drive the point home by doing a nucleotide comparison. The simplest would be to show that these correlations do not hold, I guess? Or are worse? But ... dunno. Might be good to think about and/or explore, and see how clear you can make the results.

kk. I'll see what I can come up with, and how clear the results are. Tessa has some really beautiful figures that I think conceptually address this in a general way that encompasses all of GTDB, so I'll definitely be citing those. But I think you're right: if the results are clear, it may be nice to present them here.

👍