Manuscript comments - GitHub issues

kyleabeauchamp commented 9 years ago

First from Sonya, then from me.

General comments:

1a. I like Kyle's comment that for a methods paper there is maybe too much analysis of the kinase data, especially considering that the analysis is not automatically performed by the method… Given this, I feel like it might make sense to expand figure 1 into several figures. Figure 1 is pretty complicated in its current state... The one big panel of template structures and target structures could be a figure on its own. Maybe the other figure can visualize the amount of time/space each step takes?

1b. Relatedly, figures 3 and 4 can probably be combined into one figure with several panels or something, as their contents are pretty similar.

1c. Also, it would be very helpful if the steps illustrated in figure 1 were labeled individually (a, b, c or i, ii, iii or whatever) so that the individual steps in the figure can be referred to in the text as they are being explained.

1. I feel like the relationship of this work to building MSMs should be emphasized more clearly in the Introduction (lines 615-619 might be relevant to mention earlier, for example). Specifically, the paragraph that seems to address this most head-on (lines 66-85) is extremely weak, though it is possible that just taking out words like 'could' and 'may' will help.
2. It seems like the Rosetta loop modeling is not explicitly mentioned in the methods? There is a mention of 'database-derived statistics' that seems to refer to Modeller; I'm wondering if this was instead supposed to be a reference to Rosetta?
3. There is a big section about the decision to cluster separate chains of the same PDB. What was the result of this: how many PDBs had significantly different conformations in different chains?
4. I would avoid calling these 'low resolution structures' and call them models instead (lines 326-341), especially as the sequence identity for some of these is still very low.
5. Figure 8 is the money shot. BOOYAH!
6. I think it is worth mentioning in the Discussion that individual members of superfamilies often have very different functionalities despite having similar structure, such that not all conformations seen in a superfamily will necessarily be relevant to every individual member. For us this is fine, as we will be using MSMs, and if a structure doesn't interconvert with the others then we don't include it in the analysis, etc., etc. Seems worth mentioning...
7. The paragraphs about the importance of kinases (lines 576-587, lines 456-474) are a bit weak. I wouldn't start with Src as an important kinase to target, as we currently have no idea, really, what its significance to human cancer is. Abl is probably a better place to start. Happy to help with this. Maybe good to reference some sort of review...
8. Would be cool to see a sequence alignment of kinases representative of the three sequence identity groups in Figure 4B. Happy to help with this, too.

Specific comments:

Line 128: the sequences the user is interested in generating simulation-ready structural models for => the sequences for which the user is interested in generating simulation-ready structural models.
Line 139: us => is
Line 181 and Line 187 use 'ensembler gather_templates', however for other subcommand descriptions 'ensembler' is left out. This is maybe fine, but just thought I'd mention it for the sake of consistency.
Line 263: 'residues spans'
Line 284: 'the the BioPython'
Line 344: As well improving => As well as improving
Fig. 2: it would be good to indicate the numbers mentioned in the text (mean/median/max) on the figure itself.
All four instances of the word 'various' I think can be taken out. (Personal preference)
Fig 6. What is with this extra alpha helix that seems to have unraveled in some cases on the left of Abl? It doesn't seem to be in Src...

Overall looks good. Below are my initial comments.

General comments:

A lot of command line code is included in the main text. I worry somewhat that some journals would prefer to keep the main text about the principles, saving the actual command lines for the SI or even the instruction manual. On a related note, IMHO having command lines as part of the main text implies a promise of strong API stability, whereas "hiding" some of the command line instructions in the SI might actually make a less strong statement about API stability. Obviously this is mostly my gut intuition here and can be ignored.
Regarding API stability: we should definitely include a specific version / release number in the "Availability" section, so that people are using the correct version for the included commands.
There are actually some journals that love having lots of explicit command line instructions in the paper (e.g. "Methods"), so we can keep these in mind if we have difficulty (e.g. length) with our currently targeted journals.

Specific Comments:

1. "run on all major operating systems." Maybe we should just claim compatibility with OS X and Linux, as Windows is a bear?
2. p5 line 331: Need a space after the angstrom symbol. I can dig this out of my LaTeX if you have issues fixing it.
3. "Ensembler Modeling Statistics". Is it standard to report (median, standard deviation, and max) without reporting the mean? To me, any time I see the standard deviation I also expect to see a mean.
4. Figure 5. If we use units of kT for energies, I think we need to have the temperature included in the xlabel or the caption. I also wonder if comp. bio. people even like to see kT units, not sure whether they appreciate this unit system. I imagine John might prefer kT, however, so I'm flexible here
5. p9 line 657. "conda ensembler" should be "conda install ensembler"
6. Acknowledgements: Delete KAB and SMH from acknowledgements.
7. Acknowledgements: AFAIK, I don't think I need to add any funding acknowledgements.
8. Figures 7 and 8. To me, the "src / abl" examples came up somewhat suddenly and took up a lot of page space, given that the focus of the paper is on the method. I wonder if merging figures 7 and 8 into one figure might be worth pursuing. Obviously, this one is optional as it would require a lot of unnecessary work.

jchodera commented 9 years ago

Thanks so much for the comments! These are all great, and I agree with essentially everything with the exception of:

Too much focus on the analysis of kinase data. We don't have a lot of evidence of utility of this approach (other than the utility of being able to easily generate models) besides our ability to show it appears to yield good coverage of functionally relevant space for Abl/Src and its ability to model the entire TK, so we can't really take this out or deemphasize it unless we can figure out what to replace it with.
Flowchart diagram. A good overview flowchart that shows the whole concept is essential, I think. I wouldn't support breaking it into separate pieces, unless those were in addition to this master diagram (to "zoom in" on parts of interest). The other suggestions, like adding labels to match text descriptions to flowchart, are all great though!

The point that most of the analysis is not automated does make me wonder what analysis would be general enough to include in Ensembler proper as automated analyses the user could ask to perform. Perhaps we could add a stage to do some basic generation of distributions of sequence identity, etc.?
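As a rough illustration of what such an automated stage might compute, the sketch below bins template-target sequence identities into a simple histogram. The function names, the pre-aligned input format, and the 10% bin width are all assumptions made for illustration; none of this is Ensembler's actual API.

```python
from collections import Counter


def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    # Skip alignment columns where either sequence has a gap character.
    pairs = [(a, b) for a, b in zip(aligned_target, aligned_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def identity_histogram(identities, bin_width=10):
    """Count identities per bin, keyed by the bin's lower edge.

    A value of exactly 100% is folded into the top bin.
    """
    counts = Counter()
    for value in identities:
        lower = min(int(value // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Toy alignments (hypothetical sequences, not real kinase data).
alignments = [
    ("MKV-LAG", "MKVTLAG"),
    ("MKVTLAG", "MRVTLSG"),
    ("MKVTLAG", "AAAAAAG"),
]
identities = [percent_identity(t, m) for t, m in alignments]
histogram = identity_histogram(identities)
print(histogram)
```

Whether gapped columns should count against identity is a design choice; this sketch ignores them, which may not match the definition used in the manuscript.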

Also, we need to think more about what kinds of analyses we could apply to the TK data. One thought that just occurred to me: For the TK sequences with some structures with nearly 100% sequence identity, does our modeling procedure screw these up? We could compute the RMSD between template and model just to make sure things aren't screwed up too much by automated modeling.
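A minimal sketch of that template-versus-model check is below, assuming the two structures have already been reduced to matched (N, 3) coordinate arrays (for example, C-alpha atoms of aligned residues). It uses NumPy and the standard Kabsch superposition; the function name and the toy coordinates are hypothetical, and in practice a library routine such as MDTraj's `rmsd` would likely be used instead.

```python
import numpy as np


def kabsch_rmsd(coords_a, coords_b):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")
    # Remove translation by centering both sets on their centroids.
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    # Kabsch: the SVD of the covariance matrix yields the optimal rotation.
    u, _, vt = np.linalg.svd(a.T @ b)
    # Flip one axis if needed so the result is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = a @ rotation - b
    return float(np.sqrt((diff ** 2).sum() / len(a)))


# Toy "template" coordinates (hypothetical, in angstroms).
template = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [1.5, 1.5, 0.0],
    [0.0, 1.5, 1.0],
    [0.5, 0.2, 2.0],
])
# A "model" that is the same structure rotated 90 degrees about z and shifted.
rot_z_90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
model = template @ rot_z_90.T + np.array([10.0, -5.0, 3.0])
print(kabsch_rmsd(template, model))
```

Because the model here is a rigid-body copy of the template, the printed RMSD should be close to zero; a large value for a near-100% identity template would flag a modeling problem.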

jchodera commented 9 years ago

I think the next step is for @danielparton to integrate these comments.

I believe @pgrinaway is also checking in comments...

danielparton commented 9 years ago

Kyle: 3. "Ensembler Modeling Statistics". Is it standard to report (median, standard deviation, and max) without reporting the mean? To me, any time I see the standard deviation I also expect to see a mean.

I can report the mean. In that case, should I just skip the median? Or include both mean and median?
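To make the mean-versus-median question concrete, here is a small sketch using Python's standard `statistics` module. The values are made up for illustration (they are not from the manuscript), and `summarize` is a hypothetical helper name.

```python
import statistics


def summarize(values):
    """Return mean, median, sample standard deviation, and max for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "max": max(values),
    }


# Hypothetical, strongly skewed values: one outlier dominates the mean.
example = [10, 12, 11, 13, 200]
summary = summarize(example)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For a skewed distribution like this one the mean and median differ substantially, which is one argument for reporting both rather than choosing between them.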

danielparton commented 9 years ago

Responses to various comments here.

Kyle 4: Figure 5. If we use units of kT for energies, I think we need to have the temperature included in the xlabel or the caption. I also wonder if comp. bio. people even like to see kT units, not sure whether they appreciate this unit system. I imagine John might prefer kT, however, so I'm flexible here

@jchodera do you want to stick with kT? If kT, I'll add the temperature in the caption.

Sonya 1: I feel like the relationship of this work to building MSMs should be emphasized more clearly in the Introduction (lines 615-619 might be relevant to mention earlier, for example). Specifically, the paragraph that seems to address this most head-on (lines 66-85) is extremely weak, though it is possible that just taking out words like 'could' and 'may' will help.

So do you think I'm overusing conditional grammar here? Maybe this is a British vs American English thing, since I think British people tend to use conditional grammar more often - even when we think the contained statement is definitely true... So to me, the use of conditional syntax does not make this paragraph sound weak. How do the other Americans here feel about this? The paragraph in question (lines 66-85) is copied below.

In terms of evaluating the work in relation to construction of MSMs, I think the ultimate test is whether it can aid efficiency of sampling, which we won't really know until we have completed a large-scale MSM project (or projects). So I'm not sure what else I could add to make this paragraph sound much stronger. However, I could at least add a sentence somewhere in the Introduction saying that we expect high seqid structures to represent a subset of native-like states, and low seqid structures to aid in sampling more distant regions of phase space.

The ability to fully exploit the large quantity of available protein sequence and structural data in biomolecular simulation studies could open up many interesting avenues for research, enabling the study of entire protein families or superfamilies within a single organism or across multiple organisms. The similarity between members of a given protein family could be exploited to generate arrays of conformational models, which could be used as starting configurations to aid sampling in MD simulations. This approach would be highly beneficial for many MD methods, such as MSM construction, which require global coverage of the conformational landscape to realize their full potential, and would also be particularly useful in cases where structural data is present for only a subset of the members of a protein family. It would also aid in studying protein families known to have multiple metastable conformations---such as kinases---for which the combined body of structural data for the family may cover a large range of these conformations, while the available structures for any individual member might encompass only one or two distinct conformations.

Sonya 2: It seems like the Rosetta loop modeling is not explicitly mentioned in the methods? There is a mention of 'database-derived statistics' that seems to refer to Modeller, I'm wondering if this was instead supposed to be a reference to Rosetta?

See the "Template refinement" subsection - line 235 onwards.

Sonya 3: There is a big section about the decision to cluster separate chains of the same PDB. What was the result of this: how many PDBs had significantly different conformations in different chains?

The clustering is conducted on the set of all models for a given target, which are derived from the set of all template PDB chains. It is not done on a per-PDB basis. The overall statistics can be seen in the pipeline figure, e.g. indicating that of 4248 Src models, 4093 were found to be unique. I could go back and analyze the statistics on a per-PDB basis - average ratio of unique/non-unique chains, or something - if that would be useful?

Sonya 4: I would avoid calling these 'low resolution structures' and call them models instead (lines 326-341), especially as the sequence identity for some of these is still very low.

Sentence in question: While the utility of comparative modeling methods has been greatly enhanced by the recent explosion in the availability of protein structural data, the structures generated are generally considered "low-resolution" in comparison to those derived using experimental techniques such as X-ray crystallography.

I'm talking about homology modeling in general, not the specific TK models generated in this work - just wanted to make sure that is clear. I think it is reasonable to say that homology models are generally considered "low-resolution" (even those with high sequence identity). Suggestions for an alternative phrase are welcome though.

Sonya 6: I think it is worth mentioning in the Discussion that individual members of superfamilies often have very different functionalities despite having similar structure, such that not all conformations seen in a superfamily will necessarily be relevant to every individual member. For us this is fine, as we will be using MSMs, and if a structure doesn't interconvert with the others then we don't include it in the analysis, etc., etc. Seems worth mentioning...

Are you talking about the TKs specifically, or just superfamilies in general? TK catalytic domains obviously all have the same function. More generally speaking, I would say that proteins with different function (but which do have sequence homology) may still have structures available which are accessible by both proteins. However, I think it would be good to mention that we do not expect all generated models to be true "native" conformations. And also that MSM methods are to some extent resilient to poor starting structures - in the sense that disconnected trajectories can be ignored.

Sonya 7: The paragraphs about the importance of kinases (lines 576-587, lines 456-474) are a bit weak. I wouldn't start with Src as an important kinase to target, as we currently have no idea, really, what its significance to human cancer is. Abl is probably a better place to start. Happy to help with this. Maybe good to reference some sort of review...

OK, I can switch these around and start with Abl. Lines 456-474 contain citations to reviews on the cancer involvement of both Abl and Src. Suggestions for additional or alternative citations are welcome.

Sonya 8: Would be cool to see a sequence alignment of kinases representative of the three sequence identity groups in Figure 4B. Happy to help with this, too.

This sounds useful, but it's not clear to me how exactly one would approach this. Do you have specific suggestions? I guess the biggest question would be how to choose representative sequences from each seqid category. Otherwise I could do an MSA of all 90 TK target sequences, ranked according to seqid, and colored according to seqid category. This could go in the SI.

jchodera commented 9 years ago

@jchodera do you want to stick with kT? If kT, I'll add the temperature in the caption.

No strong feelings about kT or kcal/mol, but either way, we should list the temperature in the caption.
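The reason the temperature has to be stated is that an energy in kT only maps to kcal/mol once T is fixed. A small sketch of the conversion follows; the 300 K value is an assumed example, not the temperature used in the manuscript, and the function name is hypothetical.

```python
# Molar gas constant in kcal/(mol*K), i.e. Boltzmann's constant per mole.
GAS_CONSTANT_KCAL_PER_MOL_K = 0.0019872041


def kt_to_kcal_per_mol(energy_in_kt, temperature_kelvin):
    """Convert an energy in units of kT to kcal/mol at the given temperature."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive")
    return energy_in_kt * GAS_CONSTANT_KCAL_PER_MOL_K * temperature_kelvin


# Assumed example temperature; substitute the simulation temperature.
energy = kt_to_kcal_per_mol(1.0, 300.0)
print(f"1 kT at 300 K is about {energy:.3f} kcal/mol")
```

At 300 K one kT is roughly 0.6 kcal/mol, so readers who prefer kcal/mol can convert easily once the caption gives the temperature.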

jchodera commented 9 years ago

So do you think I'm overusing conditional grammar here? Maybe this is a British vs American English thing, since I think British people tend to use conditional grammar more often - even when we think the contained statement is definitely true... So to me, the use of conditional syntax does not make this paragraph sound weak. How do the other Americans here feel about this? The paragraph in question (lines 66-85) is copied below.

In terms of evaluating the work in relation to construction of MSMs, I think the ultimate test is whether it can aid efficiency of sampling, which we won't really know until we have completed a large-scale MSM project (or projects). So I'm not sure what else I could add to make this paragraph sound much stronger. However, I could at least add a sentence somewhere in the Introduction saying that we expect high seqid structures to represent a subset of native-like states, and low seqid structures to aid in sampling more distant regions of phase space.

I agree that we can't make concrete claims here without proof. The equivocal language will have to stay.

jchodera commented 9 years ago

The clustering is conducted on the set of all models for a given target, which are derived from the set of all template PDB chains. It is not done on a per-PDB basis. The overall statistics can be seen in the pipeline figure, e.g. indicating that of 4248 Src models, 4093 were found to be unique. I could go back and analyze the statistics on a per-PDB basis - average ratio of unique/non-unique chains, or something - if that would be useful?

Not sure that would add significant value. What do you think @sonyahanson?

jchodera commented 9 years ago

I think it is reasonable to say that homology models are generally considered "low-resolution" (even those with high sequence identity). Suggestions for an alternative phrase are welcome though.

I'm with @sonyahanson on this. We should avoid the phrase "low-resolution", lest the structural biologists burn us at the stake for suggesting these are like "low-resolution" experimentally-determined structures.

sonyahanson commented 9 years ago

Just a few comments to your comments on my comments below.

(1) The MSM comment was meant strictly in terms of explaining the motivation for the work in the Introduction, and not 'evaluating the work' or trying to make any sort of concrete claims. Below is my attempt at this.

The wealth of structural data now available for certain families of proteins is tremendous, but often the exact conformation of interest isn't available for the exact protein of interest. Even with recent advances in molecular simulation, a full sampling of the conformational landscape of a protein can require vast computational resources. However, the structural data from individual members of a protein family can give clues toward the states accessible to other members of that family. Thus, the similarity between members of a given protein family can be exploited to generate arrays of conformational models, which can be used as starting configurations. This has the potential to improve sampling of higher energy states of proteins in MD simulations. This approach would be highly beneficial for certain MD methods. MSM construction, for example, requires global coverage of the conformational landscape to realize its full potential. Additionally, these models would be particularly useful in cases where structural data is present for only a subset of the members of a protein family. It would also aid in studying protein families known to have multiple metastable conformations---such as kinases---for which the combined body of structural data for the family may cover a large range of these conformations, while the available structures for any individual member might encompass only one or two distinct conformations.

(3) The PDB clustering suggestion was mostly to satisfy my own curiosity. I had honestly missed the part in Figure 1, as I do find it hard to follow, as mentioned previously.

(8) "I guess the biggest question would be how to choose representative sequences from each seqid category." ---> I don't think this matters. Just pick three from each category using your personal intuition, and say it is a 'representative' sequence alignment. I think this will already be pretty informative.

(7) Will look into some kinase references...

Also, just thought of something else: (9) In the paragraph that mentions not taking full advantage of structures etc. "largely due to limitations in software architecture" (lines 45-66), should other software that does facilitate easy simulation setup (CHARMM-GUI, for example) be mentioned?

danielparton commented 9 years ago

Thanks for all the helpful comments! I have addressed these in the latest version of the text: https://github.com/choderalab/ensembler-manuscripts/blob/master/manuscript/ms.pdf Let me know if you have further comments. I'm planning to put this on bioRxiv by the end of the day.

jchodera commented 9 years ago

This is looking great! I have some minor comments that I will create issues for.

The only things that need to be addressed prior to posting on bioRxiv are the TODOs. Can we move them to GitHub issues before posting?

Introduction has:

[TODO: Add URL of where to get the code and TK models here]

Modeling of all human tyrosine kinase domains has:

[TODO: Add list of TK UniProt identifiers and gene names. There will be a file with this info within the TK data set.]

Availability has:

[TODO: Make the commands used to generate the TK data available. Perhaps in the TK model data set.]

Can we move this to a GitHub issue?

Acknowledgements still has:

[Add PBG, ?SMH, ?KAB support statements.]

Can we get that fixed?

danielparton commented 9 years ago

I actually just removed those TODOs :)

However, there are TODOs (uncolored) remaining in this sentence:

"The specific command-line flags and API details discussed in this paper correspond to the version [TODO] release (TODO: link)."

I was thinking I would wait until the manuscript is ready for submission to a journal, then make a new release, and link to that.

Are you ok with leaving this sentence in as it is? Or should I just remove it for the bioRxiv version?

danielparton commented 9 years ago

Regarding support statements for PBG, SMH, and KAB - do we definitely need to include these? If so, where can I get the necessary information - I guess either @jchodera or Kadeem?

kyleabeauchamp commented 9 years ago

I do not have a support acknowledgement.

kyleabeauchamp commented 9 years ago

Actually, neither does Patrick, as we found out by asking many many people.

jchodera commented 9 years ago

Regarding support statements for PBG, SMH, and KAB - do we definitely need to include these?

We should definitely thank the people that gave us the money! I suppose it's less critical for the bioRxiv version, but we have to get this right eventually in the submission.

If so, where can I get the necessary information - I guess either @jchodera or Kadeem?

Here we go:

PBG acknowledges partial funding support from the Weill Cornell Graduate School of Medical Sciences.
KAB was supported in part by Starr Foundation grant I8-A8-058.
JDC, KAB, and DLP acknowledge partial support from NIH grant P30 CA008748.
All authors acknowledge the generous support of this research by the Sloan Kettering Institute.

jchodera commented 9 years ago

Sorry this was so complicated, guys. The budgets are crazier than they seem.

danielparton commented 9 years ago

What about Sonya?

danielparton commented 9 years ago

This is how I've written the support statements in the text (my local changes):

JDC, KAB, and DLP acknowledge partial support from NIH grant P30 CA008748. JDC and DLP also acknowledge the generous support of a Louis V.~Gerstner Young Investigator Award. KAB was also supported in part by Starr Foundation grant I8-A8-058. PBG acknowledges partial funding support from the Weill Cornell Graduate School of Medical Sciences.

sonyahanson commented 9 years ago

Don't think I have any special support...

danielparton commented 9 years ago

Ah ok, thanks

jchodera commented 9 years ago

Sonya will be supported by the AZ-sponsored research funds once they kick in.

Let's also thank SKI at the end in the next revision.

choderalab / ensembler-manuscripts

Manuscript comments #29