Closed danielparton closed 9 years ago
Let's chat this afternoon! I don't think we need to spend time trying to interpret this data. I think we just want a funnel diagram showing the number of surviving models at each stage for Src and Abl and the collected statistics of all TKs.
J On Feb 22, 2015 11:40 PM, "Daniel Parton" notifications@github.com wrote:
Here is the attrition rate data. The figure included here is just to provide a quick visualization. I'm not sure yet how (or even if) we would want to visualize this data in the actual paper.
Data is gathered at four stages in the Ensembler pipeline, across all 90 TK targets and all templates. For the counts data, the four stages have the following meanings:
- templates - total number of target-template pairs
- models - total number of models successfully built by Modeller
- unique - total number of unique models, following RMSD-based clustering
- implicit_refined - total number of models which make it through implicit refinement
The rates data is generated by dividing the counts for each stage by the counts from the previous stage. This data is depicted in the figure.
Data is also stratified by sequence identity, using the same ranges as in the RMSD distribution figure.
Counts
Sequence identity (%)     all     0-35    35-55   55-101
templates              398970   343393    50643     4934
models                 382568   333564    44602     4402
unique                 378839   332055    42907     3877
implicit_refined       373513   327250    42431     3832
Rates
Sequence identity (%)     all     0-35    35-55   55-101
templates               1.000    1.000    1.000    1.000
models                  0.959    0.971    0.881    0.892
unique                  0.990    0.995    0.962    0.881
implicit_refined        0.986    0.986    0.989    0.988
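For reference, the per-stage rates can be reproduced from the counts quoted in this thread. The following is only a minimal sketch: the names `RANGES`, `COUNTS`, and `stage_rates` are made up for illustration and are not part of Ensembler, and the first stage is assumed to have a rate of 1.0 because it has no previous stage.

```python
# Counts copied from this thread; columns are sequence-identity ranges (%).
RANGES = ["all", "0-35", "35-55", "55-101"]
COUNTS = {
    "templates":        [398970, 343393, 50643, 4934],
    "models":           [382568, 333564, 44602, 4402],
    "unique":           [378839, 332055, 42907, 3877],
    "implicit_refined": [373513, 327250, 42431, 3832],
}


def stage_rates(counts):
    """Divide each stage's counts by the counts of the previous stage.

    Assumption: the first stage has no predecessor, so its rate is 1.0.
    Relies on dict insertion order (Python 3.7+) for the stage order.
    """
    rates = {}
    previous = None
    for stage, values in counts.items():
        if previous is None:
            rates[stage] = [1.0] * len(values)
        else:
            rates[stage] = [v / p for v, p in zip(values, previous)]
        previous = values
    return rates


RATES = stage_rates(COUNTS)

# Print a small table with one row per stage and one column per range.
print(f"{'':<18}" + "".join(f"{r:>8}" for r in RANGES))
for stage, values in RATES.items():
    print(f"{stage:<18}" + "".join(f"{v:>8.3f}" for v in values))
```

Each rate is a survival fraction relative to the previous stage only, so multiplying the rates down a column gives the overall fraction of target-template pairs that survive to implicit refinement.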
Analysis:
Overall, the Modeller step is the biggest source of attrition during the modeling process (about 4% of target-template pairs do not produce a model).
Looking at the attrition rates as a function of sequence identity, the unique-stage data simply shows that high sequence identity templates are more likely to produce non-unique models.
As for the model-stage data, I'm not sure how to interpret that - I can't think of an obvious reason why the 35-55% sequence identity range should show greater attrition than the others. Perhaps there is one set of templates with similar structures which are particularly difficult to model for whatever reason. The sample size is also an order of magnitude less than the 0-35% range. I'd have to do some more detailed analysis to understand this.
Finally, at the implicit solvent MD stage, attrition rates do not seem to be greatly affected by sequence identity.
Any thoughts on how to present this data in the paper? Maybe we should have a meeting in-person to discuss.
— Reply to this email directly or view it on GitHub https://github.com/choderalab/ensembler-manuscripts/issues/9.
I've put the model counts (attrition) data in a table in the manuscript - Table I. This is separate from the funnel diagram, which now just depicts the various Ensembler stages.
I've also calculated the average timings (per template/model) for the most compute-intensive Ensembler stages - those are in Table II.
How does this look?
This is fine for now, but I think it would look better if we could incorporate these into Fig. 1 at some point. The main reason is that Table 1 and Table 2 have different ways of specifying the different stages of calculations ("Templates" vs "Template reconstruction"; "Models" vs "Modeling") and it's unclear to me how these data relate to the stages depicted in Fig 1. Having them all together would be much more communicative, I think.
Is Fig 1 supposed to be a one-column or a two-column figure? We could enlarge it into a two-column figure to include this data alongside it.
I agree it would be good to have a nice visual correspondence between this data and the funnel diagram. It's a bit difficult to do that with the current diagram, as it is not really linear in flow. One option would be to make a more linear version. I have some ideas. Can we go over these quickly in-person with a whiteboard at some point, before I continue modifying the figure?
Definitely! 3:00 PM tomorrow OK?
Sounds good
New pipeline figure. How does this look? It takes up about half a page in the manuscript as a two-column figure. Is that reasonable? The font size looks about right at that size.
Pretty! I feel like there are a lot of different fonts, though...
I count at least five fonts right now, including a mixture of serif, sans-serif, and all caps.
Also, "# Models" seems out of place; can you say "Number of models" or even just "Models"?
"cpu(h)" also seems weird. "CPU-h" is maybe better?
We'll drop in the solvation with explicit water stats once we finish that for Src and Abl, right?
New pipeline figure to address these points. I think this looks a lot cleaner. Also, I'm running the solvation code for Src and Abl at the moment, and will add that data into the table once it's done.
Looks great!
This looks good now. Closing.
Whoops. Would be helpful to add the Src and Abl explicit solvent info too.
Updated with explicit solvent model counts. Is this ready to close now?
Timing data?
Updated with timing data:
Looking good!
I just realized the simulation stage has timings quoted as "CPU-h". Do we want to quote these as "GPU-h"? Also, which GPUs were used---the GTX-680s or GTX-Titans, or a mix?
It would have been a mix of 680s and Titans. I'll change CPU-h to GPU-h for the simulation stage.