choderalab / ensembler-manuscripts

Manuscript for Ensembler v1

Attrition rates data/figure #9

Closed · danielparton closed this issue 9 years ago

danielparton commented 9 years ago

Here is the attrition rate data. The figure included here is just to provide a quick visualization. I'm not sure yet how (or even if) we would want to visualize this data in the actual paper.

Data is gathered at four stages in the Ensembler pipeline, across all 90 TK targets and all templates. For the counts data, the four stages have the following meanings:

  • templates - total number of target-template pairs
  • models - total number of models successfully built by Modeller
  • unique - total number of unique models, following RMSD-based clustering
  • implicit_refined - total number of models which make it through implicit refinement

The rates data is generated by dividing the counts for each stage by the counts from the previous stage. This data is depicted in the figure.

Data is also stratified by sequence identity, using the same ranges as in the RMSD distribution figure.

Figure: https://github.com/choderalab/ensembler-manuscripts/raw/c56c7889ef28d9d4dd195e9f561dff8237e79166/figures/attrition/attrition_rates.png

# Counts (columns: all, then by template sequence identity range in %)
                     all    0-35  35-55  55-101
templates         398970  343393  50643    4934
models            382568  333564  44602    4402
unique            378839  332055  42907    3877
implicit_refined  373513  327250  42431    3832

# Rates (columns: all, then by template sequence identity range in %)
                   all  0-35  35-55  55-101
templates         1.000 1.000  1.000   1.000
models            0.959 0.971  0.881   0.892
unique            0.990 0.995  0.962   0.881
implicit_refined  0.986 0.986  0.989   0.988
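
As a sanity check, the rates table can be reproduced directly from the counts: each stage's counts are divided by the counts of the preceding stage, with the first stage defined as 1.0. The sketch below is not part of Ensembler; the pandas usage and variable names are just for illustration.

```python
# Reproduce the per-stage survival rates from the counts table above.
# Sketch only; not part of the Ensembler codebase.
import pandas as pd

stages = ["templates", "models", "unique", "implicit_refined"]
counts = pd.DataFrame(
    {
        "all":    [398970, 382568, 378839, 373513],
        "0-35":   [343393, 333564, 332055, 327250],
        "35-55":  [ 50643,  44602,  42907,  42431],
        "55-101": [  4934,   4402,   3877,   3832],
    },
    index=stages,
)

# Divide each row by the row above it; the first stage has no predecessor,
# so its rate is defined as 1.0.
rates = (counts / counts.shift(1)).fillna(1.0)
print(rates.round(3))
```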

Analysis:

The Modeller step is the biggest source of attrition during the modeling process.

Looking at the attrition rates as a function of sequence identity, the unique-stage data simply shows that high sequence identity templates are more likely to produce non-unique models.
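
As an aside, a greedy RMSD-based uniqueness filter of the kind referred to above might look roughly like the sketch below. This is illustrative only: the function name, file handling, and the 0.06 nm cutoff are assumptions, not the actual Ensembler implementation. With a fixed cutoff, a group of high-identity templates that all yield nearly the same model collapses onto a single representative, which is consistent with the lower unique-stage survival at high sequence identity.

```python
# Illustrative greedy RMSD-based deduplication of models for one target.
# Assumes all models share the same topology (same target sequence);
# the 0.06 nm cutoff is a placeholder, not Ensembler's actual setting.
import mdtraj as md

def filter_unique_models(model_paths, cutoff_nm=0.06):
    """Keep a model only if its heavy-atom RMSD to every previously
    kept model exceeds the cutoff."""
    kept = []  # list of (path, mdtraj.Trajectory)
    for path in model_paths:
        model = md.load(path)
        heavy = model.topology.select_atom_indices("heavy")
        is_duplicate = any(
            md.rmsd(model, ref, atom_indices=heavy)[0] < cutoff_nm
            for _, ref in kept
        )
        if not is_duplicate:
            kept.append((path, model))
    return [path for path, _ in kept]
```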

As for the model-stage data, I'm not sure how to interpret that - I can't think of an obvious reason why the 35-55% sequence identity range should show greater attrition than the others. Perhaps there is one set of templates with similar structures which are particularly difficult to model, for whatever reason. The sample size is also an order of magnitude smaller than the 0-35% range. I'd have to do some more detailed analysis to understand this.

Finally, at the implicit solvent MD stage, attrition rates do not seem to be greatly affected by sequence identity.
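
For reference, a refinement stage of this kind typically amounts to an energy minimization followed by a short implicit-solvent MD run for each model. The OpenMM sketch below illustrates that sort of step; the force field files, timestep, friction, and run length are placeholder assumptions, not the settings Ensembler actually uses.

```python
# Minimal implicit-solvent refinement of a single model with OpenMM.
# Force field choice and simulation parameters are placeholders.
import openmm
from openmm import app, unit

pdb = app.PDBFile("model.pdb")  # hypothetical input model
forcefield = app.ForceField("amber99sbildn.xml", "amber99_obc.xml")  # OBC GBSA implicit solvent
system = forcefield.createSystem(
    pdb.topology,
    nonbondedMethod=app.NoCutoff,  # no periodic box in implicit solvent
    constraints=app.HBonds,
)
integrator = openmm.LangevinIntegrator(
    300.0 * unit.kelvin, 1.0 / unit.picosecond, 2.0 * unit.femtoseconds
)
simulation = app.Simulation(pdb.topology, system, integrator)
simulation.context.setPositions(pdb.positions)
simulation.minimizeEnergy()
simulation.step(50000)  # short refinement MD; length is a placeholder

# Write out the refined model.
state = simulation.context.getState(getPositions=True)
with open("model-refined.pdb", "w") as out:
    app.PDBFile.writeFile(simulation.topology, state.getPositions(), out)
```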

Any thoughts on how to present this data in the paper? Maybe we should have a meeting in-person to discuss.

jchodera commented 9 years ago

Let's chat this afternoon! I don't think we need to spend time trying to interpret this data. I think we just want a funnel diagram showing the number of surviving models at each stage for Src and Abl and the collected statistics of all TKs.
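
For what it's worth, a funnel-style summary of surviving model counts per stage could be mocked up with something like the matplotlib sketch below, using the "all" column of the counts table above. This is just to illustrate the idea; the styling and stage labels would need to match whatever ends up in Fig. 1.

```python
# Quick mock-up of a funnel-style attrition summary (illustration only).
import matplotlib.pyplot as plt

stages = ["templates", "models", "unique", "implicit_refined"]
counts = [398970, 382568, 378839, 373513]

fig, ax = plt.subplots(figsize=(4, 3))
ax.barh(range(len(stages)), counts, align="center")
ax.set_yticks(range(len(stages)))
ax.set_yticklabels(stages)
ax.invert_yaxis()  # pipeline flows from top to bottom
ax.set_xlabel("surviving models (all TK targets)")
fig.tight_layout()
fig.savefig("attrition_funnel.png", dpi=300)
```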

danielparton commented 9 years ago

I've put the model counts (attrition) data in a table in the manuscript - Table I. This is separate from the funnel diagram, which now just depicts the various Ensembler stages.

I've also calculated the average timings (per template/model) for the most compute-intensive Ensembler stages - those are in Table II.

How does this look?

jchodera commented 9 years ago

This is fine for now, but I think it would look better if we could incorporate these into Fig. 1 at some point. The main reason is that Table 1 and Table 2 have different ways of specifying the different stages of calculations ("Templates" vs "Template reconstruction"; "Models" vs "Modeling") and it's unclear to me how these data relate to the stages depicted in Fig 1. Having them all together would be much more communicative, I think.

Is Fig 1 supposed to be a one-column or a two-column figure? We could enlarge it into a two-column figure to include this data alongside it.

danielparton commented 9 years ago

I agree it would be good to have a nice visual correspondence between this data and the funnel diagram. It's a bit difficult to do that with the current diagram, as it is not really linear in flow. One option would be to make a more linear version. I have some ideas. Can we go over these quickly in-person with a whiteboard at some point, before I continue modifying the figure?

jchodera commented 9 years ago

Definitely! 3.00P tomorrow OK?

danielparton commented 9 years ago

Sounds good

danielparton commented 9 years ago

New pipeline figure. How does this look? It takes up about half a page in the manuscript as a two-column figure. Is that reasonable? The font size looks about right at that size.

sonyahanson commented 9 years ago

Pretty! I feel like there are a lot of different fonts, though...

jchodera commented 9 years ago

I count at least five fonts right now, including a mixture of serif, sans-serif, and all caps.

jchodera commented 9 years ago

Also, "# Models" seems out of place; can you say "Number of models" or even just "Models"?

jchodera commented 9 years ago

"cpu(h)" also seems weird. "CPU-h" is maybe better?

jchodera commented 9 years ago

We'll drop in the solvation with explicit water stats once we finish that for Src and Abl, right?

danielparton commented 9 years ago

New pipeline figure to address these points. I think this looks a lot cleaner. Also, I'm running the solvation code for Src and Abl at the moment, and will add that data into the table once it's done.

jchodera commented 9 years ago

Looks great!

jchodera commented 9 years ago

This looks good now. Closing.

jchodera commented 9 years ago

Whoops. Would be helpful to add the Src and Abl explicit solvent info too.

danielparton commented 9 years ago

Updated with explicit solvent model counts. Is this ready to close now?

jchodera commented 9 years ago

Timing data?

danielparton commented 9 years ago

Updated with timing data:

jchodera commented 9 years ago

Looking good!

I just realized the simulation stage has timings quoted as "CPU-h". Do we want to quote these as "GPU-h"? Also, which GPUs were used---the GTX-680s or GTX-Titans, or a mix?

danielparton commented 9 years ago

It would have been a mix of 680s and Titans. I'll change CPU-h to GPU-h for the simulation stage.
