RMSD distribution figure

danielparton commented 9 years ago

Here's the first pass. What do you think?

This is the distribution of RMSD with respect to the highest-sequence-identity model, over various sequence identity ranges. Data is accumulated from all models generated from all 90 tyrosine kinase targets.

I used the seaborn kdeplot function, which uses KDE with Gaussian kernels to plot the estimated probability distributions.
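For readers unfamiliar with the technique, here is a minimal pure-Python sketch of Gaussian kernel density estimation. It is not the seaborn implementation; the RMSD values and the bandwidth of 0.5 are made up for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` with Gaussian kernels."""
    n = len(samples)
    # Each kernel is a normal density of width `bandwidth`; dividing by n
    # makes the mixture integrate to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical RMSD values (in Angstroms), for illustration only.
rmsds = [1.2, 1.5, 1.7, 2.9, 3.1, 3.4, 6.0]
density = gaussian_kde(rmsds, bandwidth=0.5)

# Riemann-sum check that the estimate is normalized.
step = 0.05
grid = [i * step for i in range(-100, 301)]  # -5.0 to 15.0
area = sum(density(x) for x in grid) * step
```

A larger bandwidth gives a smoother curve at the cost of blurring nearby peaks.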

For reference, here is the distribution of sequence identities (again data is accumulated across all TK targets and templates):


kyleabeauchamp commented 9 years ago

Looks cool, but there's a lot going on. I wonder if there's a way to simplify it.

jchodera commented 9 years ago

I really like the RMSD as a function of sequence identity figure, but @kyleabeauchamp does have a point. How about we keep it for now and brainstorm some ways to simplify it?

I wonder if, for example, we can communicate the number of sequences in each class and make the distributions more distinct by using unnormalized kernel-smoothed histograms. You would multiply the distribution in each sequence identity class by the number of sequences n in that class, and shade the area under that curve a different color to make it more distinct. The overall RMSD vs sequence identity distribution would be an unshaded black line that shows the sum of all the smoothed distributions.
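A rough sketch of that idea, in pure Python rather than seaborn: each class curve is a sum of Gaussian kernels that is not divided by n, so its area equals the class count, and the overall curve is the sum of the class curves. The class labels, RMSD values, and bandwidth below are hypothetical.

```python
import math


def count_weighted_kde(samples, bandwidth):
    """Sum of Gaussian kernels, one per sample, NOT divided by n.

    The area under the returned curve equals len(samples).
    """
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))

    def curve(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return curve


# Hypothetical RMSD values per sequence identity class, for illustration only.
rmsd_by_class = {
    "0-35%": [4.1, 5.0, 5.6, 6.2, 7.0, 8.3],
    "35-55%": [2.0, 2.6, 3.1],
    "55-100%": [0.8, 1.1],
}
curves = {
    label: count_weighted_kde(values, bandwidth=0.5)
    for label, values in rmsd_by_class.items()
}


def overall(x):
    """Unshaded black line: the sum of all class curves."""
    return sum(curve(x) for curve in curves.values())


def area(f, lo=-5.0, hi=15.0, step=0.05):
    """Riemann-sum estimate of the area under f on [lo, hi]."""
    steps = int(round((hi - lo) / step))
    return step * sum(f(lo + i * step) for i in range(steps + 1))
```

Each class curve could then be shaded with something like matplotlib's `fill_between`, with `overall` drawn on top as a plain line.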

jchodera commented 9 years ago

A KDE-smoothed version of the distribution of sequence identities (maybe show both the PDF and CDF) would also be useful.

danielparton commented 9 years ago

Hmm, I have a feeling that the unnormalized distributions will be too distinct - I expect that anything except the 20-40% sequence identity range will be too small to see properly, given how much n differs in magnitude between classes. Also, it will require some recoding...

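In the meantime, how about this as an alternative? I just reduced the number of sequence identity ranges to three, and dispensed with the 0-100 range.

https://github.com/choderalab/ensembler-manuscripts/raw/0894941bfd93a4e92fb05358cb38da92423f3b5b/figures/rmsddist/rmsddist2.png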

This still shows the overall trend of increasing RMSD with decreasing sequence identity. And I think the detail we lose was probably not very informative in the first place.

kyleabeauchamp commented 9 years ago

Looks good, I think 3 distributions is much easier to follow.

danielparton commented 9 years ago

Here is the sequence identity distribution (KDE-smoothed), showing both density and cumulative density. (Note: the previous version was also KDE-smoothed, but I increased the kernel bandwidth for this version.)

jchodera commented 9 years ago

Does the sequence identity distribution include all template-target pairs?

Also, we're probably interested in the CDF in the other direction, from 100% down to 0%.
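As a small illustration of a CDF read from 100% down to 0%, the sketch below computes the fraction of template-target pairs at or above each sequence identity threshold. It uses raw counts rather than a KDE-smoothed curve, and the identity values are invented.

```python
def reverse_cdf(values, thresholds):
    """Fraction of `values` that are >= each threshold."""
    n = len(values)
    return [sum(1 for v in values if v >= t) / n for t in thresholds]


# Hypothetical sequence identities (percent), for illustration only.
identities = [12.0, 18.5, 22.0, 25.0, 31.0, 40.0, 58.0, 95.0]

# Thresholds listed from 100% down to 0%.
thresholds = [100, 55, 35, 0]
fractions = reverse_cdf(identities, thresholds)
# fractions == [0.0, 0.25, 0.375, 1.0]
```

Read this way, the curve answers "what fraction of templates have at least this much sequence identity to the target?"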

Traditionally, comparative modeling has a few well-defined ranges: http://en.wikipedia.org/wiki/Homology_modeling#Accuracy

<30% is the "twilight zone" where things are inaccurate
30-50% is where errors are more severe
50-100% is generally very good

However, your distribution suggests ranges of 0-35%, 35-55%, and >55%.

danielparton commented 9 years ago

@jchodera Yes, the sequence identity distribution is for all template-target pairs. Working on the other suggestions now.

danielparton commented 9 years ago

Updated RMSD distribution plot with sequence identity ranges of 0-35%, 35-55%, and >55%. Also added below-the-line shading.
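A minimal sketch of how models might be assigned to these three ranges. The `RANGES` table and `range_label` helper are hypothetical and not taken from the plotting script; the last range is closed at its upper edge so that 100% identity is included.

```python
# Hypothetical range table: (low, high) in percent sequence identity.
RANGES = [(0, 35), (35, 55), (55, 100)]


def range_label(identity):
    """Return the label of the range containing `identity` (in percent)."""
    top = RANGES[-1][1]
    for low, high in RANGES:
        # Ranges are half-open [low, high), except the last one,
        # which also includes its upper edge (100%).
        if low <= identity < high or (high == top and identity == high):
            return f"{low}-{high}%"
    raise ValueError(f"sequence identity out of range: {identity}")


labels = [range_label(v) for v in (0, 34.9, 35, 54.9, 55, 100)]
```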

Updated sequence identity distribution, from 100% sequence identity to 0% sequence identity. (matplotlib seems to have lost the right-hand axis label - not sure yet how to fix that)

jchodera commented 9 years ago

Looking good, but we should fix the "55-101%" to be "55-100%".

danielparton commented 9 years ago

Yep, I fixed that and edited my previous post with the corrected figure.

Also, here is an updated figure for the sequence identity distribution:

jchodera commented 9 years ago

Thanks! This is great!

You don't need numeric labels on the "probability density" axis (e.g. 0.00, 0.02, 0.04...), but this looks great otherwise!

danielparton commented 9 years ago

Updated:
