choderalab / ensembler-manuscripts

Manuscript for Ensembler v1

Model superposition figure #10

Closed by danielparton 9 years ago

danielparton commented 9 years ago

This is a first-pass attempt at a superposition of models for Src. I'll also do Abl once we have settled on a style for this figure.

14 models are shown, selected at (roughly) regular intervals along the sequence identity distribution. Models are colored and given a transparency based on sequence identity (100% seq identity = opaque; 0% seq identity = completely transparent). The model indices and sequence identities are as follows:

     0  100.0
   116   97.0
   117   68.7
   190   64.8
   192   64.4
   194   51.6
   201   50.0
   319   44.9
   602   40.1
   935   35.0
  1225   30.0
  3140   25.0
  4207   20.1
  4244   16.3

The models are chosen by taking target sequence identity values at regular intervals from 0-100% and, for each target, finding the model with the closest sequence identity. Duplicates are then removed, hence the uneven spacing of values above. (There's probably a better way of doing this.)
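For reference, a minimal sketch of that selection scheme, assuming the models are available as parallel lists of model indices and sequence identities (the function and variable names here are just illustrative):

```python
import numpy as np

def select_models_by_seqid(model_indices, seqids, n_targets=15):
    """Pick the model closest in sequence identity to each of n_targets
    evenly spaced values between 0 and 100%, dropping duplicates."""
    model_indices = np.asarray(model_indices)
    seqids = np.asarray(seqids)
    selected = []
    for target in np.linspace(0, 100, n_targets):
        closest = model_indices[np.argmin(np.abs(seqids - target))]
        if closest not in selected:  # dedup, hence the uneven spacing above
            selected.append(closest)
    return selected
```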

I've been trying to get a colormap working, so that the models are also colored by sequence identity. That should help distinguish the models even more.

You can see quite high variance in the activation loop (right-hand side between the two lobes), and what looks to be a fairly reasonable degree of variance in the other areas.

jchodera commented 9 years ago

I like this!

In addition, can you try superimposing models selected from each of the seqid bins you used for the other figure? That way we can also visualize the ensembles as a function of seqid classes.

danielparton commented 9 years ago

Same figure with the colormap I mentioned. I think I have a slight preference for this compared to the all-black representation. (@jchodera - I'm also working on the figure you suggested)

jchodera commented 9 years ago

Is the colormap only grey to blue? Maybe red to blue would be good?

danielparton commented 9 years ago

Blue to red version (white in the middle):
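For what it's worth, a blue-white-red mapping like this can be built from a standard matplotlib diverging colormap. A minimal sketch that also folds in the transparency mapping described above (the helper name is just illustrative):

```python
from matplotlib.colors import LinearSegmentedColormap

# Red (low seqid) -> white -> blue (high seqid), matching the description above.
SEQID_CMAP = LinearSegmentedColormap.from_list("seqid", ["red", "white", "blue"])

def seqid_to_rgba(seqid):
    """Map sequence identity (0-100%) to an RGBA color; alpha also scales
    with seqid, so 100% identity is opaque and 0% is fully transparent."""
    r, g, b, _ = SEQID_CMAP(seqid / 100.0)
    return (r, g, b, seqid / 100.0)
```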

danielparton commented 9 years ago

In this version, models are picked randomly (without replacement) from each of three sequence identity classes: 0-35%, 35-55%, and 55-100%.

Three models from each class, for a total of nine.

Coloring/transparency is based on sequence identity as before.

I think I slightly prefer this version, as it shows a similar amount of model variation with fewer models (9 vs. 14).
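A minimal sketch of that sampling procedure, again assuming parallel lists of model indices and sequence identities (names are illustrative):

```python
import random

SEQID_CLASSES = [(0, 35), (35, 55), (55, 100)]

def sample_models_per_class(model_indices, seqids, n_per_class=3, seed=0):
    """Randomly pick n_per_class models (without replacement) from each
    sequence identity class."""
    rng = random.Random(seed)
    selected = []
    for lo, hi in SEQID_CLASSES:
        # upper bounds are exclusive, except that 100% stays in the top class
        in_class = [idx for idx, s in zip(model_indices, seqids)
                    if lo <= s < hi or (hi == 100 and s == 100)]
        selected.extend(rng.sample(in_class, min(n_per_class, len(in_class))))
    return selected
```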

sonyahanson commented 9 years ago

For whatever reason, I kind of prefer the black and white version of this... (2 cents)

danielparton commented 9 years ago

Here's a version with models picked by k-medoids clustering on RMSD. It does a nice job of picking out models over a range of sequence identities. However, the visual appearance of the figure is rather sensitive to the number of clusters. I slightly prefer the look of the previous figure (9 models picked from three sequence identity classes), but maybe we should choose the figure based on the method used rather than on how it looks.

Number of models selected: 9
    47   99.2
   141   68.1
   307   45.7
   443   41.7
   896   36.3
  1113   32.3
  2254   26.0
  2753   26.0
  3591   23.6

Number of models selected: 8
     6  100.0
    81   99.2
   175   67.7
   328   44.1
   443   41.7
  2308   26.0
  3591   23.6
  4102   22.3
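For the record, a minimal sketch of this kind of selection: an all-vs-all RMSD matrix computed with MDTraj plus a basic alternating k-medoids, assuming the models have been loaded as frames of a single trajectory. This is just one way to do it, not necessarily the exact implementation used for the figure.

```python
import numpy as np
import mdtraj as md

def pairwise_rmsd(traj):
    """All-vs-all RMSD matrix (in nm) over the frames of an MDTraj trajectory."""
    dist = np.zeros((traj.n_frames, traj.n_frames))
    for i in range(traj.n_frames):
        dist[i] = md.rmsd(traj, traj, frame=i)
    return dist

def k_medoids(dist, k, n_iter=100, seed=0):
    """Basic alternating k-medoids on a precomputed distance matrix;
    returns the frame indices of the k medoids."""
    rng = np.random.RandomState(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)  # assign each frame to its nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # new medoid = the member minimizing total distance within its cluster
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids
```

Something like k_medoids(pairwise_rmsd(models_traj), 9) would then return the nine selected models (models_traj being the hypothetical trajectory of all models).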

jchodera commented 9 years ago

Is it possible to have 3 subfigures, one for each seqid class?

danielparton commented 9 years ago

Three subfigures, one for each seqid class (transparency removed):

jchodera commented 9 years ago

Love it!

jchodera commented 9 years ago

Are the structures within each subplot selected randomly, equally spaced in seqid, or clustered?

danielparton commented 9 years ago

Randomly. These are the model indices and seqids:

    39   99.2
   125   68.1
   160   67.7
   471   41.3
   806   37.4
   828   36.6
   978   34.6
  1826   27.2
  4185   20.9

jchodera commented 9 years ago

Would it be a pain to try to cluster the conformations in each seqid class?

danielparton commented 9 years ago

No, that should be simple enough. I'll give it a try.

danielparton commented 9 years ago

OK, here's a version made using 3 clusters per seqid class. The cluster centroids are shown.

Number of models selected: 9
     6  100.0
   131   68.1
   175   67.7
   328   44.1
   400   42.4
   560   40.6
  1751   28.0
  1760   27.7
  3579   23.6
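A minimal sketch of how this could be wired up, reusing the pairwise_rmsd and k_medoids helpers sketched earlier in the thread (the seqid class boundaries are the ones given above; other names are illustrative):

```python
import numpy as np

SEQID_CLASSES = [(0, 35), (35, 55), (55, 100)]

def centroids_per_class(traj, seqids, k=3):
    """Cluster the models within each seqid class on RMSD and return the
    medoid (centroid) frames as indices into the full model set."""
    seqids = np.asarray(seqids)
    selected = []
    for lo, hi in SEQID_CLASSES:
        in_class = np.where((seqids >= lo) & ((seqids < hi) | (hi == 100)))[0]
        dist = pairwise_rmsd(traj[in_class])  # RMSDs within this class only
        selected.extend(in_class[k_medoids(dist, k)])
    return selected
```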

jchodera commented 9 years ago

Looks nice!

How did you choose the magic number of 3 centroids per plot?

danielparton commented 9 years ago

Just so there would be 9 models...

What would be a better way of choosing the number of clusters per seqid class?

jchodera commented 9 years ago

9 models is arbitrary, yes? Or was there a reason for that?

Choosing the number of clusters is, in general, a difficult problem. Often, measures of intracluster variance (e.g. the sum of intracluster variances) can help.

For our purposes, I think we just want to give an idea of what the diversity looks like. 3 is pretty uncluttered, and may be optimal for that. But it's possible 4 or 5 might still be useful to look at.
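As a concrete example of that kind of metric, a minimal sketch that sweeps the number of clusters and reports the total distance from each model to its nearest medoid (a rough analogue of summed intracluster variance when working from a distance matrix), reusing the k_medoids helper sketched earlier; one would then look for an elbow in the resulting curve:

```python
import numpy as np

def intracluster_cost(dist, medoids):
    """Sum over all frames of the distance to the nearest medoid."""
    return np.min(dist[:, medoids], axis=1).sum()

def sweep_k(dist, k_values=(2, 3, 4, 5, 6)):
    """Print the k-medoids objective for a range of cluster counts."""
    for k in k_values:
        print(k, intracluster_cost(dist, k_medoids(dist, k)))
```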

I'd say we leave things as is for now and move on to the other things, like computing the interatomic distances associated with kinase activation for Src and Abl.

danielparton commented 9 years ago

Yes, 9 models is arbitrary. The only reasoning was that it looked fairly good (i.e. useful information vs. clutter) when plotting 9 superimposed models. I'll move on for now.

danielparton commented 9 years ago

It turns out I'd applied the seqid colormap incorrectly for this figure too. That's corrected here, and, following discussion with John, all models are now shown in a single superposition. An Abl1 superposition is also included.

To recap: clustering (k=3) is performed on the models within each of three seqid ranges (0-35, 35-55, 55-100). The nine centroid models are rendered, with sequence identity mapped to both color and transparency.

These are the model indices and sequence identities:

Src:
    11  100.0
    21   99.2
   117   68.1
   286   44.1
   348   42.4
   751   36.6
  1186   29.2
  1593   28.0
  3375   24.0

Abl1:
    11  100.0
    21  100.0
    47   99.6
   180   48.8
   388   43.9
   802   39.2
  2438   25.5
  3090   23.9
  3143   23.9

jchodera commented 9 years ago

Looks good! The clustering procedure might still be weird to describe, but let's stick with it for now.