choderalab / ensembler-manuscripts

Manuscript for Ensembler v1

Residue pair distances figure #12

Closed danielparton closed 9 years ago

danielparton commented 9 years ago

This is for Src, and displays the distances for the residue pairs K295-E310 and E310-R409.

K295 is required for catalysis. E310 and R409 are also conserved residues. The general idea from the literature is that the formation of a salt-bridge between K295 and E310 constitutes the inactive state, as it takes K295 out of the position necessary for catalysis. In the active state, E310 instead forms a salt-bridge with R409, and K295 is free to contribute to catalysis.

So our models populate a nice broad range within this space.

danielparton commented 9 years ago

Coloring is by sequence identity

bas-rustenburg commented 9 years ago

Since you're using 3 colors maybe a color bar would be useful? Not sure how much work that would be. It could be part of the caption instead.

jchodera commented 9 years ago

Is there a way to make the stars stand out more? They're hard to see!

We also probably want a kinase figure illustrating which distances these are and showing them in 2SRC and 1Y57.

jchodera commented 9 years ago

Also, can we make this square, with the same limits on both axes?

danielparton commented 9 years ago

Turns out I'd got the sequence identity coloring implementation completely wrong. Fixed here, and this also makes the stars stand out more - is this sufficient? Also made the aspect ratio square.

jchodera commented 9 years ago

What's going on with the band of high-seqid low-K295-E310 distance? Is this a third state that is important?

danielparton commented 9 years ago

That band represents the fully formed K295-E310 salt-bridge. My figure would seem to indicate that this salt-bridge is not fully formed in the 1Y57 structure. However, this may be due to the way I measured the distances, by taking the center of mass of the final three heavy atoms of the residue sidechains. I think I should change this to the minimum distance between those atoms. From looking at the 1Y57 structure, I expect this will result in a shorter K295-E310 distance.
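To illustrate why the two measurements can disagree, here is a minimal sketch in plain Python that compares a centroid-to-centroid distance with the minimum pairwise distance between two atom groups. The coordinates are made up for illustration, and the centroid assumes equal atom masses, so it only approximates a true center of mass. This is not the script used to make the figure.

```python
import math
from itertools import product


def centroid(coords):
    """Unweighted mean position of a group of (x, y, z) points."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def centroid_distance(group_a, group_b):
    """Distance between the centroids of two atom groups."""
    return math.dist(centroid(group_a), centroid(group_b))


def minimum_distance(group_a, group_b):
    """Smallest distance between any atom in group_a and any atom in group_b."""
    return min(math.dist(a, b) for a, b in product(group_a, group_b))


# Hypothetical coordinates (in Angstroms) for the final three heavy atoms
# of two sidechains pointing towards each other along the x axis.
lys_tip = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
glu_tip = [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0), (7.0, 0.0, 0.0)]

print(centroid_distance(lys_tip, glu_tip))
print(minimum_distance(lys_tip, glu_tip))
```

The centroid distance is never smaller than the minimum distance, so a salt-bridge whose closest atoms are in contact can still look "open" under the centroid measure.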

kyleabeauchamp commented 9 years ago

See md.compute_contacts for minimum distance calculation

danielparton commented 9 years ago

Here's the modified figure made with the minimum distance contacts I described (using md.compute_contacts - thanks Kyle). The reference structures fit very nicely with the models data now. I also modified the colormap to match other figures.

danielparton commented 9 years ago

Figures for both Src and Abl.

Src:

Abl1:

jchodera commented 9 years ago

Looks good, but are the residue numberings really the same for Src and Abl?

danielparton commented 9 years ago

Good point. I've checked a few Abl papers and changed this to the canonical numbering scheme.

jchodera commented 9 years ago

Let's make a point of specifying which numbering scheme we used in the figure caption and text.

danielparton commented 9 years ago

That sounds very good to me (if only every paper did that...). How should we specify the numbering scheme though - just cite a relevant paper?

The numbering scheme for Src is not the same as UniProt, but I think it's probably better to go with the "canonical" numbering scheme, i.e. that which is used in most papers.

jchodera commented 9 years ago

Not sure. @sonyahanson: any ideas?

The safest option is always to also list the sequences and corresponding numberings in the SI so there can be no uncertainty about numbering.

danielparton commented 9 years ago

Yeah I think it would be fine to cite a structure paper and also list the sequences with numberings in the SI.

sonyahanson commented 9 years ago

Seems common for Src papers to use chicken Src numbering instead of human Src numbering, even if they are referring to the human protein. For example, Nick Levinson uses chicken Src numbering instead of human Src numbering in his 2014 Nature Chemical Biology paper, despite the fact that he solves three structures of human Src. This means, for example, that the gatekeeper mutant is T338 instead of T341, as it would be according to human UniProt numbering. In that paper, this is annotated the first time any residue numbers are mentioned: "a conserved, catalytically important, glutamate residue (Glu310 in Src, chicken c-Src numbering), which...".

jchodera commented 9 years ago

Good point. For simplicity, what if we used UniProt numbering throughout our paper and figures, mentioned the equivalent canonical and chicken-Src residue numbers in the text and figure, and then included an SI figure for Src and Abl with the major numbering conventions?

This raises another important question: What numbering scheme do our resulting models end up using?

danielparton commented 9 years ago

I actually think we should use the standard literature numbering system (i.e. use the chicken-Src numbering scheme for human Src) in our paper, since that is what most/all other papers use. This will make it easier for kinase experts to read and understand the paper. For example, people often talk about the Abl T315I mutation, and I think that changing the numbering scheme would rather confuse people.

UniProt numbering makes much more sense when working with an automated pipeline, but I think we should translate to the literature standard in our papers. I don't think we need to mention UniProt numbering at all in the paper.

Ensembler model resids are reset to a 1-based system, and so in many cases will not correspond to the literature or UniProt numbering schemes. My preference would be to stick with this system - we just need to be aware of it when comparing with other numbering schemes. Changing the Ensembler code to use an external numbering scheme would require the user to supply this numbering scheme along with the sequence.

sonyahanson commented 9 years ago

Is T315 not uniprot for human Abl1?

danielparton commented 9 years ago

Ok, that was a bad example since the UniProt numbering is the same as the "canonical literature" numbering. The Src T338 gatekeeper mutant is a better example - changing it to T341 could be a source of confusion.
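As a rough sketch of what translating between these schemes could look like, the snippet below converts a literature (chicken-Src) residue label to human UniProt numbering, and a UniProt residue number to a 1-based model index. The offset is inferred only from the T338/T341 gatekeeper example above and is assumed to be constant over the region of interest; the function names and the model start residue (270) are hypothetical and are not part of Ensembler.

```python
# Offset inferred from the gatekeeper example in this thread (T338 -> T341).
# Assumption: the same offset applies to every residue being converted.
CHICKEN_TO_HUMAN_SRC_OFFSET = 341 - 338


def literature_to_uniprot(label, offset=CHICKEN_TO_HUMAN_SRC_OFFSET):
    """Convert a label such as 'T338' to UniProt numbering, e.g. 'T341'."""
    amino_acid, number = label[0], int(label[1:])
    return f"{amino_acid}{number + offset}"


def uniprot_to_model_resid(uniprot_resid, first_uniprot_resid):
    """1-based index of a UniProt residue within a model that starts at
    first_uniprot_resid (models are numbered from 1)."""
    return uniprot_resid - first_uniprot_resid + 1


print(literature_to_uniprot("T338"))
# 270 is a hypothetical start residue chosen only for illustration.
print(uniprot_to_model_resid(341, first_uniprot_resid=270))
```

A constant offset will break wherever the two sequences differ by an insertion or deletion, which is one argument for listing the sequences with their numberings explicitly in the SI.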

jchodera commented 9 years ago

This sounds reasonable, but I worry about:

  • what do we do in general for other kinases when historical number is not available?

  • what numbering scheme do we use for our models?

I think the only way for this to really scale or apply to other systems is to use UniProt numbering throughout the models and scripts. We can certainly use the kinase-specific numbering in our paper (with a mention of the UniProt numbering), but elsewhere, we should be careful what we use.

danielparton commented 9 years ago

This sounds reasonable, but I worry about:

  • what do we do in general for other kinases when historical number is not available?

What do you mean by "historical number"? I'm going to use "canonical literature numbering" to refer to the numbering scheme most commonly used in the literature (e.g. chicken-Src numbering for Src) - is this what you mean? If the canonical literature numbering is not clear, then we can just use UniProt numbering?

  • what numbering scheme do we use for our models?

I think the only way for this to really scale or apply to other systems is to use Uniprot numbering throughout the models and scripts. We can certainly use the kinase-specific numbering in our paper (with a mention of the Uniprot numbering), but elsewhere, we should be careful what we use.

Ok, well I can implement this in the code, and write a script to modify the existing models on the cluster. I think this should be less than one day of work, if all goes well (...)

Btw, I don't think this will be a common situation, but I'm wondering what we would do if we ever want to use an insertion mutant as a target. We could do ['1', '2', '2A', '2B', '3', ...], for example, for a two-aa insertion between UniProt residues 2 and 3.
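The labelling idea above can be sketched as follows. The function name and its arguments are hypothetical. Note that the PDB file format has a one-character insertion code field next to the residue number that serves the same purpose, so labels like '2A' map naturally onto it.

```python
import string


def label_with_insertions(n_reference, insertions):
    """Build residue labels for a target with insertions.

    n_reference: number of residues in the reference (e.g. UniProt) sequence.
    insertions: dict mapping a reference residue number to the number of
        residues inserted directly after it.
    Supports at most 26 inserted residues per position (letters A-Z).
    """
    labels = []
    for resid in range(1, n_reference + 1):
        labels.append(str(resid))
        for k in range(insertions.get(resid, 0)):
            labels.append(f"{resid}{string.ascii_uppercase[k]}")
    return labels


# Two-residue insertion between reference residues 2 and 3.
print(label_with_insertions(3, {2: 2}))
```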

jchodera commented 9 years ago

What do you mean by "historical number"? I'm going to use "canonical literature numbering" to refer to the numbering scheme most commonly used in the literature (e.g. chicken-Src numbering for Src) - is this what you mean?

Yep! Sorry for being unclear.

Ok, well I can implement this in the code, and write a script to modify the existing models on the cluster. I think this should be less than one day of work, if all goes well (...)

Sorry for having neglected this earlier. I like the idea of having an optional final stage that renumbers models to have the desired residue numbering. It may be useful to allow user-specified numbering schemes (in case they want to use a canonical or historical numbering scheme) with the Uniprot numbering scheme being the default.

Btw, I don't think this will be a common situation, but I'm wondering what we would do if we ever want to use an insertion mutant as a target. We could do ['1', '2', '2A', '2B', '3', ...], for example, for a two-aa insertion between UniProt residues 2 and 3.

It seems like there should be a standard here. What does the PDB say about this?

danielparton commented 9 years ago

Turned out the annotations of 1Y57 and 2SRC which I was using were mislabeled - the former is active and the latter inactive. Figures are corrected here. Also note that we will use these two reference structures (both Src) for both the Src and Abl distance plots.

I've also added Src renderings to show the distances being measured. Note that the ANP molecule rendered in the active state (1Y57) is not present in the original crystal structure - it is simply copied from the 2SRC structure and aligned according to the surrounding protein residues.

danielparton commented 9 years ago

Ok, I've written a residue renumbering routine which outputs PDB files for each target with the UniProt residue numbering. It does this for the implicit-solvent and (if available) explicit-solvent model files. Solvent molecules (water and ions) in the explicit-solvent files are given chain ID "B" and residue numbers starting from 1.
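For readers who want to see the general shape of such a routine, here is a minimal sketch that renumbers PDB ATOM/HETATM records using the fixed-column format. It is not the Ensembler implementation: the function name, the set of solvent residue names, and the start residue used in the example are all assumptions, and it does not handle residue numbers above 9999 or targets with insertions.

```python
# Assumed residue names for water and ions; adjust for the force field used.
SOLVENT_RESNAMES = {"HOH", "WAT", "NA", "CL"}


def renumber_pdb_lines(lines, first_resid, solvent_resnames=SOLVENT_RESNAMES):
    """Renumber ATOM/HETATM records in a list of PDB lines.

    Non-solvent residues keep their chain ID and are numbered consecutively
    from first_resid. Solvent residues are moved to chain "B" and numbered
    from 1. Other record types are passed through unchanged.
    """
    out = []
    prev_key = None      # identifies the residue of the previous atom record
    protein_index = -1   # 0-based count of non-solvent residues seen so far
    solvent_index = 0    # 1-based count of solvent residues seen so far
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            out.append(line)
            continue
        resname = line[17:20].strip()
        # Chain ID, residue number plus insertion code, and residue name.
        key = (line[21], line[22:27], resname)
        is_solvent = resname in solvent_resnames
        if key != prev_key:
            if is_solvent:
                solvent_index += 1
            else:
                protein_index += 1
            prev_key = key
        if is_solvent:
            chain, resid = "B", solvent_index
        else:
            chain, resid = line[21], first_resid + protein_index
        # Columns 22 (chain), 23-26 (residue number), 27 (insertion code, blanked).
        out.append(f"{line[:21]}{chain}{resid:>4d} {line[27:]}")
    return out
```

A call such as `renumber_pdb_lines(lines, first_resid=270)` would number the first non-solvent residue 270; that value is only an example, and in practice it would come from aligning the target sequence to its UniProt entry.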

It's working fine on my laptop, and I've just scheduled a job on the cluster which will run through the TKs.

We'll punt for now the question of what to do about target sequences with mutations, insertions or deletions.

jchodera commented 9 years ago

Excellent!