DeepRank / deeprank2

An open-source deep learning framework for data mining of protein-protein interfaces or single-residue variants.
https://deeprank2.readthedocs.io/en/latest/?badge=latest
Apache License 2.0
34 stars 10 forks source link

feat: redefine potential energy terms between atom pairs, depending on bond distance #357

Closed DaniBodor closed 1 year ago

DaniBodor commented 1 year ago

Currently, we have 2 sets of van der waals parameters, depending on whether it is an inter- (higher energy) or intra-chain (lower energy) edge. The difference is approx. 10-fold for the epsilon parameter (which is used as a linear multiplier for the final "vanderwaals" edge-feature). However, this is not the correct way to deal with these parameters. Intra-chain interactions should also use the low-energy parameter if they are separated by at least 3 covalent "hops" (I'll call this covalent_distance below).

Ideally, we should determine for each edge whether it is within a cutoff covalent_distance to decide which parameter should be used. However:

Alternative options that are less accurate but computationally easy:

  1. Use a single set of van der waals parameters (probably high energy) for ALL edges, irrespective of chain identity.
    • In this case we are overestimating intra-chain edges with covalent_distance <= 3.
  2. Leave it as is, i.e. ALL intra-chain edges are considered low energy, while inter-chain edges are high energy.
    • In this case, we are underestimating intra-chain edges with covalent_distance > 3.
  3. Use covalency as a proxy for using low energy instead of the current same-chain-ness.
    • In this case we are only overestimating intra-chain edges with covalent_distance == 2 or 3.
    • This is kind of a compromise between the previous 2 options.
  4. Disconsider all intra-chain vdw interactions (i.e. set them to 0).
    • In this case we are underestimating ALL intra-chain interactions, but there might be a separate reason for doing this (I didn't understand why).
  5. Use a distance-based cutoff to determine whether an edge is within a covalent_distance of 3, analogous to the way we determine whether a bond is covalent or not.
    • Not sure how to determine a good cutoff value, given that a covalent_distance of 3 has a large full rotational freedom in 3D.
DaniBodor commented 1 year ago

@LilySnow and @cbaakman, can you take a look and tell me whether I accurately summarized the issue and what you think of the potential solutions I proposed or whether you can think of any others?

I actually think that given that we only need to know whether a bond is or is not within a covalent distance of 2 (@LilySnow, did I understand correctly that 3 or more is "far"?), it might be ok to "traverse the graph" one extra step, but I also kind of like my alternative solutions 3 and 5, which I only thought of after leaving.

cbaakman commented 1 year ago

I would prefer using a distance cutoff. (± 5 Å ?) Since covalent bond definitions are also distance-based.

Within that distance and intra-chain, use low energy vanderwaals parameters. Otherwise, high energy vanderwaals parameters.

LilySnow commented 1 year ago
  1. To identify the number of hops between two nodes, we can use adjacency matrix A. For example, if we want to identify which two atoms are separated by 3 hops (3 edges), we can use A^3 (we need to remove repeated hops, of course. This can be done easily by checking A, A^2).

For rings this method may not work well. But we are only interested atoms that are separated by 1 to 3 bonds. So all rings in the amino acids are excluded, I think.

  1. There are four numerical values in protein-allhdg5-4.param (image below). According to Alex, Column 1 and 2 are the epsilon and sigma values for atoms separated by >= 4 bonds. Columns 3 and 4 for those separated by 1 to 3 bonds. Columns 3 is almost 10 times smaller than column 1. We want to downplay the importance of Evdw for atoms seperated by 1 to 3 bonds. This is a standard think in molecular simulations. image
DaniBodor commented 1 year ago

To identify the number of hops between two nodes, we can use adjacency matrix A. For example, if we want to identify which two atoms are separated by 3 hops (3 edges), we can use A^3 (we need to remove repeated hops, of course. This can be done easily by checking A, A^2).

Is what you are suggesting here the same as what we previously called "traversing the graph"? If not, it is not clear to me how to construct this matrix.

There might be out-of-the-box solutions for finding this as well. Potentially a variation of this might work.

cbaakman commented 1 year ago

So A^3 would have the same effect as traversing the graph? Find all atom pairs with at least 3 bonds between them?

DaniBodor commented 1 year ago

This should be the way to do it (copied from slack conversation):

# for distance matrix D

A = D < 2.0
A2 = np.matmul(A, A)
A3 = np.matmul(A2, A)

C = A || A2 || A3

vdw = _get_high_vdw(D)
vdw[C] = _get_low_vdw(D)[C]
DaniBodor commented 1 year ago
# for distance matrix D

A = D < 2.0
A2 = np.matmul(A, A)
A3 = np.matmul(A2, A)

C = A || A2 || A3

vdw = _get_high_vdw(D)
vdw[C] = _get_low_vdw(D)[C]

@cbaakman , very nice solution! Should be relatively straightforward to implement.

Quick question: A3 and C are identical, right? C = A || A2 || A3 does not work for me (I'm guessing this is a python 3.10 thing, and I am working with 3.9), but I think it's the same as doing C = A + A2 + A3 In my case, assert np.all(C==A3) passes.

cbaakman commented 1 year ago

Quick question: A3 and C are identical, right? C = A || A2 || A3 does not work for me (I'm guessing this is a python 3.10 thing, and I am working with 3.9), but I think it's the same as doing C = A + A2 + A3 In my case, assert np.all(C==A3) passes.

I tested it earlier and it seemed that A3 works just as well as C. I have no mathematical proof however. I'd say just use A3.

DaniBodor commented 1 year ago

I tested it earlier and it seemed that A3 works just as well as C. I have no mathematical proof however. I'd say just use A3.

Mathematically it also makes sense. If you're multiplying true*true (1*1), as we're doing to calculate A3 for residues that are only 2 hops away, you still get true. Adding any number of falses (0) to that, still leaves it as true.

LilySnow commented 1 year ago

I chatted with Alex again today. Here are some corrections for Evdw and Eelec:

  1. We ignore atom pairs that are separated by 1 or 2 bonds. Otherwise Evdw will be very high.
  2. For atom pairs that are separated by 3 bonds (the so-called 1-4 interactions in Molecular Dynamics), we use the last two columns of the parameter file.

This is also described here: the “Exclusions from Non-Bonded Interactions” section of Practical considerations for Molecular Dynamics

DaniBodor commented 1 year ago

What that section says is:

Pairs of atoms connected by chemical bonds are normally excluded from computation of non-bonded interactions because bonded energy terms replace non-bonded interactions. In biomolecular force fields all pairs of connected atoms separated by up to 2 bonds (1-2 and 1-3 pairs) are excluded from non-bonded interactions.

However, in deeprankcore we are not explicitly using any bonded energies, so there is nothing to replace them in our case. We do have a a feature for covalency, but that is only for the 1-2 interaction, not the 1-3 interaction. Moreover, this is a binary feature, not a continuous value, as you would expect for energies.

A few thoughts:

LilySnow commented 1 year ago

Thanks for these thoughts.

  1. For bond energies, we can of course add them later.
  2. Indeed, this issue applies to both Vdw and Eelec
  3. Yes, it is quite standard in molecular dynamics that we sum up energies
  4. For the combining rules, we choose the current one because we are using OPLS potentials and its equations. I know it is old... that's what haddock is using...
DaniBodor commented 1 year ago

Thanks for clarifying, Li!

Another thought occurred to me now: Is there any reason we are using the different potentials as separate features? I guess that atoms don't "experience" vdw energy separate from bond energy, separate from electrostatic energy. I guess we should have a sum energy between each atom pair that sums all the energies into a single feature. I guess for MD simulations, they can remain separate because the simulation will integrate all of that anyway (not sure if this is true, I have never done such simulations myself), but in our case if we keep them separate, they will be treated as separate entities by the network. We can of course also have each potential separate and then have an additional feature that is the sum of all of them.

In any case, to me it seems weird to remove the non-bond potentials for 1-2 and 1-3 pairs, before adding back a different potential to replace it with. If we do that, suddenly there is no potential between these pairs whatsoever.

DaniBodor commented 1 year ago

Summary of meeting of March 2nd:

LilySnow commented 1 year ago

For 1-4 pairs, we could use the low energy parameters as we do now. Let's keep our modification as simple as possible as I think we may have to revisit this issue later

DaniBodor commented 1 year ago

For 1-4 pairs, we could use the low energy parameters as we do now. Let's keep our modification as simple as possible as I think we may have to revisit this issue later

The simplest solution would be to use high energy parameters for 1-4 pairs, because currently we are using intra-residue status as a proxy. The second simplest solution would be a (newly defined) cutoff distance for this as well.

DaniBodor commented 1 year ago

Below are the results from my cutoff distance test. The last one (1ATN_1w) is the best to look at, largest and most complex structures. Code for this can be found in './tests/features/determine_contact_cutoffs.ipynb' of this branch

My gut feeling is that cutoff of 3.6/4.2 A cutoff distances for 1-3/1-4 pairing are the best compromise for including the vast majority of true positives while minizing false positives, despite if 2.9/3.6 give slightly higher f1 score.

image

DaniBodor commented 1 year ago

I wonder whether we can use PDB nomenclature to define adjacency instead of using the adjacency matrix. In this way, anything that is not in the same or neighboring residues would always be >1-4 pairing. Within one residue distance, use nomenclature of CA, CB, etc, and convert this to a bond distance. Similar rules can be applied to adjacent residues, where only few options exist to be within a 1-4 pairing distance.

The same logic could be used for applying the covalent feature, instead of the current distance cutoff.

(I made the same comment here)