feat: redefine potential energy terms between atom pairs, depending on bond distance

DaniBodor commented 1 year ago

Currently, we have 2 sets of van der waals parameters, depending on whether it is an inter- (higher energy) or intra-chain (lower energy) edge. The difference is approx. 10-fold for the epsilon parameter (which is used as a linear multiplier for the final "vanderwaals" edge-feature). However, this is not the correct way to deal with these parameters. Intra-chain interactions should also use the low-energy parameter if they are separated by at least 3 covalent "hops" (I'll call this covalent_distance below).

Ideally, we should determine for each edge whether it is within a cutoff covalent_distance to decide which parameter should be used. However:

it is not immediately apparent what a good way of doing this other than traversing the graph (which is likely computationally expensive). We should find out if there is a more convenient way to determine the covalent_distance (or at least whether it's within 3). Maybe Erik Dekker knows, or maybe there are some graph theory experts out there that we can ask.
we are using a distance cutoff as a proxy for covalency, which we expect to be a reasonable predictor. However, the error gets cubed at a covalent_distance of 3 (each hop has the potential for error). This is not really an issue if the error rate is very low, but can start becoming significant if the single-hop error rate is on the order to 10%.

Alternative options that are less accurate but computationally easy:

Use a single set of van der waals parameters (probably high energy) for ALL edges, irrespective of chain identity.
- In this case we are overestimating intra-chain edges with covalent_distance <= 3.
Leave it as is, i.e. ALL intra-chain edges are considered low energy, while inter-chain edges are high energy.
- In this case, we are underestimating intra-chain edges with covalent_distance > 3.
Use covalency as a proxy for using low energy instead of the current same-chain-ness.
- In this case we are only overestimating intra-chain edges with covalent_distance == 2 or 3.
- This is kind of a compromise between the previous 2 options.
Disconsider all intra-chain vdw interactions (i.e. set them to 0).
- In this case we are underestimating ALL intra-chain interactions, but there might be a separate reason for doing this (I didn't understand why).
Use a distance-based cutoff to determine whether an edge is within a covalent_distance of 3, analogous to the way we determine whether a bond is covalent or not.
- Not sure how to determine a good cutoff value, given that a covalent_distance of 3 has a large full rotational freedom in 3D.

DaniBodor commented 1 year ago

@LilySnow and @cbaakman, can you take a look and tell me whether I accurately summarized the issue and what you think of the potential solutions I proposed or whether you can think of any others?

I actually think that given that we only need to know whether a bond is or is not within a covalent distance of 2 (@LilySnow, did I understand correctly that 3 or more is "far"?), it might be ok to "traverse the graph" one extra step, but I also kind of like my alternative solutions 3 and 5, which I only thought of after leaving.

cbaakman commented 1 year ago

I would prefer using a distance cutoff. (± 5 Å ?) Since covalent bond definitions are also distance-based.

Within that distance and intra-chain, use low energy vanderwaals parameters. Otherwise, high energy vanderwaals parameters.

LilySnow commented 1 year ago

To identify the number of hops between two nodes, we can use adjacency matrix A. For example, if we want to identify which two atoms are separated by 3 hops (3 edges), we can use A^3 (we need to remove repeated hops, of course. This can be done easily by checking A, A^2).

For rings this method may not work well. But we are only interested atoms that are separated by 1 to 3 bonds. So all rings in the amino acids are excluded, I think.

There are four numerical values in protein-allhdg5-4.param (image below). According to Alex, Column 1 and 2 are the epsilon and sigma values for atoms separated by >= 4 bonds. Columns 3 and 4 for those separated by 1 to 3 bonds. Columns 3 is almost 10 times smaller than column 1. We want to downplay the importance of Evdw for atoms seperated by 1 to 3 bonds. This is a standard think in molecular simulations.

DaniBodor commented 1 year ago

To identify the number of hops between two nodes, we can use adjacency matrix A. For example, if we want to identify which two atoms are separated by 3 hops (3 edges), we can use A^3 (we need to remove repeated hops, of course. This can be done easily by checking A, A^2).

Is what you are suggesting here the same as what we previously called "traversing the graph"? If not, it is not clear to me how to construct this matrix.

There might be out-of-the-box solutions for finding this as well. Potentially a variation of this might work.

cbaakman commented 1 year ago

So A^3 would have the same effect as traversing the graph? Find all atom pairs with at least 3 bonds between them?

DaniBodor commented 1 year ago

This should be the way to do it (copied from slack conversation):

# for distance matrix D

A = D < 2.0
A2 = np.matmul(A, A)
A3 = np.matmul(A2, A)

C = A || A2 || A3

vdw = _get_high_vdw(D)
vdw[C] = _get_low_vdw(D)[C]

DaniBodor commented 1 year ago

# for distance matrix D

A = D < 2.0
A2 = np.matmul(A, A)
A3 = np.matmul(A2, A)

C = A || A2 || A3

vdw = _get_high_vdw(D)
vdw[C] = _get_low_vdw(D)[C]
@cbaakman , very nice solution! Should be relatively straightforward to implement.

Quick question: A3 and C are identical, right? C = A || A2 || A3 does not work for me (I'm guessing this is a python 3.10 thing, and I am working with 3.9), but I think it's the same as doing C = A + A2 + A3 In my case, assert np.all(C==A3) passes.

cbaakman commented 1 year ago

Quick question: A3 and C are identical, right? C = A || A2 || A3 does not work for me (I'm guessing this is a python 3.10 thing, and I am working with 3.9), but I think it's the same as doing C = A + A2 + A3 In my case, assert np.all(C==A3) passes.

I tested it earlier and it seemed that A3 works just as well as C. I have no mathematical proof however. I'd say just use A3.

DaniBodor commented 1 year ago

I tested it earlier and it seemed that A3 works just as well as C. I have no mathematical proof however. I'd say just use A3.

Mathematically it also makes sense. If you're multiplying true*true (1*1), as we're doing to calculate A3 for residues that are only 2 hops away, you still get true. Adding any number of falses (0) to that, still leaves it as true.

LilySnow commented 1 year ago

I chatted with Alex again today. Here are some corrections for Evdw and Eelec:

We ignore atom pairs that are separated by 1 or 2 bonds. Otherwise Evdw will be very high.
For atom pairs that are separated by 3 bonds (the so-called 1-4 interactions in Molecular Dynamics), we use the last two columns of the parameter file.

This is also described here: the “Exclusions from Non-Bonded Interactions” section of Practical considerations for Molecular Dynamics

DaniBodor commented 1 year ago

What that section says is:

Pairs of atoms connected by chemical bonds are normally excluded from computation of non-bonded interactions because bonded energy terms replace non-bonded interactions. In biomolecular force fields all pairs of connected atoms separated by up to 2 bonds (1-2 and 1-3 pairs) are excluded from non-bonded interactions.

However, in deeprankcore we are not explicitly using any bonded energies, so there is nothing to replace them in our case. We do have a a feature for covalency, but that is only for the 1-2 interaction, not the 1-3 interaction. Moreover, this is a binary feature, not a continuous value, as you would expect for energies.

A few thoughts:

We should include (some of the) bonded energies (bond potential, angle potential, etc) mentioned in this article as features.
- Alternatively, can we get away with maintaining the high vdw potentials for all interactions as an approximation?
Perhaps we should replace the binary covalent feature with a categorical feature for bonding distance, with options ['1-2', '1-3', '1-4', 'far'].
Also, relevant is that until now, we had only discussed this issue in the context of VDW energy, but from the lesson (and your comment above) it seems that this applies to both vdw and electrostatic energy.
It is unclear to me how all this influences the energies we are calculating for residue level graphs. This is currently done by summing all the energies for each atom in residue 1 to each atom in residue 2. Is this still accurate?
Finally, looking at the Combining Rules, it seems that the default rule should be to use geometric means for both sigma and epsilon. However, deeprankcore is using what they call the "Lorentz–Berthelot" method, where geometric mean is used for epsilon, but the arithmetic mean is used for sigma. Any specific reason for this?

LilySnow commented 1 year ago

Thanks for these thoughts.

For bond energies, we can of course add them later.
Indeed, this issue applies to both Vdw and Eelec
Yes, it is quite standard in molecular dynamics that we sum up energies
For the combining rules, we choose the current one because we are using OPLS potentials and its equations. I know it is old... that's what haddock is using...

DaniBodor commented 1 year ago

Thanks for clarifying, Li!

Another thought occurred to me now: Is there any reason we are using the different potentials as separate features? I guess that atoms don't "experience" vdw energy separate from bond energy, separate from electrostatic energy. I guess we should have a sum energy between each atom pair that sums all the energies into a single feature. I guess for MD simulations, they can remain separate because the simulation will integrate all of that anyway (not sure if this is true, I have never done such simulations myself), but in our case if we keep them separate, they will be treated as separate entities by the network. We can of course also have each potential separate and then have an additional feature that is the sum of all of them.

In any case, to me it seems weird to remove the non-bond potentials for 1-2 and 1-3 pairs, before adding back a different potential to replace it with. If we do that, suddenly there is no potential between these pairs whatsoever.

DaniBodor commented 1 year ago

Summary of meeting of March 2nd:

We expect that potential energy of atom pairs within one chain is not relevant for PPI queries, but may be for Variant (mutation) queries
- For Variant queries, potentials of intra-residue atom pairs are probably still not relevant.
We would like a single graph architecture for all queries, so we decided to keep all edges and features intact as they are now.
- This includes keeping vdw and electrostatic energy as separate features, because these could be weighted differently in different situations
- We could later implement a setting (or ask users to) to remove all intra-chain edges prior to training, as this will significantly decrease the network size and therefore speed up training and/or improve performance.
We want to remove all non-bond energies for 1-2 and 1-3 atoms pairs, as our calculations are not appropriate to calculate these
- These will be set to 0.
- Li, please remind me how to handle 1-4 pairs? Set them to 0 as well? Or use the low-energy parameters? Or use the same parameters as we do for the everything else.
- We can consider adding bonding energies as a feature in the future. However, this is not trivial to implement, as the angle and dihedral potentials take more than 2 atoms into account.
- One suggestion to deal with this is to divide the energy up over all atoms and assign them as node features onto the atoms rather than as edge features between them
Multiplying the adjacency matrix multiple times to identify 1-3 and 1-4 partners is slow (~12-15 sec on CPU of personal laptop), so we decided to use a distance cutoff as a proxy for such pairings.
- This will be an imperfect approximation, as sometimes atoms can be close by in euclidean distance but not in "pairing distance"
- I will do some tests to estimate what a decent distance cutoff for this is, using the A3 adjacency matrix for one or two structures
- We may (likely?) revisit this in the future to optimize the matrix calculation, e.g. by doing it on the GPU.

LilySnow commented 1 year ago

For 1-4 pairs, we could use the low energy parameters as we do now. Let's keep our modification as simple as possible as I think we may have to revisit this issue later

DaniBodor commented 1 year ago

For 1-4 pairs, we could use the low energy parameters as we do now. Let's keep our modification as simple as possible as I think we may have to revisit this issue later

The simplest solution would be to use high energy parameters for 1-4 pairs, because currently we are using intra-residue status as a proxy. The second simplest solution would be a (newly defined) cutoff distance for this as well.

DaniBodor commented 1 year ago

Below are the results from my cutoff distance test. The last one (1ATN_1w) is the best to look at, largest and most complex structures. Code for this can be found in './tests/features/determine_contact_cutoffs.ipynb' of this branch

My gut feeling is that cutoff of 3.6/4.2 A cutoff distances for 1-3/1-4 pairing are the best compromise for including the vast majority of true positives while minizing false positives, despite if 2.9/3.6 give slightly higher f1 score.

DaniBodor commented 1 year ago

I wonder whether we can use PDB nomenclature to define adjacency instead of using the adjacency matrix. In this way, anything that is not in the same or neighboring residues would always be >1-4 pairing. Within one residue distance, use nomenclature of CA, CB, etc, and convert this to a bond distance. Similar rules can be applied to adjacent residues, where only few options exist to be within a 1-4 pairing distance.

The same logic could be used for applying the covalent feature, instead of the current distance cutoff.

(I made the same comment here)

DeepRank / deeprank2

feat: redefine potential energy terms between atom pairs, depending on bond distance #357