distance_dir::apply_force seems incorrect

HanatoK commented 1 month ago

While testing https://github.com/Colvars/colvars/pull/713 I find that the forces in distance_dir::apply_force seems incorrect (I compare it to pytorch). The numerically correct results seems to be

if (!group1->noforce)
    group1->apply_force(-1.0 * force_tang / (0.5 * dist_v.norm()));

  if (!group2->noforce)
    group2->apply_force(       force_tang / (0.5 * dist_v.norm()));

I know the forces are calculated by some sorts of projection. I don't know why the modified code above gives me the correct numerical results, but the missing of dist_v.norm() in the function looks a bit strange. I need some time to figure it out.

HanatoK commented 1 month ago

@giacomofiorin After digging into I still don't understand how distanceDir should work. The dist2_lgrad and dist2_rgrad do not multiply by 2.0. Why?

cvm::real colvar::distance_dir::dist2(colvarvalue const &x1,
                                      colvarvalue const &x2) const
{
  return (x1.rvector_value - x2.rvector_value).norm2();
}

colvarvalue colvar::distance_dir::dist2_lgrad(colvarvalue const &x1,
                                              colvarvalue const &x2) const
{
  return colvarvalue((x1.rvector_value - x2.rvector_value), colvarvalue::type_unit3vectorderiv);
}

colvarvalue colvar::distance_dir::dist2_rgrad(colvarvalue const &x1,
                                              colvarvalue const &x2) const
{
  return colvarvalue((x2.rvector_value - x1.rvector_value), colvarvalue::type_unit3vectorderiv);
}

If they are multiplied by 2.0, then -force_tang / dist_v.norm() and force_tang / dist_v.norm() should be used for group1 and group2 forces, respectively. You can try the following pytorch script and compare what Colvars does:

#!/usr/bin/env python3
import torch

group1 = torch.tensor([1, 2, 3], dtype=torch.float64, requires_grad=True)
group2 = torch.tensor([-1, -2, -3], dtype=torch.float64, requires_grad=True)

dist_v = group2 - group1
dist_dir = dist_v / torch.linalg.norm(dist_v)

center = torch.tensor([-0.5, 1.0, 0.6], dtype=torch.float64)
center /= torch.linalg.norm(center)

# Force from pytorch (-dV/dx)
energy = 0.5 * 10.0 * torch.sum(torch.square(dist_dir - center))
energy.backward()
print('force1:\n', -group1.grad)
print('force2:\n', -group2.grad)

# What Colvars does
cv_force = -10.0 * (dist_dir - center)
iprod = torch.sum(cv_force * dist_dir)
force_tang = cv_force - iprod * dist_dir

print('force_tang1:\n', -force_tang)
print('force_tang2:\n',  force_tang)

# These should match force1 and force2
print('force_tang1 / dist_v.norm():\n', -force_tang / (torch.linalg.norm(dist_v)))
print('force_tang2 / dist_v.norm():\n',  force_tang / (torch.linalg.norm(dist_v)))

HanatoK commented 1 month ago

@giacomofiorin I think the force projection on distanceDir can be illustrated as follows: distance_dir

force_tang is actually the force on the original vector $\vec{v}$, and to project the force further to $\vec{v}/|\vec{v}|$, you have to scale the force too.

HanatoK commented 1 month ago

Hi @giacomofiorin and @jhenin , I am trying to fix this issue but then I just find I may misunderstand what unit vectors in Colvars are. According to the code in colvarvalue.h: https://github.com/Colvars/colvars/blob/e1220bd61c3508eaac7f9d44bd4e16e95aed2142/src/colvarvalue.h#L738-L741 The distance between two unit 3D vectors are the angle between them, but I don't know why it is defined in such way. In general, people may expect that two unit 3D vectors are just two 3D vectors, and the distance between them are $|\mathbf{u}_1 - \mathbf{u}_2|$.

If the Colvars defines the unit vector in a different definition that is uncommon, then it should be documented to avoid people misunderstand what distanceDir is.

HanatoK commented 1 month ago

Hi @giacomofiorin and @jhenin , I am trying to fix this issue but then I just find I may misunderstand what unit vectors in Colvars are. According to the code in colvarvalue.h:

https://github.com/Colvars/colvars/blob/e1220bd61c3508eaac7f9d44bd4e16e95aed2142/src/colvarvalue.h#L738-L741

The distance between two unit 3D vectors are the angle between them, but I don't know why it is defined in such way. In general, people may expect that two unit 3D vectors are just two 3D vectors, and the distance between them are | u 1 − u 2 | .

If the Colvars defines the unit vector in a different definition that is uncommon, then it should be documented to avoid people misunderstand what distanceDir is.

Should I conclude that the Colvars definition of distance between two unit 3D vectors is the square of the angles between them? Because then the derivative dist2_grad in colvarvalue.cpp doesn't make sense to me: https://github.com/Colvars/colvars/blob/e1220bd61c3508eaac7f9d44bd4e16e95aed2142/src/colvarvalue.cpp#L603-L610

I compute the gradient with respect to x2 as follows:

$$ d^2 = \arccos^2 \left(\mathbf{u}_1\cdot\mathbf{u}_2\right) $$

$$ \begin{split} \frac{\partial d^2}{\partial \mathbf{u}_2}&=\frac{\partial}{\partial \mathbf{u}_2} \arccos^2 \left(\mathbf{u}_1\cdot\mathbf{u}_2\right)\ &=2\arccos \left(\mathbf{u}_1\cdot\mathbf{u}_2\right) \left(-\frac{1}{\sqrt{1-(\mathbf{u}_1\cdot\mathbf{u}_2)^2}}\right)\frac{\partial (\mathbf{u}_1\cdot\mathbf{u}_2)}{\partial \mathbf{u}_2}\ &=2\arccos \left(\mathbf{u}_1\cdot\mathbf{u}_2\right) \left(-\frac{1}{\sqrt{1-(\mathbf{u}_1\cdot\mathbf{u}_2)^2}}\right)\mathbf{u}_1 \end{split} $$

I write the above formula in python and compare it with what Colvars does and the pytorch symbolic gradient:

#!/usr/bin/env python3
import torch
import numpy as np

np.random.seed(1)
# Two random vectors
v1 = np.random.normal(size=(3))
v2 = np.random.normal(size=(3))
# Two unit vectors in pytorch
u1 = torch.tensor(v1 / np.linalg.norm(v1), requires_grad=True)
u2 = torch.tensor(v2 / np.linalg.norm(v2), requires_grad=True)

print("u1 = ", u1, " u1 norm = ", u1.norm())
print("u2 = ", u2, " u2 norm = ", u2.norm())

# How Colvars calculates dist2
dist2 = torch.acos(u1.dot(u2)) * torch.acos(u1.dot(u2))

# Reference symbolic gradient
dist2.backward(retain_graph=True)
print("Reference:\n", u2.grad)

# My derivation of dist2_grad
def dist2_grad_me(u1, u2):
    dot_prod = u1.dot(u2)
    return 2.0 * torch.acos(dot_prod) * (-1.0 / (torch.sqrt(1 - dot_prod * dot_prod))) * u1

print("My derivation of dist2_grad:\n", dist2_grad_me(u1, u2))

# How Colvars calculates dist2_grad
def dist2_grad_colvars(u1, u2):
    cos_t = u1.dot(u2)
    return 2.0 * (cos_t * u1 - u2)

print("Colvars implementation of dist2_grad:\n", dist2_grad_colvars(u1, u2))

# Colvars 2013
def dist2_grad_colvars_2013(u1, u2):
    cos_t = u1.dot(u2)
    sin_t = torch.sqrt(1.0 - cos_t * cos_t)
    dx = 2.0 * sin_t * ((-1.0) * sin_t * u2[0] + cos_t / sin_t * (u1[0] - cos_t * u2[0]))
    dy = 2.0 * sin_t * ((-1.0) * sin_t * u2[1] + cos_t / sin_t * (u1[1] - cos_t * u2[1]))
    dz = 2.0 * sin_t * ((-1.0) * sin_t * u2[2] + cos_t / sin_t * (u1[2] - cos_t * u2[2]))
    return torch.stack((dx, dy, dz))

print("Colvars (2013) implementation of dist2_grad:\n", dist2_grad_colvars_2013(u1, u2))

Running the code above, I get:

u1 =  tensor([ 0.8953, -0.3372, -0.2911], dtype=torch.float64, requires_grad=True)  u1 norm =  tensor(1.0000, dtype=torch.float64, grad_fn=<LinalgVectorNormBackward0>)
u2 =  tensor([-0.3999,  0.3226, -0.8579], dtype=torch.float64, requires_grad=True)  u2 norm =  tensor(1.0000, dtype=torch.float64, grad_fn=<LinalgVectorNormBackward0>)
Reference:
 tensor([-3.2828,  1.2364,  1.0674], dtype=torch.float64)
My derivation of dist2_grad:
 tensor([-3.2828,  1.2364,  1.0674], dtype=torch.float64,
       grad_fn=<MulBackward0>)
Colvars implementation of dist2_grad:
 tensor([ 0.4112, -0.4988,  1.8422], dtype=torch.float64,
       grad_fn=<MulBackward0>)
Colvars (2013) implementation of dist2_grad:
 tensor([ 0.4112, -0.4988,  1.8422], dtype=torch.float64,
       grad_fn=<StackBackward0>)

So I have no idea about what Colvars really want to compute. @giacomofiorin and @jhenin Do you have any ideas?

Update: I have just checked out the initial Colvars code (in https://github.com/Colvars/colvars/blob/0c530e2df0a7658476aaecb3fb578180199b3db5/src/colvarvalue.h#L674-L681) but that is still not what I expect...

giacomofiorin commented 1 month ago

Hi @HanatoK! I think that your conclusion about the gradients is correct, i.e. they are missing a normalization.

This makes sense by observing that the force applied onto a distanceDir CV is a torque. Therefore, when the distance between group1 and group2 increases, but the torque remains the same, the force applied onto group1 and group2 should decrease accordingly.

I looked at very old local code snapshots, preceding the initial commit in this repo. Unlike for the quaternion bug (https://github.com/Colvars/colvars/pull/713#issuecomment-2346452299) I could not find a specific commit that introduced the error, or notes detailing the formal derivation. However, redoing the derivation anew and checking the physical dimensions of the force being applied to the centers of mass groups confirms what you found.

(The current code takes a number that has the unit of energy or torque and applies it as a force: your correction divides it by a length, thus recovering a unit of force.)

This is a good example of the fact that regression tests alone cannot prevent all bugs. This is especially true for very old code that was also used less frequently... Neither @jhenin or I came across a project with a strong motivation to compute a PMF along on this kind of variable, which would have required implementing grids on the unit sphere.

As you also found later, the distance between two distanceDir values is the angle between them (geodesic distance between two points on Riemannian manifold). This could be documented for clarity, along with double-checking the gradient of that angle. That said, I would also be in favor of adding a component that allows that angle to be used directly as a CV?

HanatoK commented 1 month ago

distanceDir

Originally the distance between two distanceDir should be the angle, but later in commit 604bc14946f8f0e495a4f08ee13ec4081addd4a8 you added dist2 for it, which is $||\mathbf{v}_1-\mathbf{v}_2||^2$, but then the dist2_lgrad and dist2_rgrad didn't have the factor 2.0. I could only guess that you found something wrong and wanted to fix it, but I could not infer any details (what you actually encountered and wanted to fix) from the commit itself...

giacomofiorin commented 1 month ago

distanceDir

Originally the distance between two distanceDir should be the angle, but later in commit 604bc14 you added dist2 for it, which is | | v 1 − v 2 | | 2 , but then the dist2_lgrad and dist2_rgrad didn't have the factor 2.0. I could only guess that you found something wrong and wanted to fix it, but I could not infer any details (what you actually encountered and wanted to fix) from the commit itself...

This was probably related to a different bug, reported by JC Gumbart and Anthony Hazel. The bug made the distance_dir function revert to the metric functions of the base class, which are defined for scalar numbers and therefore the code would crash like this:

colvars: Trying to assign a colvar value with type "scalar number" to one with type "3-dimensional unit vector"

Clearly, my fix was done in too much of a hurry.

HanatoK commented 1 month ago

distanceDir

Originally the distance between two distanceDir should be the angle, but later in commit 604bc14 you added dist2 for it, which is | | v 1 − v 2 | | 2 , but then the dist2_lgrad and dist2_rgrad didn't have the factor 2.0. I could only guess that you found something wrong and wanted to fix it, but I could not infer any details (what you actually encountered and wanted to fix) from the commit itself...

This was probably related to a different bug, reported by JC Gumbart and Anthony Hazel. The bug made the distance_dir function revert to the metric functions of the base class, which are defined for scalar numbers and therefore the code would crash like this:
colvars: Trying to assign a colvar value with type "scalar number" to one with type "3-dimensional unit vector"
Clearly, my fix was done in too much of a hurry.

Oh, I see! I saw them at the NAMD Workshop in Dalian in 2017. I asked them about the GB1 system in their paper (https://doi.org/10.1063/1.5025951), and wanted to use the same system and CHARM22* with meta-eABF to show the strength of meta-eABF (unfortunately I failed so it was not appeared in the paper). During the workshop they told me that they kept the reversible unfolding of GB1 only along the z direction, so that they could reduce the size of the water box, and then made the simulation faster since the number of atoms was smaller. To do this they used a distanceDir CV between the end-to-end distance, and restrained it to the z-axis (the following was what I used but I believe they used something similar):

colvar {
    name restraintZ
    distanceDir {
        group1 {atomNumbers {9}}
        group2 {atomNumbers {238}}
    }
}

harmonic {
    colvars restraintZ
    outputEnergy on
    forceConstant 5.0
    centers (0.0, 0.0, 1.0)
}

If the force is incorrect then the dynamics may be affected, though.

giacomofiorin commented 1 month ago

Oh, I see! I saw them at the NAMD Workshop in Dalian in 2017. I asked them about the GB1 system in their paper (https://doi.org/10.1063/1.5025951), and wanted to use the same system and CHARM22* with meta-eABF to show the strength of meta-eABF (unfortunately I failed so it was not appeared in the paper). During the workshop they told me that they kept the reversible unfolding of GB1 only along the z direction, so that they could reduce the size of the water box, and then made the simulation faster since the number of atoms was smaller. To do this they used a distanceDir CV between the end-to-end distance, and restrained it to the z-axis (the following was what I used but I believe they used something similar):
...
If the force is incorrect then the dynamics may be affected, though.

Ah, good to know where they used this feature. I had previously thought that they used in that work a distanceXY variable between the two chain ends (i.e. force their XY coordinates to match each other).

If the distance_dir gradients were lacking the division by the length (per your correction) then the restraint forces along the XY plane would have maintained their original magnitude (i.e. without decreasing) as the peptide chain unfolded. So essentially the effect of the restraint would have been very similar to that of distanceXY after all?

HanatoK commented 1 month ago

Oh, I see! I saw them at the NAMD Workshop in Dalian in 2017. I asked them about the GB1 system in their paper (https://doi.org/10.1063/1.5025951), and wanted to use the same system and CHARM22* with meta-eABF to show the strength of meta-eABF (unfortunately I failed so it was not appeared in the paper). During the workshop they told me that they kept the reversible unfolding of GB1 only along the z direction, so that they could reduce the size of the water box, and then made the simulation faster since the number of atoms was smaller. To do this they used a distanceDir CV between the end-to-end distance, and restrained it to the z-axis (the following was what I used but I believe they used something similar):
...
If the force is incorrect then the dynamics may be affected, though.
Ah, good to know where they used this feature. I had previously thought that they used in that work a distanceXY variable between the two chain ends (i.e. force their XY coordinates to match each other).

If the distance_dir gradients were lacking the division by the length (per your correction) then the restraint forces along the XY plane would have maintained their original magnitude (i.e. without decreasing) as the peptide chain unfolded. So essentially the effect of the restraint would have been very similar to that of distanceXY after all?

I am not sure. What they did was using distance as a CV for the PMF calculation, and then restraining the direction of the same distance vector to z axis.

giacomofiorin commented 1 month ago

@HanatoK The description of this issue contains an extra 1/2 at the denominator compared to the one that matches the PyTorch automatic differentiation right after (link).

The latter is what I get if I differentiate independently: (the second term can be neglected if it is multiplied by a "force" on the unit vector, which is tangential to the unit sphere and thus orthogonal to the vector itself). Does it check out for you?

HanatoK commented 1 month ago

@HanatoK The description of this issue contains an extra 1/2 at the denominator compared to the one that matches the PyTorch automatic differentiation right after (link).

The latter is what I get if I differentiate independently: (the second term can be neglected if it is multiplied by a "force" on the unit vector, which is tangential to the unit sphere and thus orthogonal to the vector itself). Does it check out for you?

Actually there are two slightly different issues. The first one is the length of the vector missing in the denominator. The second one is the 2.0 factor missing in the dist2_lgrad and dist2_rgrad functions. Originally I only noticed the first one, so I thought there should be a 0.5 in the denominator, but latter I noticed the missing 2.0 in the dist2_lgrad and dist2_rgrad functions, so 0.5 is not necessary.

I think your derivation and the fix in https://github.com/Colvars/colvars/pull/724 should be correct.

Colvars / colvars

distance_dir::apply_force seems incorrect #714