Frechet distance returns larger value for 2D array

MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.

https://mdanalysis.org

Other

1.28k stars 644 forks source link

Frechet distance returns larger value for 2D array #1385

Closed cjekel closed 7 years ago

cjekel commented 7 years ago

Expected behaviour

Frechet distance should work with 2D numpy arrays as it works with 3D numpy arrays

Actual behaviour

Frechet distance is returning larger value for 2D numpy arrays

Code to reproduce the behaviour

import numpy as np
import MDAnalysis.analysis.psa as psa

n=100
x = np.random.random(n)
y = np.random.random(n)
xN = np.random.random(n)
yN = np.random.random(n)

#   2D example
P = np.zeros([len(x),2])
Q = np.zeros([len(xN),2])
P[:,0] = x
P[:,1] = y
Q[:,0] = xN
Q[:,1] = yN

FD2d = psa.discrete_frechet(P,Q)

#   3D example
P = np.zeros([len(x),3])
Q = np.zeros([len(xN),3])
P[:,0] = x
P[:,1] = y
Q[:,0] = xN
Q[:,1] = yN

FD3d = psa.discrete_frechet(P,Q)

print('2D Frechet Distance:',FD2d)
print('3D Frechet Distance:',FD3d)

Sample code output: ('2D Frechet Distance:', 1.1872308520775383) ('3D Frechet Distance:', 0.96936993149322093)

Currently version of MDAnalysis:

0.16.1

kain88-de commented 7 years ago

@sseyler Could you have a look at this.

tylerjereddy commented 7 years ago

Those inputs are not the same. Both are 2D numpy arrays with differing numbers of columns. The number of atoms will be interpreted as path.shape[1] / 3, so each case has a different number of atoms, and the mess in the comparison will propagate from there. I can imagine the zero-filled values in the examples above being interleaved during this 'flattening' of the input during interpretation by get_coord_axes. I'm also thinking that technically one of these examples would be interpreted as having a non-integer number of atoms (2/3), so an exception should perhaps be raised in that case.

Confused yet? I think the API / docs could perhaps use some improvement on this matter, but it is tricky beacause the manuscript emphasizes the ability to compare multiple dimensions (not just track a single particle per frame, etc.).

I'd prefer to focus my efforts on a scipy interpretation of Frechet in collaboration with Sean and Oli, but I'm happy to review any improvements @sseyler makes here. Note that I also wrote an MDA unit test for discrete frechet while I was writing scipy.spatial.distance.directed_hausdorff, and that test code seems to be passing when it should not be--it should suffer the same issue as the user-provided input here, but for some reason the expected value is retained--perhaps because of the simplicity of the circular data, but not sure. The (100,3) data shape should probably be reformulated to a single frame (1,100,3) for both P and Q in that test, so that each row of the timeframe is a coordinate in the circle with z = 0.

Note that both hausdorff and frechet docstrings describe "3N" input coordinates, so I wouldn't necessarily expect "2N" support, and I don't think there is because of the quotient value in get_coord_axes I've described above (though I think a scipy version should support N-dimensional I'm less convinced in an MD-only context [though arbitrary analysis parameters may be useful as emphasized in paper]--the problem here is the potential for confusion when you think you are providing a planar 2D input). The data structures for both include timesteps so the example docstring for Frechet has P with shape: (49, 214, 3).

That's 49 timesteps, each with the positions of 214 atoms in 3 dimensions. You have to be really really careful when thinking about your input data structure for these shape analysis codes.

You could consider adding some kind of warning for the user when using 2D numpy arrays, (x,y) even if y is 3--a bit more explicit notification about how that data is being interpreted as referring to 3D Cartesian coords.

tylerjereddy commented 7 years ago

So the unit test I wrote is probably passing because it is assumed to have 1 atom (path.shape[1] / 3) with 100 coordinate values, from the original (100,3) data structures for P and Q. So it is kind of like two circular (single atom?) proteins in a single timestep when I simply meant to calculate the discrete Frechet distance between two sets of points outside of any kind of specific MD concept--i.e., 100 3-D points in the trajectories of two different single particles. The two scenarios are equivalent in discrete Frechet result, but I maintain that there's at least some potential source for confusion here.

The test might then be adjusted to have shape (100,1,3) for P and Q, though (1,100,3) may give the same result despite the different functional meaning. Those should theoretically give the same result since the static vs. dynamic distinction (moving single atom vs. static protein with same coords as that moving single atom) doesn't matter.

sseyler commented 7 years ago

I'm happy to help generalize the code to work for N dimensions; the limitation to 3D coordinates was indeed due to the fact that MD trajectories (e.g., PDB files) have 3D coordinate information. As @tylerjereddy said, it makes more sense in the context of scipy to support N-dimensional Frechet distance measurements. However, the more general case would still be useful if one wanted, for instance, to measure the Frechet distance between trajectories in a collective variable space.

cjekel commented 7 years ago

Just to clarify 2D data is supported, as mentioned in discrete_frechet() docstring

    #:meth:`MDAnalysis.core.AtomGroup.AtomGroup.coordinates`). *P* (*Q*) has
    #either shape |3Dp| (|3Dq|), or :|2Dp| (|2Dq|) in flattened form.

2D data works as expected if I format the example correctly. Either as (100,1,3) or (1,100,3). New correctly formatted example.

# Example: Does shape (100,1,3) gives same results as (100,1,2) ?
# I confirmed that  (1,100,3) == (1,100,2) , but not shown here
import numpy as np
import MDAnalysis.analysis.psa as psa

n=100
x = np.random.random(n)
y = np.random.random(n)
xN = np.random.random(n)
yN = np.random.random(n)

#   2D example
P = np.zeros([n,1,2])
Q = np.zeros([n,1,2])
P[:,0,0] = x
P[:,0,1] = y
Q[:,0,0] = xN
Q[:,0,1] = yN

FD2d = psa.discrete_frechet(P,Q)

#   3D example
P = np.zeros([n,1,3])
Q = np.zeros([n,1,3])
P[:,0,0] = x
P[:,0,1] = y
Q[:,0,0] = xN
Q[:,0,1] = yN

FD3d = psa.discrete_frechet(P,Q)

print('2D Frechet Distance:',FD2d)
print('3D Frechet Distance:',FD3d)

Example output: ('2D Frechet Distance:', 0.67016263500103246) ('3D Frechet Distance:', 0.67016263500103246)

So I mislabeled this as a 2D vs 3D problem.

The real issue

Should the Frechet distance of (1,100,3) be equivalent to (100,1,3) ?

I expect them to be the same, but this is not the case.

import numpy as np
import MDAnalysis.analysis.psa as psa

n=100
x = np.random.random(n)
y = np.random.random(n)
xN = np.random.random(n)
yN = np.random.random(n)

#   Dynamic example
P = np.zeros([n,1,3])
Q = np.zeros([n,1,3])
P[:,0,0] = x
P[:,0,1] = y
Q[:,0,0] = xN
Q[:,0,1] = yN

dynamic = psa.discrete_frechet(P,Q)

#   static example
P = np.zeros([1,n,3])
Q = np.zeros([1,n,3])
P[0,:,0] = x
P[0,:,1] = y
Q[0,:,0] = xN
Q[0,:,1] = yN

static = psa.discrete_frechet(P,Q)

print('Dynaimc (100,1,3) Frechet Distance:',dynamic)
print('Static (1,100,3) Frechet Distance:',static)

Example code output: ('Dynaimc (100,1,3) Frechet Distance:', 0.78808681500661659) ('Static (1,100,3) Frechet Distance:', 0.55548976903365455)

cjekel commented 7 years ago

Digging deeper.

Recursive function calls hurt my head. I modified psa.py to keep track of the number of recursive calls.

In my above example, the shape of (100,1,3) calls c(i,j) 29602 times (recursively). While the shape of (1,100,3) only calls c(i, j) 1 time.

*edit: I just wanted to say thanks to everyone working on this. I became aware of MDAnalysis through @tylerjereddy 's recent 2017 PyCon talk https://www.youtube.com/watch?v=ETJc3NfU9aA which referenced the PSA paper by @sseyler .

tylerjereddy commented 7 years ago

There is support for 2D numpy array input with shape: (num_frames, 3 * n_atoms) that contains flattened [x1,y1,z1, ..., xN,yN,zN] data, but you should still have all 3 Cartesian dimensions.

It is effectively just a happy accident that when you use 3D numpy array input (n_frames, n_atoms, n_dimensions), there just so happens to be a correct result when n_dimensions is 2 instead of 3, and looking at the docstring for get_coord_axes, one of the major functions used in the Frechet calculation:

def get_coord_axes(path):
    """Return the number of atoms and the axes corresponding to atoms
    and coordinates for a given path.

    The `path` is assumed to be a :class:`numpy.ndarray` where the 0th axis
    corresponds to a frame (a snapshot of coordinates). The :math:`3N`
    (Cartesian) coordinates are assumed to be either:

    1. all in the 1st axis, starting with the x,y,z coordinates of the
       first atom, followed by the *x*,*y*,*z* coordinates of the 2nd, etc.
    2. in the 1st *and* 2nd axis, where the 1st axis indexes the atom
       number and the 2nd axis contains the *x*,*y*,*z* coordinates of
       each atom.

If you had decided to use a condensed 2D numpy array with 2D coord data you would get the wrong answer, but a 3D numpy array with 2D coord data just happens to work. You can see where the accounting would go wrong with a 2D numpy array + 2D coord data by looking at how atoms are accounted for in the source of the above function:

    path_dimensions = len(path.shape)
    if path_dimensions == 3:
        N = path.shape[1]
        axis = (1,2) # 1st axis: atoms, 2nd axis: x,y,z coords
    elif path_dimensions == 2:
        # can use mod to check if total # coords divisible by 3
        N = path.shape[1] / 3
        axis = (1,) # 1st axis: 3N structural coords (x1,y1,z1,...,xN,xN,zN)
    else:
        err_str = "Path must have 2 or 3 dimensions; the first dimensions (axis"\
                + " 0) must correspond to frames, axis 1 (and axis 2, if"       \
                + " present) must contain atomic coordinates."
        raise ValueError(err_str)
return N, axis

Notice how the numpy array data shape affects the accounting for the number of atoms in a way that assumes 3 Cartesian dimensions. Yes, you can circumvent this by using a 3D numpy array where no quotient of 3 is applied, but since both input formats are supposed to be interchangeable that's hardly something that would strike me as "intended support." 2D in the discrete Frechet docstring refers to the numpy array shape, not the number of Cartesian dimensions in the coordinates, which is admittedly somewhat confusing.

I'll try to take a look at the 'real issue' described.

tylerjereddy commented 7 years ago

The static vs. dynamic issue perhaps means that one of the approaches isn't supported (though the basis for the distinction isn't clear to me either). Intuitively (and I think closer to the docstrings) we're trying to aim at dynamics I think, but I admit I don't immediately see the limitation in the source code, if it is there. One suggestion I have is to write some simple unit tests to probe this issue a bit further to figure out what is going on--for example, two concentric circles have an obvious Frechet distance (delta radii) and could perhaps be used instead of random numbers so that you have an actual ground truth.

I suspect the @MDAnalysis/coredevs would welcome a PR with some unit testing code that clears this up & perhaps guides some refinement of docstrings if needed, etc. For a starting point, the one test I wrote for Frechet with concentric circles probably suffers from the same confusion about 2D / 3D arrays discussed here (but it still passes...), and the testing should likely be expanded to include both 2D & 3D numpy array shapes for 3D input coords.

I'd certainly agree that this can be confusing.

cjekel commented 7 years ago

Okay. So I think my confusion originates from how discrete_frechet handles N atoms. Whenever I input 2 or more atoms for P and/or Q I get weird results. However the Frechet distance works as expected when I use discrete_frechet to calculate the distance between two paths for a single atom.

I created an example demonstrating this behavior. P represents the original paths of two atoms. Q represents the modified path of two atoms. Calling discrete_frechet(P,Q) gives me a value in between the offsets of the two atoms. Calling discrete_frechet(P[:,0,:],Q[:,0,:]) give me the expected distance between paths for atom 0. Calling discrete_frechet(P[:,1,:],Q[:,1,:]) give me the expected distance between paths for atom 1.

import numpy as np
import MDAnalysis.analysis.psa as psa
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

#   same radius offset, different Z offset, 2 atoms each

n=100 #  nubmer of arbitraty time events

thetaMin = 0.0 # min theta for circle path
thetaMax = np.pi*0.5 # max theta for circle path
#   constants for manipulation
#   z -> constant Z offset of path
#   r -> Raidius of path
z0 = -5.0
z1 = 0.0
r0 = 1.0
r1 = 10.0

nAtoms = 2  # number of atoms
#   initite P and Q as zeros
P = np.zeros([n,nAtoms,3]) # path 0
Q = np.zeros([n,nAtoms,3]) # path 1

#   propgate theta from min to max
thetas = np.linspace(thetaMin,thetaMax,n)

#   X values
P[:,0,0] = np.cos(thetas)*r0 # atom 0 path 0
P[:,1,0] = np.cos(thetas)*r1 # atom 1 path 0
Q[:,0,0] = np.cos(thetas)*r0 # atom 0 path 1
Q[:,1,0] = np.cos(thetas)*r1 # atom 1 path 1

#   Y values
P[:,0,1] = np.sin(thetas)*r0 # atom 0 path 0
P[:,1,1] = np.sin(thetas)*r1 # atom 1 path 0
Q[:,0,1] = np.sin(thetas)*r0 # atom 0 path 1
Q[:,1,1] = np.sin(thetas)*r1 # atom 1 path 1

#   Z Values
P[:,:,2] = z0
P[:,0,2] = z0-5.0 # set atom 0, path 0 5.0 bellow atom 1
Q[:,:,2] = z1

#   plot the two atoms and their paths
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.plot(P[:,0,0], P[:,0,1], P[:,0,2], label='Atom 0, Path 0')
ax.plot(Q[:,0,0], Q[:,0,1], Q[:,0,2], label='Atom 0, Path 1')

ax.plot(P[:,1,0], P[:,1,1], P[:,1,2], label='Atom 1, Path 0')
ax.plot(Q[:,1,0], Q[:,1,1], Q[:,1,2], label='Atom 1, Path 1')

ax.legend()
plt.show()

d0 = psa.discrete_frechet(P,Q)
d1 = psa.discrete_frechet(P[:,0,:], Q[:,0,:])
d2 = psa.discrete_frechet(P[:,1,:], Q[:,1,:])

print('Path offset atom 0:', np.abs(P[0,0,2]-Q[0,0,2]))
print('Path offset atom 1:', np.abs(P[0,1,2]-Q[0,1,2]))
print('psa.discrete_frechet(P,Q) ', d0)
print('psa.discrete_frechet(P[:,0,:], Q[:,0,:]) ', d1)
print('psa.discrete_frechet(P[:,1,:], Q[:,1,:]) ', d2)

('Path offset atom 0:', 10.0) ('Path offset atom 1:', 5.0) ('psa.discrete_frechet(P,Q) ', 7.9056941504209481) ('psa.discrete_frechet(P[:,0,:], Q[:,0,:]) ', 10.0) ('psa.discrete_frechet(P[:,1,:], Q[:,1,:]) ', 5.0)

With these results, it doesn't appear that N atoms are supported. Rather discrete_frechete should only be calculated for 1 atom at a time.

@sseyler Is my understanding correct? If so I'd be happy to propose a modified docstring to clarify all of this, while providing an additional super simple example.

sseyler commented 7 years ago

@cjekel Thanks a lot for the diagnosis. I see where your confusion is coming from, and I'm going to go through the docs to get a better idea of what can be changed for clarity.

The main thing to keep in mind is that the 0th NumPy "coordinate" (i.e., dimension) corresponds to the number of frames, while the 1st (and 2nd, if present) dimensions correspond to atomic coordinates. The idea is the the NumPy array dimensions should go in order of increasing "resolution", which is why the frames come first, while the individual (x, y, z) coordinates come last. Since the Frechet distance depends on the manner in which the paths (i.e., frames in a trajectory) are traversed, the comparison between a 100-atom, single frame trajectory and 100-frame, single atom should indeed be different, at least in the general case. Only the 2nd NumPy dimension (if present) should be combined with the 1st in the case of "flattening" the atomic coordinates; the number of frames, P (or Q) needs to be specified independently of the total number of atomic coordinates:

N x D,

where N is the number of atoms and D is the number of dimensions in which each atom resides.

However, given only paths consisting of a single frame, we are dealing with a total number of coordinates (N x D) that depends only on the number of atoms, N, and the number of dimensions, D, in which each atom resides. Since the Frechet (and Hausdorff) distance requires a "point-wise" metric that defines a notion of distance between individual atoms—we use the RMSD, which averages the Euclidean distance of N atoms—results will differ in cases where the 1st (and 2nd) dimensions don't match the total number of atomic coordinates (N x D). This is because, for instance, the RMSD between 3D atomic structures is computed as Euclidean distance of 3N coordinates (x, y, and z for each atoms) averaged over N atoms; for 2D, the average would still be over N atoms, but there are only x and y coordinates for each atom. Thus results for a 2D path will only be the same as those for a degenerate 3D path if the z coordinate, for example, is explicitly as "0" for each atom.

I'll respond soon with more information that will hopefully clarify how the input "paths" should be structured. Thanks again for the input!

sseyler commented 7 years ago

@cjekel So, I took your code and ran it to get a better feel for the (atomic) paths you constructed. Just using the 3D plot of the paths (without looking at the code), I calculated the values I expected for discrete Frechet (that I'm calling df below) between:

1) Path P and Q for atom 0 only: df(P[:,0,:], Q[:,0,:]) --> sqrt((-10 - 0)**2 / 1) = sqrt(100) = 10 2) Path P and Q for atom 1 only: df(P[:,1,:], Q[:,1,:]) --> sqrt((-5 - 0)**2 / 1) = sqrt(25) = 5 1) Path P and Q for atoms 0 and 1: df(P, Q) --> sqrt( ((-10 - 0)**2 + (-5 - 0)**2) / 2) = sqrt( (100 + 25) / 2 ) ~ 7.91

Note that in (1) and (2) above, I'm dividing by N = 1 inside the square root, because the rmsd (our default point-wise metric) will be calculated as an average over one atom. In (3), where we calculate Frechet for 2-atom paths, I divide by N = 2 inside the square root, because the average is over 2 atoms. If, for instance, we used the Euclidean distance as our point-wise metric instead, we would get the same results for (1) and (2), but for (3) we would get sqrt(125) ~ 11.18 (since there is no division by 2 anymore).

These are the same results that you got above (and that I got after running your code), so I think there might be a discrepancy between the correct results and the results you're expecting? Let me know if this helps clear anything up! I'm still thinking about the docs in the meantime.

cjekel commented 7 years ago

Thanks for the clarification!

I totally missed the explanation of RMSD, as stated in top of psa.py

The distance between these points of maximal deviation is measured by the root mean square deviation (RMSD), i.e., to compute structural similarity.

I would like to suggest a change to the docstring to clarify this. See 6d4c609c064dc4f6cffceb77967afbc92a9d5c65

@sseyler 1. Does this suggestion improved the documentation? 2. Would the Hausdorff distance benefit from a similar additions?

tylerjereddy commented 7 years ago

I wonder if it would eventually make sense to support user-specified functions for the 'distance' in each frame. For example, RMSD might be the default, but having the option of a custom function might enable some useful things.

sseyler commented 7 years ago

@cjekel You're welcome!

I would like to suggest a change to the docstring to clarify this. See 6d4c609

I definitely agree that a bit of clarification would help here. However, the docstring you suggested isn't quite right:

+    - When only one atom is considered in *P* and *Q*, discrete_frechet returns
+      the discrete Fréchet distance between *P* and *Q* for the atom.
+    - When two or more atoms are considered, discrete_frechet first calculates
+      `N` discrete Fréchet distances for the `N` atoms. Then discrete_frechet
+      returns the root mean square of the discrete Fréchet distances.

Assuming the atomic coordinates are specified correctly (as discussed above), the discrete_frechet() function will compute the Fréchet distance for single-atom paths in the same way as for multi-atom paths. The RMSD is used in both cases as the underlying metric for comparing frames (i.e., getting the RMSD between the two conformations from a pair of frames); it's just that in the single-atom case, the RMSD reduces to the Euclidean distance.

Just to help clarify some key details, here's a schematic of the man-leash-dog analogy that I'll reference below (I hope I don't end up being too pedantic):

The dotted lines (i.e., the leash lengths) between various points on the red/blue paths in the diagram are really just the individual RMSDs that we're calculating between frames. (Note that we could, however, use a metric other than the RMSD, like the Euclidean distance, to measure between-frame distances.)
For the discrete Fréchet algorithm as implemented in psa.py, we first compute all of the RMSDs between the frames in path 1 and those in path 2. So, given a path P with length (# of frames) Np and path Q with length Nq, the total number of RMSD calculations will be Np x Nq, regardless of the number of atoms, which we store as a 2D array of RMSDs (the "RMSD matrix"). (To be precise, we actually calculate/use the MSD matrix—we wait until the very end to take the square root to avoid computing it for the entire matrix—but the idea is the same.) To get the Fréchet distance, we first need to find an "optimal" man/dog traversal, which is the one that requires the shortest leash that we'd need to traverse the two paths from beginning to end, legally—sequentially visiting each point, at least once, in each path (i.e., without stepping backward). As mentioned above, the "leash lengths" for MD trajectories (paths) are just the between-frame RMSD measurements which we catalog in an Np x Nq RMSD matrix.
The recursive aspect of the discrete Fréchet algorithm (which I admit is somewhat confusing) is probably best thought of as a means to exhaustively check every possible legal traversal (a coupling) for a given pair of paths. Each of these couplings can be thought of as the sequence of leash-length measurements between the man/dog as they hop along the frames; in our case, the "leash length sequence" is simply a sequence between-frame RMSD measurements. Then, each coupling is assigned a coupling length, which is the largest RMSD value (i.e., longest leash length) in the sequence/coupling; alternatively, we can think of the longest leash in a coupling as being smallest leash we could get away with while still traversing the paths unimpeded—any shorter and the man/dog would get "snagged" by the leash at one of those corners! Once a coupling length is assigned to every possible coupling, the Fréchet distance is taken to be the smallest coupling length (the shortest leash that can be used).
Since each coupling length is just an RMSD measurement between some pair of frames across a pair of paths, the Fréchet distance will be exactly equal to at least one of their between-frame RMSD values (all of which were initially calculated and stored in an RMSD matrix for those two paths).

Does this suggestion improved the documentation? 2. Would the Hausdorff distance benefit from a similar additions?

In light of the above, we should update your docstring suggestion to:

Remove the distinction drawn between single-atom and multi-atom paths
Explain how the Fréchet (and Hausdorff!) metric:
- requires an underlying norm to measure distances between individual frames (conformations) across a pair of paths (the point-wise metric, as we call it elsewhere)
- uses the RMSD as the point-wise metric—by default, though other metrics could be substituted—to generate an MSD matrix for a pair of paths (from which the RMSD will be computed by taking the square root at the end of the calculation).
- selects a single, representative MSD value from the MSD matrix, for a given pair of paths, to also be the Fréchet (or Hausdorff) distance for those the paths—in other words, decides upon a single between-frame MSD measurement that best represents the distance between the paths (what is "best" is determined/defined by the path metric being used).
Ensure the Hausdorff docstring (and any others) is consistent with any changes/improvements we make to the Fréchet docstring.

Phew, sorry for the long post... I hope this is helpful! Let me know what your thoughts are.

sseyler commented 7 years ago

@tylerjereddy

I wonder if it would eventually make sense to support user-specified functions for the 'distance' in each frame. For example, RMSD might be the default, but having the option of a custom function might enable some useful things.

I've been wanting for some time to add a keyword argument to the path metric functions that can be used to specify the point-wise metric—just like what's done in psa.PSAnalysis.run(), where the metric keyword can be used to select one of the built-in path metrics with a string name or given a Python/Cython function to specify a custom, user-defined path metric.

orbeckst commented 7 years ago

@cjekel , this looked like a productive discussion. Have your questions/concerns been answered?

I don't see any actionable request or bug here so I am closing the issue. Please open new issues with specific requests.

Many thanks!

cjekel commented 7 years ago

Thank you @sseyler the long discussion is helpful and provides detail on the method . Sorry @orbeckst for leaving this open for a long time.

Should the use of multiple paths be explicitly forbidden, or parse a warning? Since the Frechet distance is only between two paths? I believe this was the key area of my confusion.

In https://github.com/MDAnalysis/mdanalysis/issues/1385#issuecomment-307178046 two paths are supplied to P and two different paths are supplied to Q. It seems that the appropriate usage is to have a single path in P and Q.

orbeckst commented 7 years ago

@sseyler is better input checking that would make sense?

sseyler commented 7 years ago

@cjekel

Should the use of multiple paths be explicitly forbidden, or parse a warning? Since the Frechet distance is only between two paths? I believe this was the key area of my confusion.

I'm a bit confused as to what you mean by "multiple paths", but I think I might understand where your confusion is arising.

In #1385 (comment) two paths are supplied to P and two different paths are supplied to Q. It seems that the appropriate usage is to have a single path in P and Q.

In your code that you referenced, you're explicitly populating the x, y, and z values of each atom in each path. Then, when you generate your plot (your code repeated below),

ax.plot(P[:,0,0], P[:,0,1], P[:,0,2], label='Atom 0, Path 0')
ax.plot(Q[:,0,0], Q[:,0,1], Q[:,0,2], label='Atom 0, Path 1')

ax.plot(P[:,1,0], P[:,1,1], P[:,1,2], label='Atom 1, Path 0')
ax.plot(Q[:,1,0], Q[:,1,1], Q[:,1,2], label='Atom 1, Path 1')

you're actually plotting the separate Cartesian path of each individual atom in each path, giving what seems to be four paths (i.e., each line corresponds to an individual atom's "path" in 3D Cartesian space). So far, so good, though this is where things get a little confusing:

Say we have N atoms comprising a large molecule that's undergoing some jiggly motion. We can imagine tracking the motion, as a function of time, of each individual atom through 3D space, which we could represent as a 3D plot with N separate paths, one path per atom. The problem is that it's difficult to compare two sets of N atomic paths each; as such, in PSA, we do not view the motion of each atom as a separate path, but instead view the motion of all the atoms in the system as a single path in configuration space that describes the entire collective motion. The configuration space of our molecule will have 3N dimensions, which captures all of the changing atomic positions (over time) for the entire molecule as a single path in this space. The abstraction of the problem through the use of configuration space still contains all of the same (position) information, but it avoids the difficulties in dealing with all of those separate atomic trajectories in coming up with a way to measure similarity! Thus, in PSA, we are measuring distances using, for instance, the Fréchet metric, between 3N-dimensional paths (in configuration space) rather than collections of N individual atomic paths (in 3D space).

So, in your example above, with path P and Q each having two atoms, the configuration space is 6-dimensional and cannot be directly plotted. Each frame (first index) in P will have 6 corresponding spatial coordinates specified—for each (of the two atoms) specified by the second index in P, the current position in 3D space is given by the third index; alternatively, you could specify the full set of atomic coordinates with just the second index by concatenating the positions in a single vector:

[ x1, y1, z1, x2, y2, z2 ]

instead of

[ [ x1, y1, z1 ],
  [ x2, y2, z2 ] ]

In this case, path P (all frames) would have "flattened" shape (Nframes, 3*Natoms) = (100, 6) compared to the representation with (Nframes, Natoms, 3) = (100, 2, 3). To measure a distance between P and Q, we just use the 6D path P and the 6D path Q as inputs to the Fréchet metric. It should be easy to see how this generalizes to the N-atom case.

Let me know if this helps!

cjekel commented 7 years ago

@sseyler Thank you. I wasn't understanding how N atomic paths were handled, and your last post explained it well.

The problem is that it's difficult to compare two sets of N atomic paths each; as such, in PSA, we do not view the motion of each atom as a separate path, but instead view the motion of all the atoms in the system as a single path in configuration space that describes the entire collective motion. The configuration space of our molecule will have 3N dimensions, which captures all of the changing atomic positions (over time) for the entire molecule as a single path in this space.

Ah I see, that's a clever way to deal with that problem! I would have not thought of increasing the dimensions. It's a pretty cheap way to measure the similarity between simulations.

Thus, in PSA, we are measuring distances using, for instance, the Fréchet metric, between 3N-dimensional paths (in configuration space) rather than collections of N individual atomic paths (in 3D space).

This is a great way to deal with the problem.

Something to consider (if you haven't already), that this implementation may be sensitive to outliers. So there may be a case where one atoms throws off the Discrete Frechet distance, despite the simulation being relatively close. Here is some code to illustrate this point.

import numpy as np
from MDAnalysis.analysis import psa

n=100 #  nubmer of arbitraty time events
nAtoms = 10  # number of atoms

#   initite P, Q and R as zeros
P = np.zeros([n,nAtoms,3])
Q = np.zeros([n,nAtoms,3])
R = np.zeros([n,nAtoms,3])

#   atom locations
x = np.linspace(0,1,n)
y = np.linspace(0,1,n)
z = np.linspace(0,1,n)

#   let P be an original simulation
for i in range(0,nAtoms):
    P[:,i,0] = x
    P[:,i,1] = y
    P[:,i,2] = z

#   let Q be a simulation, where 1 atom's 'z' direction is out of whack by 3.0
Q = P.copy()
Q[:,9,2] = Q[:,9,2] + 3.0

#   Let R be a simulation, where all atom paths are 0.5 off of P
R = P.copy()
R[:,:,0] = R[:,:,0]+0.5
R[:,:,1] = R[:,:,1]+0.5
R[:,:,2] = R[:,:,2]+0.5

#   compute the discrete frechet distance
d0 = psa.discrete_frechet(P,Q)
d1 = psa.discrete_frechet(P,R)
print('psa.discrete_frechet(P,Q) ', d0)
print('psa.discrete_frechet(P,R) ', d1)

def dfForEachAtom(N,M, nAtoms):
    #   compute the DF for each atom
    df = []
    for i in range(0,nAtoms):
        df.append(psa.discrete_frechet(N[:,i,:], M[:,i,:]))
    return np.array(df)

#   compute df for each atom, then sum all of the discrete frechet distances
df0 = dfForEachAtom(P,Q,nAtoms)
df1 = dfForEachAtom(P,R,nAtoms)
print('sum of each atoms discrete frechet between P and Q', np.sum(df0))
print('sum of each atoms discrete frechet between P and R', np.sum(df1))

('psa.discrete_frechet(P,Q) ', 0.94868329805051377) ('psa.discrete_frechet(P,R) ', 0.8660254037844386) ('sum of each atoms discrete frechet between P and Q', 3.0) ('sum of each atoms discrete frechet between P and R', 8.6602540378443855)

In this case using your implementation would say that R is closer to P, than Q is to P. However summing the Discrete Frechet distances for each atom would suggest that Q is closer to P, than R is to P. Neither method is more correct than the other, rather the two approaches just have different mathematical meanings.

To measure a distance between P and Q, we just use the 6D path P and the 6D path Q as inputs to the Fréchet metric. It should be easy to see how this generalizes to the N-atom case.

Would it ever make sense to compare 1 atom of P to 2 atoms of Q? There would be no physical meaning of this, correct? So a 3D input of P and a 6D input of Q? If this is a problem, the code should either strictly enforce (or pass a warning) that P and Q be the same shape.

I'd like to make suggestions to the docstring that reflect this discussion. I think that RMSD and 3N dimensions could be stated in the notes. I'll revert my previous changes and provide a few suggestions.

Thanks for the explanation, I find this work particularly interesting.

orbeckst commented 7 years ago

Something to consider (if you haven't already), that this implementation may be sensitive to outliers. So there may be a case where one atoms throws off the Discrete Frechet distance, despite the simulation being relatively close. Here is some code to illustrate this point.

This is well-known for these types of metrics. We argue that you would want to know about outliers.

In the PLOS Comp Biol paper we also look at other distance functions that do some averaging and which are less susceptible to outliers but @sseyler could prove that they are not metrics, which makes them less useful for clustering.

sseyler commented 7 years ago

Would it ever make sense to compare 1 atom of P to 2 atoms of Q? There would be no physical meaning of this, correct? So a 3D input of P and a 6D input of Q? If this is a problem, the code should either strictly enforce (or pass a warning) that P and Q be the same shape.

No, they should have the same number of atomic coordinates. I'll see if some more checks can be added to raise appropriate warnings when necessary.

I'd like to make suggestions to the docstring that reflect this discussion. I think that RMSD and 3N dimensions could be stated in the notes. I'll revert my previous changes and provide a few suggestions.

Sounds good! Looking forward to your thoughts.

Thanks for the explanation, I find this work particularly interesting.

Glad I could help!

orbeckst commented 7 years ago

@cjekel can you please open a new issue for your request to improve the PSA docs? It would be super-helpful if you could summarize the points that you think should be improved or added. (Do reference this issue #1385 somewhere in your new issue to create a link). Thanks!