Center of mass/gravity for all residues of a ResidueGroup

mimischi commented 7 years ago

Expected behaviour

Calling the methods center_of_mass or center_of_gravity on a ResidueGroup I would expect to retrieve the CoM/CoG of each residue in an array:

In [1]: import MDAnalysis as mda

In [2]: from MDAnalysis.tests.datafiles import PSF, DCD

In [3]: u = mda.Universe(PSF, DCD)

In [4]: protein = u.select_atoms('protein')

In [5]: protein.residues.center_of_mass()
Out [5]: [ 10.52567673   9.49548312  -8.15335145]
[ 13.90618512   4.59222315  -7.35309049]
[ 8.12902093  6.03077052 -3.66130656]
....
[  7.76611535  16.02359343   0.44614907]
[  7.75536069  14.75541562  -4.00511608]
[  6.74699649  18.21034743  -6.73924321]

Actual behaviour

These methods return the CoM/CoG of the whole selection, just like calling protein.center_of_mass() or protein.atoms.center_of_mass():

In [1]: import MDAnalysis as mda

In [2]: from MDAnalysis.tests.datafiles import PSF, DCD

In [3]: u = mda.Universe(PSF, DCD)

In [4]: protein = u.select_atoms('protein')

In [5]: protein.residues.center_of_mass()
Out [5]: array([-0.01094035,  0.05727601, -0.12885778])

Current solution

This works, but seems somehow slow when called for each frame in a bigger system.

for res in protein.residues:
    print res.center_of_mass()

Currently version of MDAnalysis:

0.15.0

orbeckst commented 7 years ago

@richardjgowers / @dotsdl how are we handling ResidueGroup.center_of_mass() in #363?

richardjgowers commented 7 years ago

So currently *Group.center_of_mass == *Group.atoms.center_of_mass() like in develop. It could be changed however..

orbeckst commented 7 years ago

There was some discussion in #411 on how "aggregate" functions should work at the SegmentGroup and ResidueGroup level. I think we concluded that singular method such as g.center_of_mass() should always be equivalent to g.atoms.center_of_mass().

I don't think we introduced "plural" methods in #363 such as g.centers_of_mass() that would do

np.array([member.center_of_mass() for member in g])

Admittedly, having the plural s inside the method name as I wrote here is pretty confusing although grammatically correct. There was also the idea https://github.com/MDAnalysis/mdanalysis/issues/385#issuecomment-136579006 to use something like ResidueGroup.byres("center_of_mass") to force a calculation on a per-residue basis.

In all cases, we would need to think if there are additional performance enhancements that we could do (in addition to the syntactic sugar).

@mimischi , what's your primary concern: performance or ease of writing the operation?

mimischi commented 7 years ago

@orbeckst I was wondering about the performance.

Actually I want to calculate the distance between the center of mass of residues and a reference point at every time frame. Doing that loop, as mentioned above, and going over every residue to retrieve the value just seems like a slow task.

orbeckst commented 7 years ago

Actually I want to calculate the distance between the center of mass of residues and a reference point at every time frame. Doing that loop, as mentioned above, and going over every residue to retrieve the value just seems like a slow task.

At the moment I can't think of a nice way to calculate the C.O.M. for a bunch of residues without a loop.

residue_coms =  np.array([r.atoms.center_of_mass() for r in u.select_atoms("protein").residues])

(and this would probably be the under-the-hood implementation of the requested feature). Btw, you might be able to speed-up what you showed by using a list comprehension instead of an explicit for loop.

tylerjereddy commented 7 years ago

I think the problem is a fair bit more tractable when all residues have the same shape (number of atoms). If we simplify for the case of the center of geometry (centroid) of each residue for an atomgroup with 12 residues, each of which has 10 atoms, then I suspect we'd want a data structure with this shape: (12, 10, 3). Then, I believe it should be possible to use a standard numpy vectorized ufunc on the appropriate axis to determine the means (centroids). For the center of mass, I suspect we'd just add (one more?) vectorized operation that (multiplies by?) includes the weights for each atom in an array of the same shape.

Ok, so what about for real data with heterogenously-sized residues (amino acids, lipids, etc.)? I suspect that we could simply NaN-fill to produce a homogenous array. Generally, the fast vectorized numpy operations will not work on jagged arrays (which are typed as object). Once the NaN filling is done, we can repeat the above steps for the homogenous array, as numpy tends to gracefully handle / ignore NaN for many array operations & so combining the weights and masses with the appropriate axes / slicing shouldn't be too bad.

The only thing I don't like about this approach is that we need to know the maximum number of atoms for any residue in the given atomgroup so that we can NaN-fill appropriately. In a biomolecular context, I suspect we might get away with a user-adjustable default value that exceeds any reasonable atom count for a conventional amino acid, lipid, or nucleotide / glycan. I wonder if there's a fast way that MDAnalysis could parse this out during i.e., universe / topology initialization -- the other core devs would likely know more about this. That would probably allow automated NaN-filling in a less-awkward / more robust & topology-aware manner.

richardjgowers commented 7 years ago

Ragged arrays aren't too impossible with a few extra arrays

# The contents of each Group/Residue
g1 = [10, 12, 13, 14]
g2 = [51, 61, 71]
g3 = [8, 9, 10]

# The size of each group
sizes = [4, 3, 3]
# The contents of each group concatenated
identities = [10, 12, 13, 14, 51, 61, 71, 8, 9, 10] 

offset = 0
for s in sizes:  # loop over groups
    for i in range(s):  # loop over atoms in this group
        atom = identities[offset + i]
    offset += s

tylerjereddy commented 7 years ago

Heterogenous arrays aren't impossible at all--I'm just targeting raw performance. That's a nested for loop in pure python, so I'm not sure it would fare much better than the original list comprehension approach. Could be cythonized though, I suppose.

Even with the Nan-fill approach, I'm still thinking that a bit of creativity would be needed to avoid any looping. I guess you'd preallocate an empty array of the appropriate shape for n_residues and then somehow assign / add in the heterogenous coordinates in such a way that the residue atoms don't overlap. But the original and target arrays have a different shape, so that may be trickier than I thought.

orbeckst commented 7 years ago

The NaN filling sounds complicated and a bit brittle – aren't glycans single residues? Lipids are, and you can have big lipids.

I would try a hybrid approach for residues: You could group all residues with same number of atoms into arrays, remember the residue indices, work on these blocks, and then assemble results in the correct sequence.

For all waters this should give a good speed-up but even cutting down a protein with 500 residues into 20 blocks with 500/20 = 25 residues each you will probably see improvements.

For segments, where we typically only have O(1) - O(10) we can probably just do the list comprehension.

orbeckst commented 7 years ago

Admittedly, having the plural s inside the method name as I wrote here – ag.centers_of_mass() — is pretty confusing.

We can be fancy and use ag.barycenters() (W:barycenter) :-).

tylerjereddy commented 7 years ago

aren't glycans single residues?

Glycolipids are usually treated as a single residue in CG. I think that makes more sense than having the molecule split in two or more pieces topologically for various reasons (they are often parametrized as a custom-made unit, and we are often interested in their properties as a unit, etc.). I think for glycoproteins the glycans can be separate residues though, depending on the FF maybe.

The NaN filling sounds complicated and a bit brittle

Maybe. Hard to say without actually trying and comparing different approaches. I suspect there are tradeoffs in terms of performance gained and the assumptions you can introduce. A more brittle solution might be faster because it can make more assumptions or preallocate larger arrays for vectorization.

richardjgowers commented 7 years ago

https://gist.github.com/richardjgowers/0a63f12fa207f26de201e586ee22f4d7

@mimischi I put together this to see what the fastest we can do is. It takes ~3.5 ms, and just constructing the AtomGroup from each residue is 3 ms, so it looks like the bottleneck is there now.

orbeckst commented 7 years ago

How to do the barycenter when you have a block of identical residues, eg all TIP4P waters with 4 atoms each:

from MDAnalysisTests.datafiles import TPR, XTC
waters = u.select_atoms("resname SOL")

natoms = 4
barycenters = (waters.positions * waters.masses[:, np.newaxis]).reshape(-1, natoms, 3).mean(axis=1) 
   / waters.masses.reshape(-1, natoms).sum(axis=1)[:, np.newaxis]

MDAnalysis / mdanalysis