Closed mimischi closed 6 years ago
@richardjgowers / @dotsdl how are we handling ResidueGroup.center_of_mass() in #363?
So currently *Group.center_of_mass() == *Group.atoms.center_of_mass(), as in develop. It could be changed, however.
There was some discussion in #411 on how "aggregate" functions should work at the SegmentGroup and ResidueGroup level. I think we concluded that a singular method such as g.center_of_mass() should always be equivalent to g.atoms.center_of_mass().
I don't think we introduced "plural" methods in #363, such as g.centers_of_mass(), that would do
np.array([member.center_of_mass() for member in g])
Admittedly, having the plural s inside the method name as I wrote here is pretty confusing, although grammatically correct. There was also the idea (https://github.com/MDAnalysis/mdanalysis/issues/385#issuecomment-136579006) to use something like ResidueGroup.byres("center_of_mass") to force a calculation on a per-residue basis.
In all cases, we would need to consider whether there are additional performance enhancements that we could make (in addition to the syntactic sugar).
@mimischi, what's your primary concern: performance or ease of writing the operation?
@orbeckst I was wondering about the performance.
Actually I want to calculate the distance between the center of mass of residues and a reference point at every time frame. Doing that loop, as mentioned above, and going over every residue to retrieve the value just seems like a slow task.
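Once the per-residue centers of mass are available as an array, the distance step itself needs no loop. A minimal sketch, assuming the centers are already collected in an (n_residues, 3) array; the values and the names `residue_coms` and `reference` here are made up for illustration:

```python
import numpy as np

# Hypothetical per-residue centers of mass, shape (n_residues, 3)
residue_coms = np.array([[0.0, 0.0, 0.0],
                         [3.0, 4.0, 0.0],
                         [1.0, 2.0, 2.0]])
reference = np.array([0.0, 0.0, 0.0])  # assumed reference point

# Broadcasting subtracts the reference from every row; the norm is taken per row
distances = np.linalg.norm(residue_coms - reference, axis=1)
```

The expensive part is therefore producing `residue_coms` at every frame, not the distance calculation.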
At the moment I can't think of a nice way to calculate the C.O.M. for a bunch of residues without a loop.
residue_coms = np.array([r.atoms.center_of_mass() for r in u.select_atoms("protein").residues])
(and this would probably be the under-the-hood implementation of the requested feature). Btw, you might be able to speed up what you showed by using a list comprehension instead of an explicit for loop.
I think the problem is a fair bit more tractable when all residues have the same shape (number of atoms). If we simplify for the case of the center of geometry (centroid) of each residue for an atomgroup with 12 residues, each of which has 10 atoms, then I suspect we'd want a data structure with this shape: (12, 10, 3). Then, I believe it should be possible to use a standard numpy vectorized ufunc on the appropriate axis to determine the means (centroids). For the center of mass, I suspect we'd just add (one more?) vectorized operation that (multiplies by?) includes the weights for each atom in an array of the same shape.
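A minimal sketch of the equal-size case with plain numpy, assuming the atoms are ordered residue by residue so that a reshape groups them correctly. The random positions and masses are stand-ins, not real topology data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_residues, n_atoms = 12, 10

# Stand-in data: flat, residue-ordered arrays as an AtomGroup would provide them
positions = rng.random((n_residues * n_atoms, 3))
masses = rng.uniform(1.0, 16.0, n_residues * n_atoms)

pos = positions.reshape(n_residues, n_atoms, 3)  # (12, 10, 3)
m = masses.reshape(n_residues, n_atoms)          # (12, 10)

# Center of geometry: plain mean over the atom axis
centroids = pos.mean(axis=1)

# Center of mass: mass-weighted sum over atoms divided by total residue mass
coms = (pos * m[:, :, np.newaxis]).sum(axis=1) / m.sum(axis=1)[:, np.newaxis]
```

Both results have shape (12, 3), one row per residue.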
Ok, so what about real data with heterogeneously sized residues (amino acids, lipids, etc.)? I suspect that we could simply NaN-fill to produce a homogeneous array. Generally, the fast vectorized numpy operations will not work on jagged arrays (which are typed as object). Once the NaN filling is done, we can repeat the above steps for the homogeneous array, since numpy's NaN-aware reductions (np.nanmean, np.nansum) ignore NaN, so combining the weights and masses with the appropriate axes / slicing shouldn't be too bad.
The only thing I don't like about this approach is that we need to know the maximum number of atoms for any residue in the given atomgroup so that we can NaN-fill appropriately. In a biomolecular context, I suspect we might get away with a user-adjustable default value that exceeds any reasonable atom count for a conventional amino acid, lipid, or nucleotide / glycan. I wonder if there's a fast way that MDAnalysis could parse this out during, e.g., universe / topology initialization -- the other core devs would likely know more about this. That would probably allow automated NaN-filling in a less awkward / more robust & topology-aware manner.
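A rough sketch of the NaN-fill idea, assuming the per-residue atom counts are known (the `sizes` array below is a made-up example, and the positions and masses are random stand-ins). Here the padding width is simply taken from `sizes.max()` rather than from a user-set default, and a boolean mask places the atoms without a Python loop:

```python
import numpy as np

sizes = np.array([4, 3, 3])  # hypothetical atoms per residue
rng = np.random.default_rng(0)
positions = rng.random((sizes.sum(), 3))        # flat, residue-ordered
masses = rng.uniform(1.0, 16.0, sizes.sum())

max_size = sizes.max()
# True where a slot holds a real atom, False where it is padding
mask = np.arange(max_size) < sizes[:, np.newaxis]   # (n_residues, max_size)

padded_pos = np.full((len(sizes), max_size, 3), np.nan)
padded_mass = np.full((len(sizes), max_size), np.nan)
padded_pos[mask] = positions    # fills row by row, matching the flat order
padded_mass[mask] = masses

# NaN-aware reductions skip the padding slots
centroids = np.nanmean(padded_pos, axis=1)
weighted = np.nansum(padded_pos * padded_mass[:, :, np.newaxis], axis=1)
coms = weighted / np.nansum(padded_mass, axis=1)[:, np.newaxis]
```

The memory cost scales with n_residues * max_size, so a single very large residue inflates the padded arrays for everything else.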
Ragged arrays aren't too impossible with a few extra arrays
# The contents of each Group/Residue
g1 = [10, 12, 13, 14]
g2 = [51, 61, 71]
g3 = [8, 9, 10]
# The size of each group
sizes = [4, 3, 3]
# The contents of each group concatenated
identities = [10, 12, 13, 14, 51, 61, 71, 8, 9, 10]
offset = 0
for s in sizes:  # loop over groups
    for i in range(s):  # loop over atoms in this group
        atom = identities[offset + i]
    offset += s
Heterogeneous arrays aren't impossible at all; I'm just targeting raw performance. That's a nested for loop in pure Python, so I'm not sure it would fare much better than the original list comprehension approach. It could be cythonized though, I suppose.
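For comparison, the same sizes/offsets bookkeeping can be handed to numpy instead of a nested Python loop. This is only a sketch, assuming every group has at least one atom (np.add.reduceat gives misleading results for empty segments); the data is random stand-in data:

```python
import numpy as np

sizes = np.array([4, 3, 3])                              # atoms per group
offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))   # start index of each group

rng = np.random.default_rng(0)
positions = rng.random((sizes.sum(), 3))                 # concatenated, group-ordered
masses = rng.uniform(1.0, 16.0, sizes.sum())

# Segmented sums: each output row sums the atoms from one offset up to the next
weighted_sums = np.add.reduceat(positions * masses[:, np.newaxis], offsets, axis=0)
total_masses = np.add.reduceat(masses, offsets)

coms = weighted_sums / total_masses[:, np.newaxis]       # (n_groups, 3)
```

No padding is needed here, since the ragged structure lives entirely in the `offsets` array.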
Even with the NaN-fill approach, I'm still thinking that a bit of creativity would be needed to avoid any looping. I guess you'd preallocate an empty array of the appropriate shape for n_residues and then somehow assign / add in the heterogeneous coordinates in such a way that the residue atoms don't overlap. But the original and target arrays have a different shape, so that may be trickier than I thought.
The NaN filling sounds complicated and a bit brittle – aren't glycans single residues? Lipids are, and you can have big lipids.
I would try a hybrid approach for residues: You could group all residues with the same number of atoms into arrays, remember the residue indices, work on these blocks, and then assemble the results in the correct sequence.
For all waters this should give a good speed-up, and even when cutting down a protein with 500 residues into 20 blocks of 500/20 = 25 residues each, you will probably see improvements.
For segments, where we typically only have O(1) to O(10) of them, we can probably just do the list comprehension.
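A sketch of how the hybrid block approach could look in plain numpy. The function name `blocked_coms` and its arguments are hypothetical, not an MDAnalysis API; it assumes flat, residue-ordered positions and masses plus the atom count of each residue:

```python
import numpy as np

def blocked_coms(positions, masses, sizes):
    """Per-residue centers of mass, processing residues in equal-size blocks."""
    sizes = np.asarray(sizes)
    offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    coms = np.empty((len(sizes), 3))
    for n in np.unique(sizes):                 # loop over distinct sizes only
        res_idx = np.flatnonzero(sizes == n)   # residues that have n atoms
        # Atom indices for the whole block, shape (n_block, n)
        atom_idx = offsets[res_idx][:, np.newaxis] + np.arange(n)
        pos = positions[atom_idx]              # (n_block, n, 3)
        m = masses[atom_idx]                   # (n_block, n)
        # Write results back at the original residue positions
        coms[res_idx] = (pos * m[:, :, np.newaxis]).sum(axis=1) / m.sum(axis=1)[:, np.newaxis]
    return coms

# Made-up example: five residues with three distinct sizes
sizes = np.array([4, 3, 3, 4, 2])
rng = np.random.default_rng(0)
positions = rng.random((sizes.sum(), 3))
masses = rng.uniform(1.0, 16.0, sizes.sum())
coms = blocked_coms(positions, masses, sizes)
```

The Python-level loop runs once per distinct residue size rather than once per residue, so a box of identical waters is handled in a single vectorized pass.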
Admittedly, having the plural s inside the method name as I wrote here, ag.centers_of_mass(), is pretty confusing.
We can be fancy and use ag.barycenters() (W:barycenter) :-).
aren't glycans single residues?
Glycolipids are usually treated as a single residue in CG. I think that makes more sense than having the molecule split in two or more pieces topologically for various reasons (they are often parametrized as a custom-made unit, and we are often interested in their properties as a unit, etc.). I think for glycoproteins the glycans can be separate residues though, depending on the FF maybe.
The NaN filling sounds complicated and a bit brittle
Maybe. Hard to say without actually trying and comparing different approaches. I suspect there are tradeoffs in terms of performance gained and the assumptions you can introduce. A more brittle solution might be faster because it can make more assumptions or preallocate larger arrays for vectorization.
https://gist.github.com/richardjgowers/0a63f12fa207f26de201e586ee22f4d7
@mimischi I put together this to see what the fastest we can do is. It takes ~3.5 ms, and just constructing the AtomGroup from each residue is 3 ms, so it looks like the bottleneck is there now.
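How to do the barycenter when you have a block of identical residues, e.g. all TIP4P waters with 4 atoms each:
import numpy as np
import MDAnalysis as mda
from MDAnalysisTests.datafiles import TPR, XTC
u = mda.Universe(TPR, XTC)
waters = u.select_atoms("resname SOL")
natoms = 4  # atoms per water molecule (TIP4P)
# mass-weighted position sum per water, divided by each water's total mass
weighted = (waters.positions * waters.masses[:, np.newaxis]).reshape(-1, natoms, 3).sum(axis=1)
barycenters = weighted / waters.masses.reshape(-1, natoms).sum(axis=1)[:, np.newaxis]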
Expected behaviour
Calling the methods center_of_mass or center_of_gravity on a ResidueGroup, I would expect to retrieve the CoM/CoG of each residue in an array:
Actual behaviour
These methods return the CoM/CoG of the whole selection, just like calling protein.center_of_mass() or protein.atoms.center_of_mass():
Current solution
This works, but seems somehow slow when called for each frame in a bigger system.
Current version of MDAnalysis: 0.15.0