MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Some initial testing with awkward array #3776

Open · tylerjereddy opened this issue 1 year ago

tylerjereddy commented 1 year ago

I've been meaning to give awkward array a try because lipid and protein "residues" with different numbers of atoms seemed like a potentially good fit for the description:

Awkward Array is a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.

Arrays are dynamically typed, but operations on them are compiled and fast. Their behavior coincides with NumPy when array dimensions are regular and generalizes when they’re not.
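As a minimal illustration of that description (my own toy example, not from the report), reductions use the same idioms as NumPy but work per sublist:

import awkward as ak

# three sublists of different lengths, like residues with different atom counts
jagged = ak.Array([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]])
print(ak.mean(jagged, axis=1))  # per-sublist means: [2, 4, 5.5]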

The simplest use case I could think of was calculating residue centroids, which the team had previously optimized in various places, including gh-1903. With the diff below I do seem to produce reasonable results for the benchmark script further down, but unfortunately performance isn't anywhere near that of develop.

diff --git a/package/MDAnalysis/core/groups.py b/package/MDAnalysis/core/groups.py
index 097d4fa0b..f20354004 100644
--- a/package/MDAnalysis/core/groups.py
+++ b/package/MDAnalysis/core/groups.py
@@ -90,6 +90,7 @@ that two objects are of the same level, or to access a particular class::
 """
 from collections import namedtuple
 import numpy as np
+import awkward as ak
 import functools
 import itertools
 import numbers
@@ -1068,6 +1069,18 @@ class GroupBase(_MutableBase):
             weights = weights.astype(dtype, copy=False)
             return (coords * weights[:, None]).sum(axis=0) / weights.sum()

+        if comp == "residues" and weights is None and not unwrap and not wrap:
+            # we're doing a bunch of work on jagged residue
+            # data structures, so use a library designed for
+            # exactly these kinds of data structures
+            split_locs = np.cumsum(np.unique(atoms.resindices,
+                                             return_counts=True)[1])[:-1]
+            split_arr = np.split(atoms.residues.atoms.positions,
+                                 split_locs)
+            awk_arr = ak.Array(split_arr)
+            centers = ak.mean(awk_arr, axis=1)
+            return ak.to_numpy(centers)
+
         # When compound split caching gets implemented it will be clever to
         # preempt at this point whether or not stable sorting will be needed
         # later for unwrap (so that we don't split now with non-stable sort,

and script:

import time

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD

u = mda.Universe(PSF, DCD)
atoms = u.atoms
print("num residues:", atoms.n_residues)

# time the per-residue centroid calculation
start = time.perf_counter()
residue_centers = atoms.centroid(compound="residues")
end = time.perf_counter()

assert residue_centers.shape[0] == len(u.residues)

print("elapsed (s):", end - start)

elapsed (s): 0.05413260700152023 on the diff vs. elapsed (s): 0.0006493579930975102 on develop.

Perhaps this is partly because the test system is small, but I think it is largely because the "diversity" is low: a protein-only system has at most 20 possible residue types, and some of those likely share the same number of atoms, which further reduces the looping burden. So there really isn't that much computation to do in any case.
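For contrast, when residue diversity is that low, plain NumPy can get away with one vectorized reduction per distinct residue size. A rough sketch of that idea (only an illustration, not the actual develop implementation; it assumes atoms are sorted so that each residue's atoms are contiguous):

import numpy as np

def centroids_by_size(positions, resindices):
    # positions: (n_atoms, 3); resindices: per-atom residue index,
    # assumed sorted so each residue's atoms are contiguous
    _, counts = np.unique(resindices, return_counts=True)
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    centers = np.empty((len(counts), 3))
    for size in np.unique(counts):  # at most ~20 distinct sizes for a protein
        res_idx = np.flatnonzero(counts == size)
        # gather a regular (n_res, size, 3) block and reduce it in one call
        atom_idx = starts[res_idx][:, None] + np.arange(size)[None, :]
        centers[res_idx] = positions[atom_idx].mean(axis=1)
    return centers

With only a handful of distinct sizes, the Python-level loop body runs a few times at most, so there is very little overhead for awkward to amortize away.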

Line profiler shows tons of time on creating the new library object:

  1073         1          1.0      1.0      0.0          if comp == "residues" and weights is None and not unwrap and not wrap:
  1074                                                       # we're doing a bunch of work on jagged residue
  1075                                                       # data structures, so use a library designed for
  1076                                                       # exactly these kinds of data structures
  1077         4        157.0     39.2      0.3              split_locs = np.cumsum(np.unique(atoms.resindices,
  1078         3          2.0      0.7      0.0                                               return_counts=True)[1])[:-1]
  1079         2       1934.0    967.0      3.1              split_arr = np.split(atoms.residues.atoms.positions,
  1080         1          1.0      1.0      0.0                                   split_locs)
  1081         1      56799.0  56799.0     91.1              awk_arr = ak.Array(split_arr)
  1082         1       2547.0   2547.0      4.1              centers = ak.mean(awk_arr, axis=1)
  1083         1        881.0    881.0      1.4              return ak.to_numpy(centers)

Perhaps there are still interesting use cases; the other one I had in mind was iterating over a large variety of polygons (triangles, squares, etc.), but even there I'm not sure the heterogeneity is typically large enough to justify the overhead in most computational geometry scenarios.
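For what it's worth, a toy sketch of that polygon case (purely illustrative): triangles and quads sit naturally in one ragged array, and the centroid reduction stays vectorized across both shapes:

import awkward as ak

polygons = ak.Array([
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],              # triangle
    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],  # quad
    [[2.0, 2.0], [3.0, 2.0], [2.5, 3.0]],              # triangle
])
centroids = ak.mean(polygons, axis=1)  # average over each polygon's vertices
print(ak.to_list(centroids))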

hmacdope commented 1 year ago

IFF we decide to implement an updating MSD algorithm, this could definitely be useful. :)

richardjgowers commented 1 year ago

One nice thing about using awkward would be that we could remove a lot of our own code that does the same thing. So even if performance doesn't change, it's still a win.

tylerjereddy commented 1 year ago

I opened the cross-listed issue upstream to ask about the performance situation, or to see if I'm just badly misusing the library.

agoose77 commented 1 year ago

Just stumbled across this again whilst looking at our issue tracker.

The code posted above can be improved:

        if comp == "residues" and weights is None and not unwrap and not wrap:
            import awkward as ak
            # we're doing a bunch of work on jagged residue
            # data structures, so use a library designed for
            # exactly these kinds of data structures
            repeated_index_counts = np.unique(atoms.resindices, return_counts=True)[1]
            grouped = ak.unflatten(atoms.residues.atoms.positions, repeated_index_counts)
            centers = ak.mean(grouped, axis=1)
            return ak.to_numpy(centers)

This avoids from_iter entirely. I am not familiar enough with this package to know how to generate a bigger dataset; the test sample is very small, consisting of only ~200 sublists, whereas Awkward is designed to bring performance advantages for larger datasets.
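One rough way to synthesize a bigger jagged dataset for that comparison, without needing a larger topology file (the sizes here are arbitrary choices, not from the issue):

import time
import numpy as np
import awkward as ak

# synthetic jagged data: 200k "residues" with 5-29 "atoms" each
rng = np.random.default_rng(0)
counts = rng.integers(5, 30, size=200_000)
positions = rng.random((int(counts.sum()), 3), dtype=np.float32)

start = time.perf_counter()
grouped = ak.unflatten(positions, counts)         # ragged layout built directly
centers = ak.to_numpy(ak.mean(grouped, axis=1))
print("unflatten path (s):", time.perf_counter() - start)

start = time.perf_counter()
split_arr = np.split(positions, np.cumsum(counts)[:-1])
grouped = ak.Array(split_arr)                     # list-of-arrays path (from_iter)
centers2 = ak.to_numpy(ak.mean(grouped, axis=1))
print("list-of-arrays path (s):", time.perf_counter() - start)

assert np.allclose(centers, centers2)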