tylerjereddy opened this issue 2 years ago
IFF we decide to implement an updating MSD algorithm this could definitely be useful. :)
One nice thing about using awkward would be removing lots of our code which does the same thing. So even if performance doesn't change, it's still a win.
I opened the cross-listed issue upstream to ask about the performance situation, or to see if I'm just badly misusing the library.
Just stumbled across this again whilst looking at our issue tracker.
The code posted above can be improved:
```python
if comp == "residues" and weights is None and not unwrap and not wrap:
    import awkward as ak
    # we're doing a bunch of work on jagged residue
    # data structures, so use a library designed for
    # exactly these kinds of data structures
    repeated_index_counts = np.unique(atoms.resindices, return_counts=True)[1]
    grouped = ak.unflatten(atoms.residues.atoms.positions, repeated_index_counts)
    centers = ak.mean(grouped, axis=1)
    return ak.to_numpy(centers)
```
This avoids any `from_iter` entirely. I am not familiar enough with this package to know how to generate a bigger dataset; the test sample is very small, consisting of only ~200 sublists, whilst Awkward is designed to bring performance advantages for larger datasets.
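A larger synthetic jagged dataset could be generated along these lines (a rough sketch; the residue counts and sizes here are made up and not taken from any MDAnalysis test file):

```python
import numpy as np
import awkward as ak

rng = np.random.default_rng(0)

# made-up scale: 100,000 "residues" with anywhere from 1 to 30 "atoms" each
counts = rng.integers(1, 31, size=100_000)
positions = rng.random((int(counts.sum()), 3))

# same pattern as the snippet above: unflatten into a jagged
# (n_residues, var, 3) array and take the per-residue mean
grouped = ak.unflatten(positions, counts)
centers = ak.to_numpy(ak.mean(grouped, axis=1))
print(centers.shape)  # (100000, 3)
```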
I've been meaning to give awkward array a try because lipid and protein "residues" with different numbers of atoms seemed like a potentially good fit for the description:
The simplest use case I could think of was calculating residue centroids, which the team had previously optimized in various places including gh-1903. With the diff below, I do seem to produce reasonable results for the benchmark script, but unfortunately performance isn't even close to as good as develop:

elapsed (s): 0.05413260700152023 on the diff vs. elapsed (s): 0.0006493579930975102 on develop.

Perhaps this is because the test system is small, but in large part I think it is due to the fact that the "diversity" is low: there are at most 20 possible residue types for a protein-only system, and likely some have the same number of atoms, which further reduces the looping burden. So, I guess there really isn't that much computation to do in any case.
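For context on why the looping burden is so small: a pure-NumPy path only needs one vectorized reduction per distinct residue size. Here is a rough sketch of that group-by-size idea (my paraphrase of the kind of optimization done in gh-1903, not the actual develop code; `centroids_by_size` is a hypothetical helper):

```python
import numpy as np

def centroids_by_size(positions, counts):
    """Mean position per group, looping only over distinct group sizes.

    positions: (n_atoms, 3) array with atoms stored group by group.
    counts: (n_groups,) array giving the number of atoms in each group.
    """
    centers = np.empty((len(counts), 3), dtype=positions.dtype)
    # index of the first atom of each group
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    for size in np.unique(counts):
        mask = counts == size
        # gather every group of this size into one (n_groups_of_size, size, 3) block
        idx = starts[mask][:, None] + np.arange(size)
        centers[mask] = positions[idx].mean(axis=1)
    return centers
```

With at most ~20 distinct residue sizes in a protein-only system, that loop body runs only a handful of times, so there is very little overhead to amortize.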
Line profiler shows that most of the time is spent creating the new library object:
Perhaps there are still interesting use cases; the other one I had in mind was iteration over a large diversity of polygons (triangles, squares, etc.), but even there I'm not sure the heterogeneity is typically large enough to justify the overhead for most computational geometry scenarios.
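To make the polygon idea concrete, a mixed bag of triangles, squares, and pentagons maps directly onto one jagged array (purely illustrative; the shapes and quantities below are invented):

```python
import numpy as np
import awkward as ak

# a made-up mix of polygons with different vertex counts:
# one triangle, one square, one pentagon (each vertex is (x, y))
polygons = ak.Array([
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
    [[0.0, 0.0], [2.0, 0.0], [3.0, 1.0], [1.0, 2.0], [-1.0, 1.0]],
])

# per-polygon vertex centroid, regardless of vertex count
centroids = ak.to_numpy(ak.mean(polygons, axis=1))

# per-polygon perimeter: distance between consecutive vertices,
# closing each ring by rolling the vertex list by one
rolled = ak.concatenate([polygons[:, 1:], polygons[:, :1]], axis=1)
edge_lengths = np.sqrt(ak.sum((polygons - rolled) ** 2, axis=2))
perimeters = ak.sum(edge_lengths, axis=1)
```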