MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Change Memory layout of positions from Fortran- to C-contiguous #2210

Closed zemanj closed 3 years ago

zemanj commented 5 years ago

Is your feature request related to a problem? Please describe. Currently, all functions that operate (especially in-place) on atom positions obtain a copy of those positions via AtomGroup.positions, regardless of whether the positions are modified in- or out-of-place or are merely input data that is not altered. In some (if not many) cases, a tremendous speed-up could be achieved if the respective functions could obtain a view instead of a copy of an AtomGroup's positions. At the moment, this is not possible for two reasons:

  1. AtomGroup.positions gets its data from Universe.trajectory.ts.positions via fancy indexing, and fancy indexing means copying.
  2. More importantly, Universe.trajectory.ts.positions refers to coordinates.base.Timestep._pos, which is a Fortran-contiguous array. Therefore, views obtained from that array are unsuitable for use with C-level functions. Both points are illustrated by the short sketch below.
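
A quick plain-NumPy sketch of both points (using a stand-in buffer with the same dtype, shape, and order as coordinates.base.Timestep._pos; not MDAnalysis code):

import numpy as np

n_atoms = 1000
# stand-in for the currently Fortran-ordered Timestep._pos buffer
pos_f = np.zeros((n_atoms, 3), dtype=np.float32, order='F')

# 1. fancy indexing always copies, even for a contiguous index range
ix = np.arange(10, 20)
print(np.shares_memory(pos_f[ix], pos_f))   # False -> copy

# 2. a slice does return a view, but a view of an F-ordered buffer
#    is not C-contiguous and thus unusable for C-level functions as-is
view = pos_f[10:20]
print(view.base is pos_f)                   # True  -> view
print(view.flags['C_CONTIGUOUS'])           # False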

Describe the solution you'd like Change coordinates.base.Timestep.order from 'F' to 'C' and provide a way of obtaining a (if possible, C-contiguous) view of an AtomGroup's positions.

Describe alternatives you've considered Invent computers with two-dimensional main memory...

Additional context Two candidates that come to my mind in terms of speed-up are *Group.center*() and *Group.wrap() when the index array *Group.atoms.ix is a (contiguous) range.

A possible drawback of changing the positions/velocities/forces memory layout is that reading and (probably to a lesser extent) writing of DCD trajectories will be a bit slower since DCD stores positions etc. as Fortran-contiguous arrays.

If you have any additional concerns, please let me know!

zemanj commented 5 years ago

Trajectory Reading Performance Impact

I tested the runtimes of trajectory reading (from a RAM disk) for various trajectory formats.

tl;dr: As expected, DCD becomes significantly slower when holding positions in C-contiguous arrays. Reading any other format becomes either faster or the change is negligible.

Test code:

import os
import shutil
from collections import defaultdict
from warnings import filterwarnings
from timeit import timeit
import MDAnalysis as mda
from MDAnalysisTests.datafiles import *
filterwarnings('ignore')

###### HELPER FUNCTIONS ######

def copy_trajectories(destination, trajectories):
    for i in range(len(trajectories)):
        trjfile, name = trajectories[i]
        if isinstance(trjfile, tuple):
            trjfile = list(trjfile)
            for j in range(len(trjfile)):
                dest = os.path.join(destination, os.path.basename(trjfile[j]))
                shutil.copyfile(trjfile[j], dest)
                trjfile[j] = dest
            trjfile = tuple(trjfile)
            trajectories[i] = (trjfile, name)
        else:
            dest = os.path.join(destination, os.path.basename(trjfile))
            shutil.copyfile(trjfile, dest)
            trajectories[i] = (dest, name)

def universe(trjfile, name):
    if isinstance(trjfile, tuple):
        return mda.Universe(*trjfile, format=formats[name])
    return mda.Universe(trjfile, format=formats[name])

def readall(trjfile, name):
    trj = universe(trjfile, name).trajectory
    for _ in trj:
        pass

def get_atoms_frames(trjfile, name):
    u = universe(trjfile, name)
    return u.atoms.n_atoms, u.trajectory.n_frames

def time_trajectory_read(trjfile, name):
    n_atoms, n_frames = get_atoms_frames(trjfile, name)
    normfactor = 1.0 / (n_runs * n_frames * n_atoms)
    # Read once:
    readall(trjfile, name)
    # Get runtime per frame:
    t = timeit(lambda: readall(trjfile, name), number=n_runs) * normfactor
    # Print result:
    print("{:14s}     {:.3e}  {:>5}  {:>6}  {:>5}"
          "".format(name, t, n_runs, n_frames, n_atoms))

###### SETUP ######

n_runs = 100

formats = defaultdict(lambda: None)
formats['LAMMPSDUMP'] = 'LAMMPSDUMP'
formats['DLP'] = 'HISTORY'

ramdisk = '/media/ramdisk'

trajectories = [(CRD, 'CRD'),
                (DCD, 'DCD'),
                (DLP_HISTORY, 'DLP'),
                (DMS, 'DMS'),
                (GMS_ASYMSURF, 'GMS'),
                (GRO, 'GRO'),
                (GSD, 'GSD'),
                (LAMMPSdata, 'LAMMPS'),
                (LAMMPSDUMP, 'LAMMPSDUMP'),
                (MMTF, 'MMTF'),
                (MMTF_gz, 'MMTF_gz'),
                (mol2_molecules, 'MOL2'),
                (NCDF, 'NCDF'),
                (PDB_multiframe, 'PDB_multiframe'),
                (PDBQT_input, 'PDBQT'),
                (PDB_xvf, 'PDB_xvf'),
                (PQR, 'PQR'),
                (TRR, 'TRR'),
                ((TRZ_psf, TRZ), 'TRZ'),
                (TXYZ, 'TXYZ'),
                (XTC, 'XTC'),
                (XYZ, 'XYZ')]

copy_trajectories(ramdisk, trajectories)

###### RUNTIME MEASUREMENT ######

print("Format          s/frame/atom   runs  frames  atoms")
print("--------------------------------------------------")
for trjfile, name in trajectories:
    time_trajectory_read(trjfile, name)

Comparison of Test Results: new (C-contiguous) vs. old (F-contiguous)

             new, C-cont.  old, F-cont.  (old-new)/old
Format       s/frame/atom  s/frame/atom  relative gain
------------------------------------------------------
CRD             1.060E-05     1.084E-05          2.21%
DCD             2.554E-08     2.171E-08        -17.64%
DLP             1.284E-05     1.311E-05          2.06%
DMS             1.725E-05     1.821E-05          5.27%
GMS             6.183E-05     6.282E-05          1.58%
GRO             7.737E-06     7.962E-06          2.83%
GSD             5.728E-07     5.732E-07          0.07%
LAMMPS          1.462E-05     1.477E-05          1.02%
LAMMPSDUMP      3.420E-05     3.415E-05         -0.15%
MMTF            7.418E-06     7.407E-06         -0.15%
MMTF_gz         4.304E-06     4.347E-06          0.99%
MOL2            3.149E-06     3.175E-06          0.82%
NCDF            2.225E-08     2.262E-08          1.64%
PDB_multiframe  2.276E-06     2.315E-06          1.68%
PDBQT           5.230E-06     5.253E-06          0.44%
PDB_xvf         1.015E-05     1.024E-05          0.88%
PQR             4.238E-06     4.245E-06          0.16%
TRR             1.537E-07     1.555E-07          1.16%
TRZ             7.711E-07     8.087E-07          4.65%
TXYZ            1.495E-04     1.485E-04         -0.67%
XTC             4.778E-08     4.868E-08          1.85%
XYZ             1.512E-06     1.518E-06          0.40%
------------------------------------------------------
Average         1.584E-05     1.593E-05          0.58%
orbeckst commented 5 years ago

Thorough impact analysis!!!

Note that historically the speed of DCD reading went down after the transition to cython, so DCD has already taken a hit compared to pre 0.16 (?).

zemanj commented 5 years ago

@orbeckst I tried running the tests on an HDD, an SSD, and also from an NFS share, but the test files are so small that the OS caches them, therefore yielding identical results. I don't have a large DCD file available (i.e., larger than my 16 GB of main memory). Or is there a way to enforce reads from disk?

zemanj commented 5 years ago

I'm pretty sure that the impact will be much lower if the file cannot be cached. The real bottleneck should then be disk I/O and the in-memory array transposition penalty will probably be hidden.

zemanj commented 5 years ago

@orbeckst Concerning the speedup of other operations, I noticed that wrapping the positions of an entire medium-sized system becomes seven times faster. Of course, that speedup won't be available if the AtomGroup indices are strided or shuffled. In such cases, one cannot get a C-contiguous view of the coordinates (i.e., a copy has to be made) and the performance is likely the same as before. I will provide some more robust numbers in PR #2211 soon.

zemanj commented 5 years ago

@orbeckst

Note that historically the speed of DCD reading went down after the transition to cython, so DCD has already taken a hit compared to pre 0.16 (?).

Do you recall how it used to be implemented before?

I'm not a big fan of cython tbh. I'd prefer MDAnalysis being mainly a C++-based framework with a Python interface connected via pybind11. But that would basically require re-implementing major parts of the code base, and that's unlikely to happen anytime soon.

kain88-de commented 5 years ago

What is the effect of this on distance calculations and complex selections? The FORTRAN order is chosen because it fits the compute-intensive applications we have. Before a change is made this should be checked as well.

zemanj commented 5 years ago

@kain88-de It has the potential to improve the performance of all operations on positions/velocities/forces that obtain their data from an AtomGroup whose indices can be represented as a slice. Since many operations are performed on residues or molecules, whose atoms are often stored consecutively in a trajectory, one can avoid unnecessary copies of the respective position data.
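
As a minimal NumPy-only illustration of that idea (hypothetical buffer, not the actual Timestep code): with a C-ordered buffer, a group whose indices form a contiguous range can be served by a slice, i.e. a writable C-contiguous view, so in-place operations hit the buffer directly without any copy:

import numpy as np

pos_c = np.zeros((15000, 3), dtype=np.float32)   # C order by default
ix = np.arange(100, 200)                         # e.g. one residue's indices

res_pos = pos_c[ix[0]:ix[-1] + 1]                # slice instead of fancy indexing
print(res_pos.flags['C_CONTIGUOUS'])             # True
print(np.shares_memory(res_pos, pos_c))          # True -> view, not copy

res_pos += [1.0, 2.0, 3.0]                       # writes straight into the buffer
print(pos_c[100])                                # [1. 2. 3.]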

What is the effect of this on distance calculations and complex selections.

For distance calculations, which usually don't alter positions, one can potentially feed the algorithms with C-contiguous views instead of copies. Note, however, that it is not just the copying that slows down the operations, but also the fancy indexing itself, which can be avoided if an AtomGroup's indices (ix) can be represented by a slice. Complex selections that are distance-based usually lead to AtomGroups with non-contiguous or non-uniformly strided indices. For such groups, the memory layout is irrelevant.

The FORTRAN order is chosen because it fits the compute-intensive applications we have.

Could you please elaborate? Do you have an example of an operation that requires a Fortran-contiguous memory layout?

I know that having positions as xxx...xxxyyy...yyyzzz...zzz in memory can yield huge performance boosts when optimizing for SSE/AVX vector ops. However, this is usually only true as long as data fits in the CPU cache. For larger amounts of data, it is better to adopt an L1-optimized and cache-aligned hybrid scheme where positions are stored in small Fortran-contiguous blocks.
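
A tiny sketch of what such a hybrid layout could look like (purely hypothetical, not an MDAnalysis data structure): positions grouped into small blocks, each block holding its x, y, and z values as separate contiguous rows, so per-axis sweeps stay inside L1 while the blocks themselves remain consecutive in memory:

import numpy as np

n_atoms, block_size = 15000, 256
n_blocks = -(-n_atoms // block_size)             # ceiling division
hybrid = np.zeros((n_blocks, 3, block_size), dtype=np.float32)
# hybrid[b, 0], hybrid[b, 1], hybrid[b, 2] hold the x, y, z values of block b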

AtomGroup.positions, by the way, always returns a C-contiguous copy of the current Timestep's positions. That is (up to now) independent of the memory layout of coordinates.base.Timestep._pos.

Before a change is made this should be checked as well.

I absolutely agree!

tylerjereddy commented 5 years ago

I'd prefer MDAnalysis being mainly a C++-based framework with a Python interface connected via pybind11.

-1 on that; in my view the language choices are also to enable the broader Python community to make contributions & not to hit us too hard with reviewer burden.

zemanj commented 5 years ago

@tylerjereddy Point taken. Doesn't change my opinion on cython, though. :wink:

richardjgowers commented 5 years ago

I think Tyler is right wrt this package, it’s a python package so really we should use Cython.

That said, I'd be curious what speed-up we can get if we rewrite our distance code. I know you've tinkered with this before, maybe we could do a C++/pybind package that provides basic distance calculations (like BLAS/LAPACK) in a non-MDA-specific way.

zemanj commented 5 years ago

Maybe I should clarify (and backpedal) a bit: I don't generally think that using Cython is a bad choice. I just had quite bad experiences when allowing it to automatically mangle C++ templates into its auto-generated code, or when trying to use LTO between different compilation units. I also saw reproducible changes in performance of parts of the code when changing something (seemingly) unrelated a few lines away.

I therefore usually prefer having low-level C/C++ implementations in separate compilation units and using Cython just as glue code. That does not apply to functions which should be inlined (as I mentioned, tackling inlining with LTO can be problematic).

On the other hand, having the possibility to call numpy methods (most of which are hard to beat in terms of performance) from Cython is something invaluable, so there's usually a trade-off between pure Cython and hybrid C/C++/Cython implementations.

tylerjereddy commented 5 years ago

Slightly off-topic, but another thing I'm curious about are possible MD applications for __array_function__ and its low-level equivalent. Perhaps some of you have thought about that already wrt custom handling of coordinates or residues or units, etc.

kain88-de commented 5 years ago

Could you please elaborate? Do you have an example of an operation that requires a Fortran-contiguous memory layout?

Have a look at this loop. Here we go over the z-coordinate last, so we read a whole bunch of z values before y or x are updated. This allows us to make optimal use of a cache line on the z values, i.e. we can do several iterations of the loop without reading another cache line. This means the CPU can prefetch memory and keep the cores busy. Using a C-layout would require us to read from memory much more often, as x and y values we do not use in the loop would need to be read as well. Fetching memory is slow. This pattern keeps repeating for almost all compute-intensive calculations that MDAnalysis does. Therefore an F-layout makes some sense for us.
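
An illustrative NumPy check of that memory-layout difference (not the loop in question itself): in an F-ordered (n, 3) array each coordinate axis occupies one contiguous run of memory, whereas in a C-ordered array the per-axis values are strided:

import numpy as np

n = 15000
pos_f = np.zeros((n, 3), dtype=np.float32, order='F')
pos_c = np.zeros((n, 3), dtype=np.float32, order='C')

# the z column of the F-ordered array is one contiguous block (4-byte stride) ...
print(pos_f[:, 2].flags['C_CONTIGUOUS'])   # True
# ... while in the C-ordered array consecutive z values are 12 bytes apart
print(pos_c[:, 2].flags['C_CONTIGUOUS'])   # False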

I do understand that it can have a performance improvement for slicing. And it's great you made the benchmarks to show how much!

Because compute and slicing need orthogonal optimizations, I think it's beneficial to benchmark both when we optimize one or the other.

zemanj commented 5 years ago

@kain88-de OK, then I can put your mind at ease: none of the functions in calc_distances.h (really, not even a single one) and, likewise, in lib.distances is affected by changing the memory layout of coordinates.base.Timestep._pos.

kain88-de commented 5 years ago

What makes you so sure of this?

zemanj commented 5 years ago

@kain88-de You can tell that already from the function signatures in calc_distances.h (and also from the loops in the function bodies, of course). All functions take positions as coordinate* pointers, i.e. the arrays behind the pointers must be contiguous arrays of type coordinate, which is defined as

typedef float coordinate[3];

where each coordinate holds the xyz data of one position vector (corresponds to rvec in gromacs, in case you're familiar with that code). This means that any input array to those functions must have a C-contiguous xyzxyz...xyz memory layout for the functions to work as desired.
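
A quick NumPy demonstration of that layout requirement (any (n, 3) float32 array serves as an example; this is not MDAnalysis code):

import numpy as np

pos = np.array([[0., 1., 2.],
                [3., 4., 5.]], dtype=np.float32)   # C order by default
# memory order: x0 y0 z0 x1 y1 z1 -- exactly what a `coordinate*`
# (one float[3] per atom) C function expects to walk over
print(pos.ravel(order='K'))                        # [0. 1. 2. 3. 4. 5.]

# the same data in Fortran order interleaves the atom axis first
pos_f = np.asfortranarray(pos)
print(pos_f.ravel(order='K'))                      # [0. 3. 1. 4. 2. 5.]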

The only exceptions to that rule are minimum_image(), minimum_image_triclinic(), and _calc_dihedral_angle(), which are designed to process single vectors. So these functions don't even take coordinate arrays as input, but instead single (double precision) xyz position/distance vectors, which of course must not be strided.

If that is not enough to convince you, let's have a look at the Python interfaces of those functions in lib.distances. There, every function is decorated with the @check_coords() decorator, which ensures that input coordinate arrays are C-contiguous (line 1191 in lib/util.py):

coords = coords.astype(np.float32, order='C', copy=enforce_copy)

Now, if you wonder why position arrays are converted to C-layout in the first place if they are generally Fortran-contiguous: Turns out they aren't! Of course, they are stored in Fortran-contiguous order in coordinates.base.Timestep._pos a.k.a. Timestep.positions. However, all functions performing computations on positions at some point obtain their data from an AtomGroup. To be precise, they access AtomGroup.positions, and that looks like this:

@property
def positions(self):
    return self.universe.trajectory.ts.positions[self.ix]

Note the fancy indexing with the index array self.ix! According to the numpy docs, fancy indexing does not preserve the memory layout (in fact, there are no guarantees made):

The memory layout of an advanced indexing result is optimized for each indexing operation and no particular memory order can be assumed.

So, even for all positions in a Universe, you get

>>> print(u.atoms.positions.flags)
C_CONTIGUOUS : True
F_CONTIGUOUS : False
...

This means that due to fancy indexing, numpy cannot simply return a view (or make a simple memcpy), but instead reorders positions to be C-contiguous while copying.
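
A quick check makes this explicit (same u as above; just an illustrative snippet):

>>> import numpy as np
>>> np.shares_memory(u.atoms.positions, u.trajectory.ts.positions)
False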

That is exactly the reason why I want to change the memory layout of the coordinates.base.Timestep._pos buffer: Instead of transposing data on every single access, it should be better (I think) to do that exactly once per read. Actually, most trajectory formats store their positions as C-contiguous arrays anyway, so that's only an issue with DCD trajectories.

When you said that

The FORTRAN order is chosen because it fits the compute-intensive applications we have.

I thought you were referring to functions acting directly on Timestep._pos. To my knowledge, there exists no such function, so I wondered what that could be. Since I couldn't find any other explanation for the 'F' order, I assumed it was (historically) chosen to speed up DCD reading.

Now I hope that this was enough to convince you :-)

zemanj commented 5 years ago

To my knowledge, there exists no such function

Turns out I was wrong. There exist ~three~ four such functions:

So these are the test candidates then. I'll try to provide some numbers ~tomorrow~.

zemanj commented 5 years ago

@tylerjereddy Thanks for the hint! I wasn't aware of the __array_function__ interface. That might offer interesting possibilities such as automatic / hidden PBC handling. I'm unsure if complexity hiding is a good idea in that particular case, but the interface sounds very interesting nonetheless. I especially like the fall-through behavior.
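
For anyone unfamiliar with the protocol, here is a minimal toy example of __array_function__ dispatch (purely illustrative, not an MDAnalysis API; requires NumPy >= 1.17, or 1.16 with the experimental flag enabled):

import numpy as np

class PBCPositions:
    """Toy wrapper holding positions plus an orthorhombic box."""
    def __init__(self, positions, box):
        self.positions = np.asarray(positions, dtype=np.float32)
        self.box = np.asarray(box, dtype=np.float32)

    def __array_function__(self, func, types, args, kwargs):
        # hand numpy the coordinates after (naive) PBC wrapping
        args = [a.positions % a.box if isinstance(a, PBCPositions) else a
                for a in args]
        return func(*args, **kwargs)

p = PBCPositions([[12.0, 1.0, 2.0]], box=[10.0, 10.0, 10.0])
print(np.mean(p, axis=0))   # dispatches via __array_function__ -> [2. 1. 2.]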

zemanj commented 5 years ago

I measured the runtimes of the affected functions.

Test code

import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis.rms import RMSD
from MDAnalysis import transformations
from timeit import timeit

u = mda.Universe("topol.tpr", "traj_comp.xtc")
ts = u.trajectory[0]
translate = transformations.translate([0, 0, 0])
rotate = transformations.rotateby(90, [1, 1, 1], point=[0, 0, 0])
center_in_box = transformations.center_in_box(u.atoms[:10], wrap=True)
rmsd = RMSD(u.select_atoms('molnum 1'), groupselections=['molnum 1'])

n_runs = 100000
t = timeit(lambda: translate(ts), number=n_runs) / n_runs * 1.0e6
print("translate: {:.1f} µs".format(t))

t = timeit(lambda: rotate(ts), number=n_runs) / n_runs * 1.0e6
print("rotate: {:.1f} µs".format(t))

t = timeit(lambda: center_in_box(ts), number=n_runs) / n_runs * 1.0e6
print("center_in_box: {:.1f} µs".format(t))

n_runs = 1000
n_frames = 10
t = timeit(lambda: rmsd.run(stop=n_frames), number=n_runs) / (n_runs * n_frames) * 1.0e6
print("RMSD.run: {:.1f} µs".format(t))

Results

Fortran order

translate: 6.5 µs
rotate: 153.1 µs
center_in_box: 107.0 µs
RMSD.run: 1100.2 µs

C order

translate: 65.1 µs
rotate: 255.4 µs
center_in_box: 227.3 µs
RMSD.run: 1343.9 µs

So it appears that changing the memory layout to 'C' drastically deteriorates the performance of these functions.

I wondered how this could be, especially since transformations.translate() essentially does nothing but ts.positions += vector. I would have expected that operation to be at most two times slower with 'C' order than with 'F' since it might happen that the array has to be fetched from memory three times instead of once. I really don't know what numpy is exactly doing here, but it is certainly doing it very slowly.
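
For reference, the bare effect can be reproduced with plain NumPy, independent of MDAnalysis (absolute numbers are of course machine- and numpy-version-dependent):

import numpy as np
from timeit import timeit

vec = np.ones(3, dtype=np.float32)
for order in ('C', 'F'):
    pos = np.zeros((15000, 3), dtype=np.float32, order=order)
    t = timeit(lambda: np.add(pos, vec, out=pos), number=10000) / 10000
    print("{} order: {:.1f} µs".format(order, t * 1.0e6))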

I therefore introduced a cython function lib.util.coords_add_vec(coordinates, vector) which replaces the slow numpy plus-equals operation. It takes a float32 coordinate array and a vector of arbitrary dtype as input (non-float32 dtypes are converted to float64; the addition is then performed in double precision).
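
For clarity, here is a rough pure-Python sketch of the intended semantics (the actual function is implemented in Cython; name and behavior are taken from the description above):

import numpy as np

def coords_add_vec(coordinates, vector):
    """In-place coordinates += vector for a float32 (n, 3) coordinate array."""
    coordinates = np.asarray(coordinates)
    if coordinates.dtype != np.float32:
        raise TypeError("coordinates must be of dtype float32")
    vector = np.asarray(vector)
    if vector.dtype != np.float32:
        # non-float32 vectors: perform the addition in double precision
        vector = vector.astype(np.float64)
    np.add(coordinates, vector, out=coordinates, casting='same_kind')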

C order with cython add

translate: 18.5 µs
rotate: 134.3 µs
center_in_box: 95.3 µs
RMSD.run: 1065.0 µs

Finally, the impact of C-order vs F-order on the runtimes looks like this:

Judging from the absolute impact, C order with the cython add is faster than F order. To allow a fair comparison, I also wrote a cython function that performs the addition for F-contiguous position arrays (of course with the correct loop order), but that gained exactly nothing over the numpy summation.

kain88-de commented 5 years ago

tl;dr: Overall performance gains look good to me. Compare on our standard hardware with our standard benchmark suite.

Thank you for doing the benchmarks. They would make good additions to our benchmark library. It would also be nice to share the universe and give some sizing information (Or fill an empty universe). We might have influences here from numpy and lapack versions!

I would have expected that operation to be at most two times slower with 'C' order than with 'F' since it might happen that the array has to be fetched from memory three times instead of once.

No, the array is fetched a lot more and fetching from memory is slow. You also have to keep in mind that numpy will call a lapack function here. So the final speed depends on your CPU and lapack installation. On conda, the default is the mkl. Personally, I can see how a Fortran order code can be faster in this case. The register only needs to hold one value of vector for the current axis and all the rest can be used for values from ts.positions; this minimizes memory reads and maximizes throughput.

The final results look good to me. It would be nice to have them as a standard benchmark, then we can run them on a machine from @orbeckst against several numpy and python versions.

zemanj commented 5 years ago

Some benchmarks of AtomGroup.wrap() comparing C and Fortran memory layout of Timestep.positions. The code of AtomGroup.wrap() is the same as in PR #2211 (except for the memory layout in case of Fortran order), as well as the test code/system/data (see this comment).

tl;dr: Changing the memory layout from F- to C-contiguous can have a large positive impact on in-place wrapping for sliceable AtomGroups, whereas the performance gain for AtomGroups with scrambled indices is negligible.

contiguous AtomGroup with 15000 atoms (u.atoms)

| compound | center | inplace | µs / frame (C order) | µs / frame (F order) | speedup factor (F/C order) |
|----------|--------|---------|----------------------|----------------------|----------------------------|
| atoms | n/a | False | 199.6 | 213.4 | 1.07 |
| atoms | n/a | True | 49.8 | 394.5 | 7.92 |
| group | com | False | 238.2 | 253.4 | 1.06 |
| group | com | True | 85.3 | 434.5 | 5.09 |
| group | cog | False | 206.4 | 223.2 | 1.08 |
| group | cog | True | 56.0 | 404.3 | 7.22 |
| segments | com | False | 489.3 | 547.3 | 1.12 |
| segments | com | True | 339.9 | 730.2 | 2.15 |
| segments | cog | False | 457.6 | 505.7 | 1.11 |
| segments | cog | True | 299.7 | 688.6 | 2.30 |
| residues | com | False | 1198.2 | 1776.2 | 1.48 |
| residues | com | True | 1035.7 | 1961.3 | 1.89 |
| residues | cog | False | 1147.2 | 1722.6 | 1.50 |
| residues | cog | True | 994.9 | 1907.1 | 1.92 |
| molecules | com | False | 843.2 | 1383.4 | 1.64 |
| molecules | com | True | 690.1 | 1567.2 | 2.27 |
| molecules | cog | False | 803.5 | 1337.9 | 1.67 |
| molecules | cog | True | 650.1 | 1524.7 | 2.35 |
| fragments | com | False | 829.2 | 1369.1 | 1.65 |
| fragments | com | True | 674.4 | 1552.6 | 2.30 |
| fragments | cog | False | 788.2 | 1322.1 | 1.68 |
| fragments | cog | True | 635.7 | 1505.3 | 2.37 |
| average | com/cog | False | 654.6 | 968.6 | 1.48 |
| average | com/cog | True | 501.1 | 1151.8 | 2.30 |
| average | com/cog | False/True | 577.8 | 1060.2 | 1.83 |

non-contiguous AtomGroup with 15000 atoms (u.atoms[::-1])

| compound | center | inplace | µs / frame (C order) | µs / frame (F order) | speedup factor (F/C order) |
|----------|--------|---------|----------------------|----------------------|----------------------------|
| atoms | n/a | False | 197.9 | 211.6 | 1.07 |
| atoms | n/a | True | 176.8 | 387.4 | 2.19 |
| group | com | False | 233.3 | 248.9 | 1.07 |
| group | com | True | 140.6 | 425.5 | 3.03 |
| group | cog | False | 205.5 | 223.4 | 1.09 |
| group | cog | True | 112.7 | 407.0 | 3.61 |
| segments | com | False | 479.4 | 550.4 | 1.15 |
| segments | com | True | 386.6 | 734.5 | 1.90 |
| segments | cog | False | 440.4 | 509.4 | 1.16 |
| segments | cog | True | 348.1 | 692.5 | 1.99 |
| residues | com | False | 1183.0 | 1778.1 | 1.50 |
| residues | com | True | 1079.9 | 1980.3 | 1.83 |
| residues | cog | False | 1136.6 | 1726.2 | 1.52 |
| residues | cog | True | 1040.0 | 1916.5 | 1.84 |
| molecules | com | False | 854.0 | 1387.3 | 1.62 |
| molecules | com | True | 766.6 | 1564.0 | 2.04 |
| molecules | cog | False | 816.3 | 1334.7 | 1.64 |
| molecules | cog | True | 726.9 | 1525.6 | 2.10 |
| fragments | com | False | 840.0 | 1367.1 | 1.63 |
| fragments | com | True | 753.1 | 1548.9 | 2.06 |
| fragments | cog | False | 798.9 | 1325.0 | 1.66 |
| fragments | cog | True | 709.5 | 1502.3 | 2.12 |
| average | com/cog | False | 653.2 | 969.3 | 1.48 |
| average | com/cog | True | 567.3 | 1153.1 | 2.03 |
| average | com/cog | False/True | 610.3 | 1061.2 | 1.74 |

scrambled AtomGroup with 15000 atoms (u.atoms[shuffled_indices])

| compound | center | inplace | µs / frame (C order) | µs / frame (F order) | speedup factor (F/C order) |
|----------|--------|---------|----------------------|----------------------|----------------------------|
| atoms | n/a | False | 201.7 | 212.5 | 1.05 |
| atoms | n/a | True | 378.4 | 392.5 | 1.04 |
| group | com | False | 238.7 | 251.7 | 1.05 |
| group | com | True | 411.8 | 439.7 | 1.07 |
| group | cog | False | 217.0 | 220.5 | 1.02 |
| group | cog | True | 383.6 | 399.0 | 1.04 |
| segments | com | False | 538.4 | 542.0 | 1.01 |
| segments | com | True | 705.2 | 723.9 | 1.03 |
| segments | cog | False | 489.3 | 501.1 | 1.02 |
| segments | cog | True | 670.7 | 681.6 | 1.02 |
| residues | com | False | 1771.1 | 1764.9 | 1.00 |
| residues | com | True | 1940.2 | 1954.4 | 1.01 |
| residues | cog | False | 1719.7 | 1717.0 | 1.00 |
| residues | cog | True | 1892.0 | 1902.4 | 1.01 |
| molecules | com | False | 1377.8 | 1385.3 | 1.01 |
| molecules | com | True | 1546.2 | 1568.2 | 1.01 |
| molecules | cog | False | 1323.1 | 1346.6 | 1.02 |
| molecules | cog | True | 1494.7 | 1522.7 | 1.02 |
| fragments | com | False | 1359.3 | 1369.3 | 1.01 |
| fragments | com | True | 1528.4 | 1560.1 | 1.02 |
| fragments | cog | False | 1319.2 | 1330.0 | 1.01 |
| fragments | cog | True | 1485.6 | 1506.7 | 1.01 |
| average | com/cog | False | 959.6 | 967.4 | 1.01 |
| average | com/cog | True | 1130.6 | 1150.1 | 1.02 |
| average | com/cog | False/True | 1045.1 | 1058.7 | 1.01 |

zemanj commented 5 years ago

@kain88-de Thank you for your comment! Sorry it took me so long to reply. I have a couple of comments/questions:

Thank you for doing the benchmarks. They would make good additions to our benchmark library. It would also be nice to share the universe and give some sizing information (Or fill an empty universe). We might have influences here from numpy and lapack versions!

Good point! I'll check how to best include my benchmarks in our ASV benchmark suite. Regarding influences of different numpy/lapack versions, I don't expect any differences (see my comments/questions below).

I would have expected that operation to be at most two times slower with 'C' order than with 'F' since it might happen that the array has to be fetched from memory three times instead of once.

No, the array is fetched a lot more and fetching from memory is slow.

I'm a bit baffled that this is the case for you... on my machine, when executing coordinates += vector (shapes (n, 3) and (3,), respectively), numpy fetches both arrays exactly once, independent of memory layout and the value of n. The slowdown is caused by numpy repeatedly filling an intermediate 32K buffer with data from vector, which takes place in L1d/L2 cache (except for the first touch of the buffer - that of course requires committing physical main memory). So, on my machine, the slowdown is caused by one additional touch of main memory, n in-cache __memmove_ssse3()s and probably (haven't checked) a lot of unnecessary L1d misses due to numpy's choice of buffer size (32K is equal to the size of most L1d caches and thus, the buffer needs to be moved between L1d and L2).

Now I'm curious why the arrays are fetched multiple times in your environment... I checked with numpy 1.10.4, 1.16.2, both versions tested with conda and pip installs. Do you have any kind of special setup? It's important for me to know, because if numpy behaves differently in different environments, my optimization attempts will probably be futile...

You also have to keep in mind that numpy will call a lapack function here. So the final speed depends on your CPU and lapack installation. On conda, the default is the mkl.

As I mentioned above, I tried different versions of numpy installed both via pip and via conda. In all cases, no lapack functions are called. Instead, numpy uses its own SSE code (sse2_binary_add_FLOAT(), generated from this "template"). Could you tell me which pip/conda/custom build you have installed? Or maybe I missed something?

Personally, I can see how a Fortran order code can be faster in this case. The register only needs to hold one value of vector for the current axis and all the rest can be used for values from ts.positions; this minimizes memory reads and maximizes throughput.

I agree! The difference shouldn't be a factor of 10, though. That's just due to the inefficient way numpy handles broadcasting in the C-contiguous case...

So in the end, I might be better off fixing numpy instead of implementing workarounds in MDAnalysis. What do you think?

P.S.: In case you want to take a look, here's the relevant gdb output (stripped down to the interesting parts):

$ gdb python3.6
(gdb) run
>>> import numpy as np
>>> pos = np.zeros((15000, 3), dtype=np.float32)
>>> vec = np.zeros(3, dtype=np.float32)
>>> print("0x{:x}".format(pos.__array_interface__["data"][0]))
0xf90d40
>>> print("0x{:x}".format(vec.__array_interface__["data"][0]))
0xb894e0
>>> ^C
(gdb) break ufunc_generic_call
(gdb) cont
>>> pos += vec

Thread 1 "python3.6" hit Breakpoint 1, ufunc_generic_call (ufunc=0xb54b60, args=(<numpy.ndarray at remote 0x7ffff7f012b0>, <numpy.ndarray at remote 0x7ffff67fca30>, <numpy.ndarray at remote 0x7ffff7f012b0>), kwds=0x0)
(gdb) break alloc_perturb
(gdb) cont

Thread 1 "python3.6" hit Breakpoint 2, _int_malloc (av=av@entry=0x7ffff7dcfc40 <main_arena>, bytes=bytes@entry=32768)
(gdb) step 4
alloc_perturb (n=32768, p=0xfbcc70 "\030", <incomplete sequence \374>)
(gdb) del 2
(gdb) break sse2_binary_add_FLOAT
(gdb) break _contig_to_contig
(gdb) break _mm_add_ps
(gdb) cont

Thread 1 "python3.6" hit Breakpoint 4, _contig_to_contig (dst=0xfbcc70 "\030", <incomplete sequence \374>, __NPY_UNUSED_TAGGEDdst_stride=4, src=0xb894e0 "", __NPY_UNUSED_TAGGEDsrc_stride=4, N=3, src_itemsize=4, __NPY_UNUSED_TAGGEDdata=0x0)
(gdb) ignore 4 15003
(gdb) cont

Thread 1 "python3.6" hit Breakpoint 3, run_binary_simd_add_FLOAT (dimensions=<optimized out>, steps=<optimized out>, args=<optimized out>)
(gdb) step
sse2_binary_add_FLOAT (n=8192, ip2=0xfbcc70, ip1=0xf90d40, op=0xf90d40)
(gdb) cont

Thread 1 "python3.6" hit Breakpoint 5, sse2_binary_add_FLOAT (n=8192, ip2=0xfbcc70, ip1=0xf90d40, op=0xf90d40)
(gdb) step
_mm_add_ps (__B=..., __A=...)
(gdb) p &__B
$1 = (__m128 *) 0xfbcc70
(gdb) p &__A
$2 = (__m128 *) 0xf90d40
(gdb) ignore 5 11249
(gdb) cont

Thread 1 "python3.6" hit Breakpoint 3, run_binary_simd_add_FLOAT (dimensions=<optimized out>, steps=<optimized out>, args=<optimized out>)
205 numpy/core/src/umath/simd.inc.src: No such file or directory.
(gdb) step
sse2_binary_add_FLOAT (n=8192, ip2=0xfbcc70, ip1=0xf98d40, op=0xf98d40)
(gdb) cont

Thread 1 "python3.6" hit Breakpoint 3, run_binary_simd_add_FLOAT (dimensions=<optimized out>, steps=<optimized out>, args=<optimized out>)
205 in numpy/core/src/umath/simd.inc.src
(gdb) step
sse2_binary_add_FLOAT (n=8192, ip2=0xfbcc70, ip1=0xfa0d40, op=0xfa0d40)
(gdb) cont

Thread 1 "python3.6" hit Breakpoint 3, run_binary_simd_add_FLOAT (dimensions=<optimized out>, steps=<optimized out>, args=<optimized out>)
(gdb) step
sse2_binary_add_FLOAT (n=8192, ip2=0xfbcc70, ip1=0xfa8d40, op=0xfa8d40)
(gdb) cont

Thread 1 "python3.6" hit Breakpoint 3, run_binary_simd_add_FLOAT (dimensions=<optimized out>, steps=<optimized out>, args=<optimized out>)
(gdb) step
sse2_binary_add_FLOAT (n=8192, ip2=0xfbcc70, ip1=0xfb0d40, op=0xfb0d40)
(gdb) cont

Thread 1 "python3.6" hit Breakpoint 3, run_binary_simd_add_FLOAT (dimensions=<optimized out>, steps=<optimized out>, args=<optimized out>)
(gdb) step
sse2_binary_add_FLOAT (n=4040, ip2=0xfbcc70, ip1=0xfb8d40, op=0xfb8d40)
(gdb) cont
>>> ^D
[Inferior 1 (process 1662057) exited normally]
(gdb) info breakpoints
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x00007ffff5dd4240 in ufunc_generic_call at numpy/core/src/umath/ufunc_object.c:4671
    breakpoint already hit 1 time
3       breakpoint     keep y   0x00007ffff5db1111 in sse2_binary_add_FLOAT at numpy/core/src/common/lowlevel_strided_loops.h:435
    breakpoint already hit 6 times
4       breakpoint     keep y   0x00007ffff5c47d40 in _contig_to_contig at numpy/core/src/multiarray/lowlevel_strided_loops.c.src:329
    breakpoint already hit 15004 times
5       breakpoint     keep y   <MULTIPLE>         
    breakpoint already hit 11250 times
(gdb) quit
zemanj commented 5 years ago

@kain88-de Maybe easier to follow than plain gdb output: [kcachegrind call graph of the numpy in-place add, generated with valgrind's callgrind tool; image not reproduced here]

Not sure why kcachegrind shows __memcpy_avx_unaligned_erms where gdb gave me __memmove_ssse3, but that's probably irrelevant. It's nice to see that FLOAT_add really takes only the expected 11% of the total runtime, and interesting that the memcopies "only" account for 29%. Most of the runtime (~58%) seems to be due to additional overhead within PyArray_TransferNDimToStrided (~46%) and _contig_to_contig (~12%).

kain88-de commented 5 years ago

note I don't have time for a longer answer until the weekend.

I didn't benchmark on my machine. I made assumptions based on my knowledge of CPUs. I'm impressed you got valgrind to work on MDAnalysis. I would be interested in how you do it and how you deal with false positives.

Valgrind and gdb do not show the full picture here. They cannot show cache misses! To see cache misses you can use perf, a standard tool from the Linux kernel. Cache misses are a pain because the CPU appears busy when they occur. My guess is the slower version will show many more cache misses.

zemanj commented 5 years ago

@kain88-de

I didn't benchmark on my machine. I made assumptions based on my knowledge of CPUs.

That's good to know, I thought you were absolutely certain about what you said. I'd appreciate it if you could phrase sentences like "No, the array is fetched a lot more [...]" or "[...] keep in mind that numpy will call a lapack function here." a bit more defensively so that it is clear that these statements are not necessarily true, but something you conjecture from experience. Don't get me wrong, I'm really happy that you share your ideas! Your comments just caused a lot of confusion on my side.

I'm impressed you got valgrind to work on mdanalysis.

I guess the problems with false positives are with the cachegrind and memcheck tools. I just used the callgrind tool to get the call graph, and that seems to work flawlessly. AFAIK there are many attempts but so far no universal fix to sort out false positives when using the memcheck or cachegrind tools in conjunction with Python. Would be really nice to have, though. Thanks for mentioning perf, it's been a long time since I used it and I totally forgot that it even existed... will give it a shot soon.

zemanj commented 5 years ago

@kain88-de I followed your suggestion and checked for cache misses with perf. Against my intuition, although there seems to be a little room for improvement, the slow-down is mainly caused by the time it takes to move data around within cache (the unnecessary repeated buffer fills). If you want to know the details, check out this post: https://github.com/numpy/numpy/issues/13307#issuecomment-483332502 (it's pretty boring, though).

kain88-de commented 5 years ago

Thanks for the check. So the buffer line to fill the L1/L2 cache with data is not fully utilized?


zemanj commented 5 years ago

So the buffer line to fill the L1/L2 cache with data is not fully utilized?

I don't quite get your question... do you mean the intermediate 32KiB buffer? That is fully utilized, but way more often than necessary. And it has the same size as the L1d cache, which I thought might be sub-optimal...

kain88-de commented 5 years ago

I do not buy the argument that the 32KiB buffer is fully utilized every time. There are a bunch more L1 dcache misses using C order (about 4 times as many). Still, I'm surprised that this makes a difference between C and Fortran order. But enough rambling about CPUs.

Anyway, your proposed change seems fine to me. You took good care benchmarking the impact and showed which functions are negatively affected. Since the effect is noticeable in analysis functions like RMSD, I'm tending against this change (a benchmark for the contact analysis or water bridges would be a great add-on). I'm not developing on a regular basis anymore, though, so just take this as advice.

In general, I like the change though. In most algorithms, we do a lot of selections for every frame. Did your RMSD benchmarks select the whole molecule or a sub-selection? If you only use a sub-selection the results might look different.

Another way forward would be to create a copy with Fortran order for the low-level functions we identify as slow. It would mean creating a copy, but we could potentially come out even performance-wise.