MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Expose OpenMP backends to more analysis methods #3435

Open scal444 opened 2 years ago

scal444 commented 2 years ago

Is your feature request related to a problem?

Some analysis tools rely on underlying libraries that have both OpenMP and serial implementations, but only ever allow the serial implementation to run. InterRDF is a good example: in its main loop, the distance calculation is invoked without exposing the backend choice, so only the serial implementation ever runs.
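To illustrate the gap (a minimal sketch paraphrasing the pattern, not the exact InterRDF source): the per-frame distance call never forwards the `backend` keyword that `MDAnalysis.lib.distances.distance_array` already accepts, so the existing OpenMP path is unreachable.

```python
import MDAnalysis as mda
from MDAnalysis.lib import distances
from MDAnalysisTests.datafiles import TPR, XTC  # small test system as a stand-in

u = mda.Universe(TPR, XTC)
g1 = u.select_atoms("protein and name CA")
g2 = u.select_atoms("protein and name CA")

# What the analysis code effectively does today: backend="serial" is implied.
d_serial = distances.distance_array(g1.positions, g2.positions,
                                    box=u.dimensions)

# The OpenMP implementation is already one keyword away:
d_openmp = distances.distance_array(g1.positions, g2.positions,
                                    box=u.dimensions, backend="OpenMP")
```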

Describe the solution you'd like

Allow users to accelerate RDF and other routines with existing parallel implementations. A demo implementation (not ready for submission) can be found on my fork here. Some local benchmarks on my Ryzen 5:

[Figure: perf_comparison benchmark plot comparing serial and OpenMP backend timings across benchmark sizes]

I tuned the brute-force thread count with the OMP_NUM_THREADS environment variable while running asv. Isolating a benchmark with 2000 atoms, we get linear scaling of performance per OpenMP thread. Also of interest is the very poor scaling of the nsgrid implementation, but that's another issue (and in smaller benchmarks, nsgrid outperforms brute force for shorter cutoffs).
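For reference, a hypothetical micro-benchmark along these lines (my own sketch, not the actual asv suite); running it under different OMP_NUM_THREADS settings probes the thread scaling described above:

```python
import timeit
import numpy as np
from MDAnalysis.lib import distances

# 2000 random positions in a 100 Å cubic box, mirroring the benchmark size above.
rng = np.random.default_rng(42)
coords = (rng.random((2000, 3)) * 100.0).astype(np.float32)
box = np.array([100.0, 100.0, 100.0, 90.0, 90.0, 90.0], dtype=np.float32)

for backend in ("serial", "OpenMP"):
    t = timeit.timeit(lambda: distances.distance_array(coords, coords,
                                                       box=box, backend=backend),
                      number=10)
    print(f"{backend}: {t / 10:.4f} s per call")
```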

Describe alternatives you've considered

There are two questions here:

  1. How should the ability to choose a backend be exposed to users?

This could be done test-by-test (see hacky example here), but it may be worth looking into something more standard.

  2. Should MDAnalysis try to dispatch to OpenMP by default if it exists?

OpenMP support is easily detected, and IIRC other MDAnalysis dependencies like NumPy already implement transparent multithreading for some routines.
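A minimal sketch of such default dispatch, assuming a build-time flag like the USED_OPENMP attribute in lib.distances (the exact name may differ by version, hence the guarded lookup):

```python
import numpy as np
from MDAnalysis.lib import distances

# Guarded lookup: fall back to serial if the flag is absent in this version.
HAS_OPENMP = bool(getattr(distances, "USED_OPENMP", False))
default_backend = "OpenMP" if HAS_OPENMP else "serial"

a = np.random.random((1000, 3)).astype(np.float32)
b = np.random.random((1000, 3)).astype(np.float32)
d = distances.distance_array(a, b, backend=default_backend)
print(f"dispatched to {default_backend}")
```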

Additional context

Looking for feedback / input.

orbeckst commented 2 years ago

Coming up with a consistent way to enable OpenMP acceleration would be good.

Note that we came across situations where the NumPy OpenMP behavior decreased performance (in the context of the on-the-fly transformations, IIRC).

IAlibay commented 2 years ago

Generally I think the OpenMP backends need reviewing (I don't even think they get properly tested in CI).

I guess the main question I have here is: would weak scaling be a better goal for future development? Breaking things down to one thread per frame seems more likely to get us good scaling than multiple threads per frame.

I can see cases where allowing for both options could be useful though (hbond analysis maybe?)

orbeckst commented 2 years ago

I am adding a few related issues for context

orbeckst commented 2 years ago

> I guess the main question I have here is: would weak scaling be a better goal for future development? Breaking things down to one thread per frame seems more likely to get us good scaling than multiple threads per frame.

My experience is that there are limits to how well you can make "split-apply-combine" parallelization work. It's often better if you can get multiple nodes involved (on a parallel file system), and then it's quite useful if you can make use of the cores on the node. Being able to do some heterogeneous parallelization is not a bad thing, in my opinion. Furthermore, even "normal" operations such as distance-based selections will benefit on modern multicore machines (essentially, "for free").

> I can see cases where allowing for both options could be useful though (hbond analysis maybe?)

I think OpenMP-based acceleration (and GPU acceleration) has a place in MDAnalysis. Per-frame analysis is harder to accommodate in a seamless manner, as we have seen with PMDA. In an ideal world, our Analysis classes are automagically parallel, but we're not there yet. For the time being, using multiprocessing, dask, or MPI along the lines of the User Guide (parallelizing analysis) and the PRACE Workshop, Day 3, Session 1 (pdf) / Practical: Parallelism is probably the easiest.
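A minimal sketch of that split-apply-combine pattern with multiprocessing, using the MDAnalysisTests example trajectory as a stand-in for real data:

```python
import multiprocessing as mp
import numpy as np
import MDAnalysis as mda
from MDAnalysisTests.datafiles import TPR, XTC  # stand-in trajectory

def block_rgyr(block):
    """Apply: radius of gyration over one contiguous block of frames."""
    start, stop = block
    u = mda.Universe(TPR, XTC)  # each worker opens its own Universe
    protein = u.select_atoms("protein")
    return [protein.radius_of_gyration() for _ in u.trajectory[start:stop]]

if __name__ == "__main__":
    n_frames = len(mda.Universe(TPR, XTC).trajectory)
    n_workers = 4
    # Split: contiguous frame blocks, one per worker.
    edges = np.linspace(0, n_frames, n_workers + 1, dtype=int)
    blocks = list(zip(edges[:-1], edges[1:]))
    with mp.Pool(n_workers) as pool:
        results = pool.map(block_rgyr, blocks)
    # Combine: concatenate per-block results in frame order.
    rgyr = np.concatenate(results)
    print(rgyr)
```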

orbeckst commented 2 years ago

Do you have opinions on the distance calculations, @richardjgowers @hmacdope and parallelization?

hmacdope commented 2 years ago

First up thanks for having a look into this, improving performance is something we are really looking into at the moment. :)

I will also say that we are developing an intrinsics-based, explicitly SIMD-vectorised package for calculating distances (https://github.com/MDAnalysis/distopia), which we hope may eventually replace some (most) of the hot distance code. Any additional input is most welcome, and we would love additional people to contribute. We may also expand to CUDA and/or SYCL, time permitting, for some tasty heterogeneous parallelisation.

Combining a "split-apply-combine" approach with SIMD intrinsics would require a different division of labour, as each thread will need `SIMD_WIDTH / sizeof(Type)` contiguous memory locations to work from, or `SIMD_WIDTH / sizeof(Type)` indices. I am sure this is doable, but it may limit parallel efficiency as the SIMD width increases, since each thread must receive data in blocks of, for example, 16 floats if using AVX512 (although this may reduce false sharing, I think).
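As a toy illustration of that blocking arithmetic (my own sketch, not distopia's actual scheme): per-thread block boundaries are rounded up to multiples of the lane count, so no vector load straddles two threads' data.

```python
def simd_aligned_blocks(n_atoms, n_threads, simd_width_bytes=64, dtype_size=4):
    """Split [0, n_atoms) into per-thread blocks whose boundaries fall on
    multiples of the SIMD lane count (a 64-byte AVX512 register holds
    16 float32 lanes), leaving any ragged tail in the final block."""
    lanes = simd_width_bytes // dtype_size        # 16 for AVX512 + float32
    per_thread = -(-n_atoms // n_threads)         # ceil division
    blocks, start = [], 0
    for _ in range(n_threads):
        stop = min(n_atoms, start + -(-per_thread // lanes) * lanes)
        if start < stop:
            blocks.append((start, stop))
        start = stop
    return blocks

# e.g. 1000 float32 coordinates over 4 threads with AVX512:
# [(0, 256), (256, 512), (512, 768), (768, 1000)]
print(simd_aligned_blocks(1000, 4))
```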

For this reason, I think parallelising across the frames axis, i.e. one thread per frame, is the way to move forward, but my parallel-code experience is limited. I'm also unsure how this interacts with things like multiprocessing and dask; @orbeckst will know much more.

I do think that leveraging OpenMP as much as possible is still a really good idea and a worthwhile goal moving forward, as there are so many analyses that can benefit. 👍

scal444 commented 2 years ago

Thanks for all the responses!

> Generally I think the OpenMP backends need reviewing (I don't even think they get properly tested in CI).

Acknowledged. The few that I've been playing around with are covered, but I understand that if this work goes ahead there may be some groundwork/cleanup/test expansion to do first.

> I will also say that we are developing an intrinsics-based, explicitly SIMD-vectorised package for calculating distances (https://github.com/MDAnalysis/distopia), which we hope may eventually replace some (most) of the hot distance code. Any additional input is most welcome, and we would love additional people to contribute. We may also expand to CUDA and/or SYCL, time permitting, for some tasty heterogeneous parallelisation.

Awesome! I'm interested, will take a look and see if I can help out. I'd be very interested in GPU implementations, that's another side project I was looking into for the main codebase anyway.

> Furthermore, even "normal" operations such as distance-based selections will benefit on modern multicore machines (essentially, "for free").

Right, one of the big benefits of OpenMP is that it can really help local workloads while scaling reasonably to the HPC level, hopefully transparently to the user. Another use case for in-frame parallelization is analyses where frames aren't independent of each other, such as mean squared displacement.


It seems overall there are several overlapping endeavors here: multiprocessing, multithreading, and SIMD (with GPUs hovering in the background). These can all happily coexist if planned together, but can clash if not, so I'm not actually sure what to take away from all of this (useful) information. The project could maybe benefit from some centralized structures defining the parallelism scheme. I'm imagining each analysis tool could configure these settings and choose from one or more compatible offload/parallelization schemes, overridable by users.

hmacdope commented 2 years ago

I agree that we could do with formalising where each level of the parallelism hierarchy fits into future plans. Would people be amenable to this, @MDAnalysis/coredevs? Perhaps something like this already exists and I'm not aware of it.

richardjgowers commented 2 years ago

Thanks for looking into this, I think maybe the benchmark is a little small (at 2k atoms?). We really need to be designing around large problem sizes, where smart algorithms (here nsgrid) are required.

That said...

In terms of backend selection, I think this probably belongs as a trait of the MDAnalysis package, something like how you can tell matplotlib which backend to use, rather than every single function call taking kwargs.
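A minimal sketch of such a package-level switch, in the spirit of matplotlib.use("Agg"); none of these names exist in MDAnalysis today, they are purely illustrative:

```python
# Hypothetical module, e.g. MDAnalysis/backends.py (illustrative only).
_VALID_BACKENDS = ("serial", "OpenMP")
_backend = "serial"

def use(backend: str) -> None:
    """Set the process-wide compute-backend preference."""
    global _backend
    if backend not in _VALID_BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; "
                         f"choose from {_VALID_BACKENDS}")
    _backend = backend

def get_backend() -> str:
    """Return the current preference for library code to consult."""
    return _backend

# Usage: a user sets it once, and distance routines consult get_backend()
# instead of each call taking a backend kwarg.
use("OpenMP")
```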

scal444 commented 2 years ago

> Thanks for looking into this, I think maybe the benchmark is a little small (at 2k atoms?). We really need to be designing around large problem sizes, where smart algorithms (here nsgrid) are required.

Agreed, but nsgrid was crashing for me at higher atom counts, including the highest count in the current benchmark (10k). That's also something I was planning to look into, but I haven't filed a bug yet.

> In terms of backend selection, I think this probably belongs as a trait of the MDAnalysis package, something like how you can tell matplotlib which backend to use, rather than every single function call taking kwargs.

That can work, but we'd need to carefully define and document the way analysis tools interact with that trait. It's infeasible to have every tool implemented for every backend, so if the trait says "GPU", a serial-only tool either needs to use its serial implementation or throw a useful error message. So at some level, tools or libraries have to state their capabilities.
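A hypothetical sketch of that capability declaration (no such MDAnalysis API exists): tools state what they support, and a resolver degrades gracefully with a useful message instead of failing inside a serial-only analysis.

```python
import warnings

class BackendAware:
    """Hypothetical mixin: analyses declare their supported backends."""
    supported_backends = ("serial",)  # serial-only unless overridden

    def resolve_backend(self, requested: str) -> str:
        if requested in self.supported_backends:
            return requested
        # Fall back rather than fail, but tell the user why.
        warnings.warn(f"{type(self).__name__} supports "
                      f"{self.supported_backends}, not {requested!r}; "
                      f"falling back to 'serial'")
        return "serial"

class SerialOnlyAnalysis(BackendAware):
    pass  # inherits the serial-only default

print(SerialOnlyAnalysis().resolve_backend("GPU"))  # warns, returns 'serial'
```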

scal444 commented 2 years ago

Another question is: what's the team's priority right now? I started down the OpenMP improvement route, but I'd be happy to work on a mechanism that has momentum. Is there a roadmap for integrating the SIMD libraries and/or implementing per-frame parallelization internally?

richardjgowers commented 2 years ago

Honestly, the crash at 10k seems like quite a high priority to me :)

The development of the SIMD code is happening in the "distopia" repo as it is very experimental. Once it replicates the contents of lib.distances, you could theoretically slide it (or any other backend...) under most analysis (and core) functions that use lib.distances. Something like a BLAS for distance calculations.

richardjgowers commented 2 years ago

Oh, and pmda is where at least one direction of per-frame parallelism development is happening; that's @orbeckst's initiative.

scal444 commented 2 years ago

It looks like the crash already has a bug filed (assuming the same underlying issue) at #3183. I found the crashing line and can try to figure out what's going on; seems like a good first issue.

I did see pmda, but it didn't look like there had been any active development in the last year or so, and I didn't know whether it's still an active project.