MDAnalysis / mdanalysis

MDAnalysis is a Python library to analyze molecular dynamics simulations.
https://mdanalysis.org

Limit number of threads/cores in parallel calculations #2975

Open CharlyEmpereurmot opened 4 years ago

CharlyEmpereurmot commented 4 years ago

Hello all,

I think it would be awesome to be able to limit the number of threads/cores used in a number of different function calls. For example, while calculating bonds I would like to be able to do this:

# pos_1_vec & pos_2_vec are the arrays of positions to calculate bonds between
mda_backend = 'OpenMP'
my_bonds_vec = mda.lib.distances.calc_bonds(pos_1_vec, pos_2_vec, backend=mda_backend, box=None, nb_threads=6)

At the moment, using mda_backend = 'OpenMP' uses all threads of the machine, while mda_backend = 'serial' uses a single thread. This can be annoying when executing code on clusters, for example, or when using MDAnalysis to build user-friendly tools.

It would be nice if calc_bonds, calc_angles and calc_dihedrals could have an argument nb_threads, and even better if all parallelized functions could take such an argument alongside backend.

Please correct me if I'm missing something, but I believe at the moment it's not possible to easily limit the number of threads from within the Python code.

IAlibay commented 4 years ago

Thanks for raising this issue @CharlyEmpereurmot, it's a rather interesting one, and I agree it really should have an easier solution.

Currently, I guess the easiest way to do this would be to export OMP_NUM_THREADS prior to calling Python, or to do os.environ["OMP_NUM_THREADS"] = "cores" from within the script. However, that approach is rather limiting. It should be easy enough to pass a variable down to the C routines to set omp_set_num_threads() on the fly.
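
For illustration, a minimal sketch of that workaround (the thread count is arbitrary; the variable has to be set before the OpenMP runtime is first initialised, i.e. before the threaded libraries are imported):

import os
os.environ["OMP_NUM_THREADS"] = "6"  # must be a string, and set before the imports below

import numpy as np
import MDAnalysis as mda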

orbeckst commented 4 years ago

There's some discussion in PR #2950 – please make your voice heard!

richardjgowers commented 4 years ago

Honestly, I don't think the OpenMP backend is that good; last I checked it wasn't close to twice as fast with two cores. I think this is simply because there's a lot of extra work outside of that region that doesn't get parallelised. I'd sooner deprecate and remove the whole idea, and instead invest more into pmda-like ideas, than add more features to "backend".

orbeckst commented 3 years ago

I am sure that the OpenMP code could be improved. But I think the real problem here is that pretty much all OpenMP code (in MDA and in numpy – see #2950) slows down when OpenMP uses more threads than physical cores. From my initial tests with

import threadpoolctl
import numpy
threadpoolctl.threadpool_info()

on every machine, num_threads is set to the number of physical cores plus hyperthreads. That's just stupid for the performance of numerical code.

A sensible start would be to limit OpenMP threads as soon as you do serious work.

For MDAnalysis we should include max_threads keywords where we offer explicit or implicit parallelism (the latter is not always that obvious, see discussion in #2950).
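
In the meantime, a rough sketch of what can already be done from user code, assuming threadpoolctl also picks up the OpenMP runtime that our compiled extensions link against (the arrays and the thread count below are placeholders):

import numpy as np
from MDAnalysis.lib import distances
from threadpoolctl import threadpool_limits

# placeholder coordinate arrays, shape (N, 3)
pos1 = np.random.random((10000, 3)).astype(np.float32)
pos2 = np.random.random((10000, 3)).astype(np.float32)

# cap the OpenMP thread pool at 4 threads, but only inside this block
with threadpool_limits(limits=4, user_api="openmp"):
    bonds = distances.calc_bonds(pos1, pos2, box=None, backend="OpenMP")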

orbeckst commented 3 years ago

@tylerjereddy is there a list of numpy functions that use OpenMP?

tylerjereddy commented 3 years ago

@orbeckst I imagine it would depend on the linear algebra backend in use (usually OpenBLAS for wheels; sometimes MKL from conda), but mostly linear algebra functions I think.

The only reference to OMP_NUM_THREADS I see is in doc/source/reference/global_state.rst in a section called Number of Threads used for Linear Algebra:

NumPy itself is normally intentionally limited to a single thread during function calls, however it does support multiple Python threads running at the same time. Note that for performant linear algebra NumPy uses a BLAS backend such as OpenBLAS or MKL, which may use multiple threads that may be controlled by environment variables such as OMP_NUM_THREADS depending on what is used. One way to control the number of threads is the package threadpoolctl (https://pypi.org/project/threadpoolctl/).

yuxuanzhuang commented 3 years ago

Hi! Related to this, but more on the NumPy side, I recently encountered a weird issue with numpy.dot (which uses external libraries, e.g. OpenBLAS). With OpenBLAS's default of num_threads = physical cores + hyperthreads, the performance is worse than with a single thread (https://gist.github.com/yuxuanzhuang/82e1e7b57d0cda80ac964d1cd138f618). A real-MDA-life case would be an analysis with an on-the-fly transformation. So the possible solutions I can think of now are a) limiting all MDAnalysis code to use only num_physical_cores threads, or b) using numpy built with MKL, which doesn't use hyperthreads by default. (I am not sure if it violates any MDA license?)

As for limiting thread numbers, threadpoolctl can change the thread limit at runtime, in a context-manager fashion. I have tried to use ContextDecorator for the conversion, but it doesn't seem to work as expected (https://gist.github.com/yuxuanzhuang/05b0da16a51f567e54f7f3f22591e316).
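
For reference, the plain context-manager form does work as expected, e.g. (array sizes are placeholders, not the ones from the gist):

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.random((2000, 2000))
b = np.random.random((2000, 2000))

# the BLAS thread pool is capped at a single thread only inside this block
with threadpool_limits(limits=1, user_api="blas"):
    c = np.dot(a, b)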

Side note: there are reports saying MKL performs worse on AMD CPUs, but in my latest test (2020 version) it seems to perform as well as OpenBLAS.

orbeckst commented 3 years ago

Related to this, but more on the NumPy side, I recently encountered a weird issue with numpy.dot (which uses external libraries, e.g. OpenBLAS). With OpenBLAS's default of num_threads = physical cores + hyperthreads, the performance is worse than with a single thread (https://gist.github.com/yuxuanzhuang/82e1e7b57d0cda80ac964d1cd138f618). A real-MDA-life case would be an analysis with an on-the-fly transformation.

My understanding of your gist was that np.dot() influences the performance of the code that comes after it. If OpenMP threads are oversubscribed then just normal Python code in the following line is slowed down. — Correct me if I am wrong.

orbeckst commented 3 years ago

b) using numpy built with MKL, which doesn't use hyperthreads by default. (I am not sure if it violates any MDA license?)

mkl is included in conda so using it is not a problem. However, installation and dependencies are difficult enough that the less we prescribe, the better. I would not want to say "you can only use MDA if you use a numpy that is linked against MKL". Besides, if this is not an MDA problem then we shouldn't have to bend over backwards and inconvenience our users. Rather, we should try to have the issue fixed upstream of us.

As a short term solution, we could test if OpenMP threads are set to a low performance setting and warn users. (I did find out that most of my students routinely set the OMP_NUM_THREADS environment variable to 1 before they do any serious work... they were not surprised when I told them about this issue, it just never filtered up to me.)

Does it make sense to add threadpool limitation to specific pieces of MDAnalysis where we suspect that we can get into performance issues? Or does this just make the code more complicated??

It would be useful to hear what different people (users, developers) think about this issue.

IAlibay commented 3 years ago

However, installation and dependencies are difficult enough that the less we prescribe, the better. I would not want to say "you can only use MDA if you use a numpy that is linked against MKL".

Agreed, plus this would immediately kill off arm64 & POWER support (neither of which, I believe, MKL supports, since it is x86-specific). There's also a lot of talk about MKL not being so well optimized on AMD chips...

As a short term solution, we could test if OpenMP threads are set to a low performance setting and warn users.

Going to be a little bit controversial and say that, personally, I'd be wary of including a warning here. I feel like a user warning that gets triggered all the time on most workstations is just another thing that will push users away from properly reading warnings. Essentially, I think MDA warnings should primarily be reserved for assumptions and behaviours in MDA that could lead to erroneous results, and whilst poor performance is annoying, it doesn't necessarily fit in that category?

That being said, I'm somewhat curious as to how we'd implement such a warning; with psutil, I guess?
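
Something along these lines, perhaps (purely a sketch; warn_if_oversubscribed is a made-up name, and psutil would become an extra dependency):

import warnings

import psutil
from threadpoolctl import threadpool_info

def warn_if_oversubscribed():
    # compare each detected thread pool against the number of physical cores
    physical = psutil.cpu_count(logical=False) or psutil.cpu_count()
    for pool in threadpool_info():
        if pool.get("num_threads", 0) > physical:
            warnings.warn(
                f"{pool.get('internal_api', 'a threaded library')} is set to "
                f"{pool['num_threads']} threads but only {physical} physical "
                "cores are available; numerical performance may suffer."
            )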

Does it make sense to add threadpool limitation to specific pieces of MDAnalysis where we suspect that we can get into performance issues?

My vote would be more for this. My understanding of @yuxuanzhuang's benchmarking is that transformation code is always faster when executed serially? In that case, using a context manager for all of that would make sense (note: we probably should test this on a low-clock-rate CPU and see if the benchmarks hold up; I'm currently running a 1.8 GHz boost-disabled mobile chip, so I can probably try it out this weekend if needed).

most of my students routinely set the OMP_NUM_THREADS environment variable to 1 before they do any serious work

More related to @CharlyEmpereurmot's original post here, setting OMP_NUM_THREADS to the number of cores per task is usually the recommended way of running any numpy-centric code on clusters, especially if you end up sharing nodes. Although, having done my fair share of sysadmin work, I realise users don't always follow this. Given @richardjgowers' distopia code and the poor performance of the existing OpenMP C code, I wouldn't be against just getting rid of the latter... but it won't fix the numpy side of things.

richardjgowers commented 3 years ago

@orbeckst mkl is only in anaconda's channel (the default channel), not conda-forge, so most people installing via this route don't have mkl. I also don't think that trying to get specific about a numpy backend will end well.