In rp-bp (pbiotools), we use joblib for simple parallelism in many places; most problems are embarrassingly parallel, except maybe this one. Internally, we call joblib.Parallel(...). All wrappers live under pbiotools/misc/parallel.py. There is also a draft module, pbiotools/misc/dask_utils.py, but it is not currently used (and may not be up to date).
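As a minimal sketch of the pattern those wrappers encapsulate (the function and variable names below are illustrative, not the actual helpers in pbiotools/misc/parallel.py):

```python
# Illustrative embarrassingly parallel pattern around joblib; the worker
# function and inputs are placeholders, not code from the package.
from joblib import Parallel, delayed

def process_region(region):
    # placeholder for per-item work (counting, filtering, fitting, etc.)
    return len(region)

regions = ["chr1:1-1000", "chr2:500-2000", "chr3:10-300"]

# n_jobs=-1 uses all available cores; the default "loky" backend runs
# each call in a separate process.
results = Parallel(n_jobs=-1)(delayed(process_region)(r) for r in regions)
```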
In addition, many scripts can be submitted (possibly with dependencies) to SLURM, using slurm.check_sbatch(...). Option handling, logging, etc. are handled in pbiotools/misc/slurm.py.
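The sketch below only illustrates the underlying idea of building an sbatch call with a dependency; the helper name and options here are hypothetical, and the real option handling lives in pbiotools/misc/slurm.py (slurm.check_sbatch(...), whose exact signature is not reproduced here).

```python
# Hypothetical sketch of what a dependent sbatch submission boils down to.
import subprocess

def submit_to_slurm(cmd, dependencies=None, mem="8G", time="08:00:00"):
    """Submit cmd via sbatch, optionally waiting on other job ids."""
    sbatch = ["sbatch", f"--mem={mem}", f"--time={time}"]
    if dependencies:
        sbatch.append("--dependency=afterok:" + ":".join(dependencies))
    sbatch += ["--wrap", cmd]
    out = subprocess.run(sbatch, capture_output=True, text=True, check=True)
    # sbatch prints "Submitted batch job <id>"; return the job id
    return out.stdout.strip().split()[-1]

job_a = submit_to_slurm("python step_one.py")
job_b = submit_to_slurm("python step_two.py", dependencies=[job_a])
```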
There are obviously many open questions (at least I have some):
Do we keep joblib, i.e. does Dask replace it? I've seen examples where both are used, and we can connect joblib to the Dask backend. In general, our data fits into RAM as pandas data frames, but we might still benefit from distributed computing with standard pandas operations, etc. I have not really used Dask; ideally, we would benchmark any improvements.
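One option that keeps the existing joblib wrappers is to route them through a Dask cluster via joblib's Dask backend. A minimal sketch (untested here; the local Client and the toy function are assumptions for illustration):

```python
# Route existing joblib.Parallel calls through a Dask scheduler.
import joblib
from dask.distributed import Client

client = Client()  # local cluster by default; could point at a remote scheduler

def slow_square(x):
    return x * x

with joblib.parallel_backend("dask"):
    # Existing Parallel(...) calls pick up the Dask backend unchanged.
    results = joblib.Parallel(verbose=5)(
        joblib.delayed(slow_square)(i) for i in range(100)
    )
```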
For the parallel Stan calls, and in particular this long-running issue, there are many things to consider. We currently use multiprocessing.RawArray, which does not come with a lock, together with global variables (I don't even know whether joblib's shared-memory semantics and threading were ever tried, or whether that would make sense). But now that we use CmdStanPy, things might be different. CmdStan has its own mechanism for multi-chain parallelization, which uses the Intel TBB scheduler for threading. CmdStanPy and Dask have different designs that might not work well together, and I'm not sure how this currently plays out with joblib. Maybe we can have a look at this?
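For reference, this is roughly what letting CmdStan handle its own parallelism looks like through CmdStanPy (a sketch only: the Stan file, data, and thread counts are placeholders, not values from rp-bp):

```python
# Let CmdStan run chains in parallel and use TBB threads within each chain,
# instead of fanning out chains from Python.
from cmdstanpy import CmdStanModel

# STAN_THREADS enables TBB-based threading inside the compiled model.
model = CmdStanModel(
    stan_file="model.stan",  # placeholder model
    cpp_options={"STAN_THREADS": True},
)

fit = model.sample(
    data={"N": 100, "y": [0] * 100},  # placeholder data
    chains=4,             # number of MCMC chains
    parallel_chains=4,    # run all chains concurrently
    threads_per_chain=2,  # TBB threads available to each chain
)
```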
Using Dask-Jobqueue seems interesting, although I have never used it. I guess we would need some configuration, and it would be more involved if we want to generalize to different clusters (we could probably stick to SLURM). We need to make sure it provides at least as much functionality as our slurm module, and ideally benchmark both.
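A sketch of what such a setup might look like; the cluster parameters (partition name, memory, walltime) are site-specific assumptions, not values taken from any existing configuration:

```python
# Spin up Dask workers as SLURM jobs and submit work to them.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="general",      # SLURM partition (site-specific)
    cores=8,              # cores per worker job
    memory="32GB",        # memory per worker job
    walltime="08:00:00",
)
cluster.scale(jobs=4)     # submit 4 worker jobs via sbatch

client = Client(cluster)
futures = client.map(lambda x: x * x, range(100))
print(sum(client.gather(futures)))
```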
Addressing these issues would require significant changes, and so far we don't seem to have any particular problem with the CmdStanPy interface. Setting as low priority.