JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

Can I pass heavy-data to my Likelihood function across parallelized samplers? #36

Closed JHayoz closed 3 years ago

JHayoz commented 3 years ago

Hi all!

I'm running data-heavy exoplanet atmospheric retrievals with PyMultinest (and will soon update to UltraNest). For that purpose, I have a computation-heavy likelihood function which calculates a forward model using a huge database (used by petitRADTRANS, see Mollière A&A 627, A67 (2019)). I therefore need to parallelize the sampling process across, say, N cores (or equivalently N samplers, each on its own processor). At the moment I run my code with a command line like mpiexec -n N python retrieval.py, which unfortunately loads the huge database N times, resulting in gigantic RAM usage (scaling like ~N). I would like to load this database only once and share it across all N parallel cores.

As far as I understand, one possibility would be to pass this database as an argument (or kwarg) to the likelihood function (maybe by pickling it, like in emcee? see Foreman-Mackey 2013 arXiv:1202.3665v4, https://emcee.readthedocs.io/en/stable/tutorials/parallel/#parallel). Another possibility would be to work with multithreading instead of MPI parallelisation, and then have each thread refer to the same database (again by passing some argument to the likelihood function).

A bit more precisely: this huge database is implemented as a self-written Python class, which I would love to define once at the start of the retrieval and then pass to the likelihood function each time it is evaluated.
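To illustrate what I mean (a rough sketch only; the class and function names below are placeholders, not the real petitRADTRANS interface): bind the already-loaded database object to the likelihood once, so that every evaluation refers to the same in-memory object instead of re-loading it.

from functools import partial
import numpy as np

class OpacityDatabase:
    # stands in for my self-written class wrapping the huge opacity tables
    def __init__(self):
        self.table = np.random.rand(1000, 1000)  # placeholder for the expensive load

def loglike(params, database):
    # placeholder likelihood; the real one would compute a forward model from database
    return float(-0.5 * np.sum(np.asarray(params) ** 2))

database = OpacityDatabase()                          # loaded once, at the start
loglike_bound = partial(loglike, database=database)   # every call reuses the same object
print(loglike_bound(np.zeros(3)))                     # quick check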

Is either of those possibilities already supported by UltraNest, or planned to be added? Or is there another option that achieves the same thing (saving some RAM, but not at the cost of reading the data directly from a disk drive)?

I would be happy to talk directly to Johannes Buchner, or anyone else working on this project, to give further details.

Thank you for your answer!

JohannesBuchner commented 3 years ago

I presume you have already tried to reduce the memory footprint and del'd and garbage collected any temporary objects.

If memory is a limitation, then MPI parallelisation is not the right solution, because the processes are more or less siloed from each other.

What I would do is enable UltraNest's vectorization feature, write a likelihood & transform which can accept and process many points at once, and do the parallelisation there (for example, with threading or similar).

I'd still try to run as many processes as possible within the memory constraints (e.g., 2), because they can work independently. It can help to add swap space, because most of the database will probably be unused and can be dormant on the disk.

Another solution: just get a bigger machine.

JHayoz commented 3 years ago

Thanks for your answer! Yes, my jobs already use as little memory as they need, and the class I'm using can't be del'd since its initialization is very costly and it is called at every iteration.

Vectorization is unfortunately not really an option. The likelihood function is as complicated as it gets: the main part requires FORTRAN computations in the petitRADTRANS radiative transfer routine, after which I need complicated functions from different packages (e.g. spectRES, PyAstronomy). Modifying those packages to allow vectorization would cost as much as buying a new, bigger machine (from a salary point of view).

But the goal is actually to use an existing machine which has very little RAM but a LOT of CPUs, which could help a lot if all the processes could share the same database (i.e. if it could be accessed by reference from RAM rather than loaded separately on each parallel CPU). I don't think using swap space can help here, because the swap space on that machine is fixed and too small. Moreover, I would like to avoid having to read the database from the disk drive at each evaluation of the likelihood function, because that would be really slow, on top of requiring a major update of how petitRADTRANS loads and works with the database.

But you see what I'm conceptualising, right? The requirement would be to start one instance of an UltraNest sampler on one CPU, where I also load the big database. Then I would start a parallel pool party (using MPI or Schwimmbad or whatever) or start multiple threads at once (say 40), which would all evaluate the likelihood function, and receive the big database (which was loaded once into the RAM) by some sort of call by reference. See the (really awful) concept drawing in the pdf.

I really don't know how feasible that is, but it is at least thinkable, right? Is that possible with UltraNest at the moment, or could it be implemented?

Here's the awful drawing: retrieval_architecture.pdf

Cheers!

rhayes777 commented 3 years ago

We had a sort of related issue with other optimisers, where spawning processes to evaluate a likelihood function with a lot of data would take a long time. The issue is that all the data from the parent process gets copied to the child process; I imagine this would make the situation even worse in your case, as you'd have duplicate data filling RAM and you would have to wait for it to be copied over.

I don't think threads necessarily solve the issue, thanks to Python's Global Interpreter Lock? You would only have one copy of the data, but you probably wouldn't get any speed-up over performing evaluations in serial.

JohannesBuchner commented 3 years ago

@JHayoz: You don't need to vectorize all sub functions, what I was suggesting is that you start your threads with multiprocessing or whatever in the vectorized likelihood to farm out the multiple proposed points to workers. These each evaluate one point's likelihood (the likelihood you have now).

@rhayes777, child processes nominally "copy" the parent process's memory, but actually the same memory pages are referenced until they are written. So children do not actually need more memory for read-only data. At least this is true on Linux.
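Here is a minimal sketch of what that means in practice (Linux only; the database, likelihood and sizes are toy placeholders, not petitRADTRANS): the large array is loaded once in the parent, and forked workers read it from the shared copy-on-write pages without duplicating it.

import multiprocessing as mp
import numpy as np

DATABASE = None  # module-level; loaded once in the parent before the workers are forked

def load_database():
    # stands in for the expensive opacity-table load
    return np.random.rand(10000, 1000)

def loglike_one(params):
    # children read DATABASE from the parent's pages (copy-on-write), no duplication
    return float(-0.5 * np.sum((DATABASE[0, :len(params)] - np.asarray(params)) ** 2))

if __name__ == "__main__":
    DATABASE = load_database()
    ctx = mp.get_context("fork")        # be explicit: rely on fork, not spawn
    proposals = [np.random.rand(5) for _ in range(100)]
    with ctx.Pool(processes=4) as pool:
        logls = pool.map(loglike_one, proposals)
    print(len(logls), "likelihood evaluations done")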

rhayes777 commented 3 years ago

@JohannesBuchner That's interesting. I guess the specific issue we were dealing with was that the likelihood function and data were being mapped to existing processes. I think in that case the data was being pickled and passed via a queue to the child process?

JohannesBuchner commented 3 years ago

Maybe if you use complicated frameworks -- I was only talking about the fork(2) primitive, see https://stackoverflow.com/questions/32130868/does-fork-duplicate-all-the-memory-of-the-parent

JHayoz commented 3 years ago

> @JHayoz: You don't need to vectorize all sub functions, what I was suggesting is that you start your threads with multiprocessing or whatever in the vectorized likelihood to farm out the multiple proposed points to workers. These each evaluate one point's likelihood (the likelihood you have now).

So that would solve the RAM issue, right? But then the child processes would only be working inside the sub-routines of the likelihood function that are vectorized, is that so? Assuming that only a fraction p of the likelihood function's work can be vectorized, the workers would all sit idle during the un-vectorizable fraction (1 - p) of the work. Further assuming that p = 50%, which is the highest fraction of those routines I could maybe vectorize, my computations would basically need 50% more time with the same number of threads used this way than with parallel cores (assuming I had enough RAM for those). Is that a fair assessment, or did I misunderstand the idea?
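(To spell out the arithmetic I'm worried about: this is essentially Amdahl's law, under my assumption that parallelism only happens inside the vectorized fraction p of a single likelihood call.)

$$ S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N\to\infty} S(N) = \frac{1}{1 - p} = 2 \quad \text{for } p = 0.5. $$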

JohannesBuchner commented 3 years ago

Here is the idea:

from joblib import delayed, Parallel
import numpy as np
from ultranest import ReactiveNestedSampler

with Parallel(n_jobs=-1) as parallel:
    def new_likelihood(points):
        # farm the proposed points out to the joblib workers, one old-style call each
        likelihoods = parallel(delayed(old_likelihood)(p) for p in points)
        return np.asarray(list(likelihoods))

    # instead of sampler = ReactiveNestedSampler(old_likelihood, ...)
    sampler = ReactiveNestedSampler(new_likelihood, ..., vectorized=True)
    ...

You'll typically receive hundreds of points at once (this can be controlled with ndraw_min, ndraw_max), so you should not have a problem keeping your threads/processes/... busy. You'll need to choose a sensible backend at runtime (probably threading, given your constraints).
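For example, a sketch of the same idea with the backend pinned explicitly (the n_jobs value and old_likelihood below are placeholders; backend="threading" is standard joblib, and threads only pay off if the expensive parts of your likelihood release the GIL, which compiled extensions often do):

import numpy as np
from joblib import Parallel, delayed

def old_likelihood(params):
    # placeholder for the existing single-point likelihood
    return float(-0.5 * np.sum(np.asarray(params) ** 2))

with Parallel(n_jobs=40, backend="threading") as parallel:
    def new_likelihood(points):
        # points: one row per proposed point; each row goes to one worker thread
        return np.asarray(parallel(delayed(old_likelihood)(p) for p in points))

    print(new_likelihood(np.random.rand(8, 3)))   # quick smoke test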

No need to rewrite legacy code.

docs: https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html https://joblib.readthedocs.io/en/latest/parallel.html

JHayoz commented 3 years ago

Thanks a lot! This looks really good, and is an easy fix compared to trying to vectorize each sub-routine of my likelihood function. I will give it a good try and close the issue later if it works.

JohannesBuchner commented 3 years ago

Please reopen if this is still an issue.