
Smart parallelization / MPI communication overhead #9

Open fzeiser opened 3 years ago

fzeiser commented 3 years ago

@anderkve: Do you have any idea about the overhead of MPI for DE and MultiNest? I was trying to run the computations on Fram using the MPI functionality and several cores. I did not get the speedup that I naively expected (the calculations never finished, so I cannot say what the speedup/slowdown would be). However, I wondered how the points are actually distributed and what the MPI communication overhead is.

If I did the math correctly, the current calculations take about 100 ms per parameter point.

fzeiser commented 3 years ago

It might of course be that I changed several variables at a time on Fram and the calculations do get a nice speedup through MPI -- at least up to a certain level! (At some point, when the calculation per point is fast enough, the MPI latency is expected to kick our ass.)

-- So it might be that I made a mistake and should just try to resubmit / test the submission with two different settings...

anderkve commented 3 years ago

> I would assume that the live points are distributed, and the results need to be gathered before the next "round"

Yes, of course with different definitions of "round" for DE and MultiNest. Typically each MPI process gets multiple points to work through before it has to communicate results back.

Experience shows that DE should be able to scale much better than MN. (Good scaling for MN typically breaks down around O(100) MPI processes, I think -- possibly well before that.) But when calculations are very quick, MPI communication can indeed become a bottleneck. In general I'd expect that with more MPI processes you should be able to use more live points with reasonable scaling, while asking for higher convergence precision (effectively asking for more "rounds") would still increase computation time.
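
Just to be concrete about which knobs I mean: something along these lines in the Scanner section of the YAML file. This is only a rough sketch -- the scanner block names, the LogLike purpose and the numbers are illustrative, so adapt them to whatever your run file already has:

  Scanner:
    use_scanner: de
    scanners:
      de:
        plugin: diver
        like: LogLike
        NP: 19200          # population size per generation; more points per "round" scales better over many MPI processes
        convthresh: 1e-5   # convergence threshold; tighter means more generations, i.e. a longer run
      multinest:
        plugin: multinest
        like: LogLike
        nlive: 4000        # number of live points
        tol: 0.5           # evidence tolerance; smaller means more iterations before convergence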

If many points are invalidated for some reason (meaning that a new point has to be sampled within the same "round"), then another bottleneck appears: towards the end of one round, all other MPI processes can end up waiting a significant time for the last MPI process to get through its designated number of valid points. Though, I don't think we end up invalidating that many points?

> what the communication overhead is (both for same node / different nodes) -> find the optimal settings: If the likelihood calculation takes very little time, the overhead might be dominant, thus slowing down the computations.

This I don't really know, especially since it varies significantly from supercomputer to supercomputer. I assume none of our computations use threading (OpenMP)? So we should at least make sure to run with OMP_NUM_THREADS=1, so that all the cores we ask for are actually used as MPI processes.

One important parameter when dealing with fast computations is the buffer for the file writing. If you are using the hdf5 printer, you can adjust the buffer_length option in the YAML file. This controls how many valid points each MPI process saves in memory before passing them to MPI process 0 for printing to file. Since with the hdf5 printer only process 0 writes to file, this can become the bottleneck if the batches of points to write arrive too frequently. I think the default buffer_length is 1000 -- with ~100 ms per point that buffer fills roughly every 100 seconds per process, so with many processes the write requests to process 0 add up quickly. You could try increasing it, e.g.

  Printer:
    printer: hdf5
    options:
      # keep your other printer options (output_file etc.) as they are
      buffer_length: 10000

You could check the timestamps for the line "Total lnL: " in some default.log_X files, to get a feel for how long a single MPI process takes to get through some small number of points. That will give you some clue as to how frequently each process needs to pass its results in for file writing. Also, if you look at the timestamps for many such lines and notice some long gaps, it likely means that you have encountered a situation where many MPI processes need to wait for the last one to finish its set of points. (Unless we actually have computations that can end up taking much longer than average for specific parameter values.)

An alternative is to switch to the old version of the hdf5 printer by setting printer: hdf5_v1. Unfortunately you then can't adjust the buffer size, but on the other hand each MPI process writes output to its own hdf5 file, so there is less chance of a bottleneck. (Though more independent disk writing, which can also be problematic.) In this case I'd recommend running with the printer option disable_combine_routines: true (sketched below), which tells GAMBIT not to try to combine all the hdf5 files at the end of the run. (If a job times out while GAMBIT is doing the combining, you can end up with all the hdf5 files being corrupted...) There is an old python2 script for manually combining hdf5 files, Printers/scripts/combine_hdf5.py, which you can run like this:

python2 combine_hdf5.py <path-to-target-hdf5-file> <root group in hdf5 files> <tmp file 1> <tmp file 2> ...
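
For reference, the hdf5_v1 setup I have in mind would look roughly like this -- just a sketch, with illustrative output_file / group values, so double-check the option names against the GAMBIT version you are using:

  Printer:
    printer: hdf5_v1
    options:
      output_file: "samples.hdf5"       # illustrative filename
      group: "/data"                    # illustrative group name inside the hdf5 file
      disable_combine_routines: true    # don't combine the per-process hdf5 files at the end of the run
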
anderkve commented 3 years ago

What scanner settings are you currently using? And how many free parameters?