memory usage of streaming reduction

taobrienlbl commented 4 years ago

In #396, streaming mode was enabled for the table reduction stage in the Bayesian AR detector algorithm. Analysis of the profiler output shows that the memory usage scales linearly with two factors: the number of threads, and the number of rows in the parameter table. This limits the number of table rows that can be utilized: out-of-memory issues eventually occur.

Profiler output (see below) shows that memory usage increases linearly during the reduction stage. My initial hypothesis is that memory builds up because the reduction thread cannot keep up with the output generated by all of the working threads.

[Figure: profiler output from a run on 8192 table rows]

We should analyze this in more depth to understand where the memory usage is coming from. If the above hypothesis is correct, @burlen suggested that we could multi-thread the reduction.

When we get around to testing this, a large version of the BARD parameter file, with a couple of million rows, is available at NERSC: /global/project/projectdirs/m1517/cascade/taobrien/teca_bayesian_ar_detector/mcmc_training/brute_force_analysis/teca_bruteforce_install/TECA/alg/teca_bayesian_ar_detect_parameters.cxx (I would attach it here, but it is over 200MB.)

burlen commented 4 years ago

I agree that Travis's hypothesis is a factor in the OOM condition. Multi-threading the reduction stage makes sense.

While there is an imbalance between the capacity to generate data and the capacity to reduce it, it may also be necessary to limit the number of threads generating data to be reduced. For instance, one could set the thread_pool_size to a multiplicative factor of the number of threads used in the reduction, e.g. 2 generator threads per reduction thread. This factor will be problem dependent.
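As a rough illustration of that sizing rule, here is a minimal sketch. The function name and its parameters are hypothetical and are not part of TECA; the cap at the number of cores is an added assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not TECA's API): size the generator thread pool as a
// multiple of the number of reduction threads, capped by the number of cores
// available to this rank (the cap is an assumption of this sketch).
std::size_t generator_pool_size(std::size_t n_reduce_threads,
    std::size_t generators_per_reducer, std::size_t n_cores)
{
    assert(n_reduce_threads > 0 && generators_per_reducer > 0 && n_cores > 0);

    // requested size: e.g. 2 generator threads per reduction thread
    std::size_t wanted = n_reduce_threads * generators_per_reducer;

    // never ask for more generator threads than there are cores
    return std::min(wanted, n_cores);
}
```

With a single reduction thread and a factor of 2 this yields 2 generator threads; the appropriate factor would have to be found by profiling the specific problem.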

Note that even though the reduction is single threaded, one can limit the number of generating threads on the command line. For instance, in the above run one could potentially avoid the OOM by reducing the number of generator threads from 32 to 16, at the expense of run time.

It may also be necessary to block the generating threads once some amount of data has been produced, to avoid the OOM.
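One common way to get this kind of blocking is a bounded queue between the generators and the reducer. The sketch below is illustrative only: bounded_queue, demo_result, and run_demo are hypothetical names, not TECA classes, and integers stand in for the tables that would actually be queued.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical bounded queue (not TECA's API): push() blocks while the queue
// already holds `capacity` items, so generators cannot run ahead of the reducer.
template <typename T>
class bounded_queue
{
public:
    explicit bounded_queue(std::size_t capacity) : m_capacity(capacity)
    { assert(capacity > 0); }

    void push(T item)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // wait here until the consumer has made room
        m_not_full.wait(lock, [this] { return m_items.size() < m_capacity; });
        m_items.push(std::move(item));
        m_high_water = std::max(m_high_water, m_items.size());
        lock.unlock();
        m_not_empty.notify_one();
    }

    T pop()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_not_empty.wait(lock, [this] { return !m_items.empty(); });
        T item = std::move(m_items.front());
        m_items.pop();
        lock.unlock();
        m_not_full.notify_one();
        return item;
    }

    // largest number of items ever held at once
    std::size_t high_water_mark() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_high_water;
    }

private:
    mutable std::mutex m_mutex;
    std::condition_variable m_not_full, m_not_empty;
    std::queue<T> m_items;
    std::size_t m_capacity;
    std::size_t m_high_water = 0;
};

struct demo_result { long sum; std::size_t high_water; };

// Several generator threads feed one reducer (the calling thread).
demo_result run_demo(int n_producers, int items_per_producer, std::size_t capacity)
{
    bounded_queue<int> queue(capacity);
    std::vector<std::thread> producers;
    for (int p = 0; p < n_producers; ++p)
        producers.emplace_back([&queue, items_per_producer] {
            for (int i = 0; i < items_per_producer; ++i)
                queue.push(i);
        });

    long sum = 0;
    for (int i = 0; i < n_producers * items_per_producer; ++i)
        sum += queue.pop();

    for (auto &t : producers)
        t.join();

    return {sum, queue.high_water_mark()};
}
```

In this sketch the memory held in flight is bounded by the queue capacity rather than by how far the generators get ahead, at the cost of generators idling whenever the reducer falls behind.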

burlen commented 4 years ago

All of the above assumes that a given rank has the resources to reduce a time step. When the table gets large enough, this may not be the case. Another approach would be to split the reduction (applied to each time step) over multiple MPI ranks. One could carve the communicator into groups of N, where the N ranks in a group cooperate to reduce each time step.
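A minimal sketch of how ranks might be assigned to such groups is below. The struct and function names are hypothetical. The assumption is that the resulting color and key would be passed to MPI_Comm_split to create one sub-communicator per group; the MPI call itself is left out so that the snippet compiles without an MPI installation.

```cpp
#include <cassert>

// Hypothetical helper (not TECA's API): color identifies the group a rank
// belongs to, key is the rank's position within that group.
struct group_assignment { int color; int key; };

// Consecutive world ranks are placed in the same group of `group_size` (N).
// If the number of ranks is not a multiple of N, the last group is smaller.
group_assignment assign_reduction_group(int world_rank, int group_size)
{
    assert(world_rank >= 0 && group_size > 0);
    return {world_rank / group_size, world_rank % group_size};
}
```

With N = 4, for example, ranks 0 through 3 land in group 0 and ranks 4 through 7 in group 1. How the table rows would then be partitioned among the ranks of a group is a separate design question that this sketch does not address.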

burlen commented 1 year ago

This should have been improved by bf99eff3

LBL-EESA / TECA

memory usage of streaming reduction #402