LSSTDESC / rail_attic

Redshift Assessment Infrastructure Layers

Add parallelization to the summarizers #209

Closed · sschmidt23 closed this 1 year ago

sschmidt23 commented 2 years ago

The existing summarizer codes currently assume that you will load all of the data at once and use only one process to create the summarized ensemble. Three of the four existing summarizers (NZDir, NaiveStack, PointEstimateHist) have a major aspect that lends itself easily to parallelization (I'm not so sure about varInference and its iterative determination of the distribution, which seems a bit trickier): PointEstimateHist and NZDir are simply adding points to a histogram, and NaiveStack is just summing the individual PDFs evaluated on a fixed grid into a gridded parameterization.

So, a method that takes each chunk, computes its histogram/grid sum, and then combines the per-chunk results into the final summed distribution should be fairly straightforward.
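
To make the chunk-and-combine idea concrete, here is a minimal numpy sketch (not the RAIL API; `summarize_in_chunks`, the grid, and the toy data are all illustrative). Per-chunk histograms merge by plain addition, and the same pattern applies to NaiveStack's grid-evaluated PDF sums.

```python
import numpy as np

def summarize_in_chunks(chunk_iterator, zgrid):
    """Accumulate a histogram N(z) one chunk at a time."""
    total_counts = np.zeros(len(zgrid) - 1)
    for chunk in chunk_iterator:
        counts, _ = np.histogram(chunk, bins=zgrid)
        total_counts += counts  # merging chunks is just addition
    return total_counts

# Three chunks of point estimates give the same result as a single
# pass over the concatenated data.
zgrid = np.linspace(0.0, 3.0, 301)
rng = np.random.default_rng(42)
chunks = [rng.uniform(0.0, 3.0, size=10_000) for _ in range(3)]
nz_chunked = summarize_in_chunks(iter(chunks), zgrid)
nz_single, _ = np.histogram(np.concatenate(chunks), bins=zgrid)
assert np.array_equal(nz_chunked, nz_single)
```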

One complication: the branch that I plan to turn into a PR to be merged soon, issue/173/summ_errors, changes the format of all four summarizers to return N samples as an ensemble rather than a single PDF, where the samples are based on bootstraps. Computing bootstraps in chunks is a little more complicated; we should probably look up the "bag of little bootstraps" technique and try to figure out the optimal way to return the N samples with proper bootstrap statistics. For PointEstimateHist and NaiveStack the bootstrap is done over the large number of PDFs, so the samples are going to be quite similar; NZDir is the more realistic case, with samples determined by the smaller spectroscopic sample, which will lead to more reasonable sample uncertainties via the bootstraps (i.e. I think NZDir is the summarizer we should actually focus on in terms of implementing the bootstraps properly).
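
For reference, here is a rough sketch of the bag of little bootstraps idea (Kleiner et al. 2014) applied to a histogram-style N(z). All names are illustrative and the subset/resample counts are placeholders, not a proposal for the actual RAIL implementation; the key point is that each subset fits in memory while the multinomial weights emulate full-size-n resamples.

```python
import numpy as np

def blb_nz_samples(z, zgrid, n_subsets=5, subset_exp=0.6,
                   n_resamples=20, seed=0):
    """Bag-of-little-bootstraps N(z) samples from a redshift array."""
    rng = np.random.default_rng(seed)
    n = len(z)
    b = int(n ** subset_exp)  # "little" subset size, b = n**gamma
    samples = []
    for _ in range(n_subsets):
        subset = rng.choice(z, size=b, replace=False)
        for _ in range(n_resamples):
            # Multinomial weights emulate a size-n resample drawn from
            # the b subset points, without ever materializing n points.
            weights = rng.multinomial(n, np.full(b, 1.0 / b))
            counts, _ = np.histogram(subset, bins=zgrid, weights=weights)
            samples.append(counts / n)
    return np.array(samples)  # (n_subsets * n_resamples, n_bins)
```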

joselotl commented 1 year ago

Hi @sschmidt23, I'm starting to work on this issue. I saw that NZDir has the bootstrap implemented over the training dataset, but I think that applying the "bag of little bootstraps" technique by sampling over the training dataset would be unrealistic, because each node would still need the full photometric dataset. Maybe doing the sampling over the photometric data would be better. How large will the final training dataset be? Is it feasible to send the full training dataset to each node?

sschmidt23 commented 1 year ago

My guess is that the training set will be big but not nearly as big as the photometric datasets, probably on the order of 100,000-1,000,000 galaxies (there are larger samples of bright galaxies, e.g. SDSS, but there's no reason to use all of them when we only have 10,000-100,000 faint-galaxy spec-z's where the bulk of our data actually live in parameter space, I think). So, I think it still might be feasible to try bootstrapping the spec-z sample.

The other reason to do this is that bootstrapping the spec-z sample is one way in which other surveys have estimated one particular aspect of their error budget. I know DES did this; it's referenced in their SOM paper, and I can dig up the reference in a bit. The paper may also have details on which strategy they used for the bootstrap, whether it was bag of little bootstraps or some other method.

sschmidt23 commented 1 year ago

I believe this is the paper that did a spec-z bootstrap: https://ui.adsabs.harvard.edu/abs/2020A%26A...633A..69H/abstract. However, they point out that it isn't a large effect, and that sample variance from small fields could be a bigger one. We should think about which different strategies we want to test in terms of how we estimate uncertainties.

yanzastro commented 1 year ago

I have a basic question here: are we going to parallelize separately for each summarizer or use one method to parallelize all of them?

A related question: the minisom package is quite nice but not parallelized. I was wondering whether we could fork it and add parallelization.

joselotl commented 1 year ago

Hi @yanzastro

I am still not sure how to proceed with the parallelization; I have two possibilities in mind.

Both possibilities would work for the summarizer phase of simpleSOM. To parallelize the training phase, as you said, we should take a look at minisom.
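
For what it's worth, the summarizer phase can be chunked even without touching minisom: once the SOM is trained it is small enough to broadcast, cell assignment is embarrassingly parallel, and per-chunk cell counts combine by addition. A minimal sketch using minisom's public `winner` method; the chunking scheme, grid size, and toy data are all illustrative.

```python
import numpy as np
from minisom import MiniSom

def cell_occupation(som, chunk, shape):
    """Count how many galaxies in this chunk land in each SOM cell."""
    counts = np.zeros(shape, dtype=int)
    for row in chunk:
        counts[som.winner(row)] += 1
    return counts

# Train once (serially), then summarize chunk by chunk; the per-chunk
# counts just add, so this maps cleanly onto mpi4py or dask workers.
rng = np.random.default_rng(0)
colors = rng.normal(size=(30_000, 5))  # stand-in for photometric colors
som = MiniSom(20, 20, 5, sigma=1.5, learning_rate=0.5, random_seed=0)
som.train(colors, 1000)
total = sum(cell_occupation(som, c, (20, 20))
            for c in np.array_split(colors, 4))
```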

sschmidt23 commented 1 year ago

At the Chicago meeting we discussed which quantities we might want to use as the quantity that is bootstrapped; however, I don't think we came up with many concrete examples beyond generic statements, e.g. that we might want to bootstrap on some flag in the data that could be tied to a variety of factors (observing conditions, position on the sky, etc.), so that might be a place to think about going forward.
As you say, simply parallelizing over the bootstraps is also an option, since any summarizer will be producing N bootstrap samples. However, I think Joe Zuntz's concern was that loading all of the data into memory for the summarizers is infeasible, which is why bag of little bootstraps was mentioned as a possibility while chunking up the data. We should discuss what we want to prioritize first at a meeting in the near future.
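
To illustrate the parallelize-over-the-N-bootstraps option: each sample only needs a seed plus the (small) spec-z sample, so the samples map directly onto a process pool. A hedged sketch; the NZDir weighting step is replaced here by a plain histogram of the resampled spec-z's, and all names are made up for illustration.

```python
import numpy as np
from multiprocessing import Pool

ZGRID = np.linspace(0.0, 3.0, 301)

def one_bootstrap(args):
    """Resample the spec-z's with replacement and histogram them."""
    specz, seed = args
    rng = np.random.default_rng(seed)
    resampled = rng.choice(specz, size=len(specz), replace=True)
    return np.histogram(resampled, bins=ZGRID)[0]

if __name__ == "__main__":
    specz = np.random.default_rng(1).uniform(0.0, 3.0, size=50_000)
    with Pool(4) as pool:
        nz_samples = pool.map(one_bootstrap,
                              [(specz, s) for s in range(25)])
    nz_samples = np.array(nz_samples)  # (25, n_bins): one N(z) per sample
```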

@yanzastro as for simpleSOM and parallelization, I believe one of the primary authors of minisom also wrote an alternative version called xpysom, which Joe Zuntz forked: https://github.com/joezuntz/xpysom. I believe it has some parallelization for multi-CPU/GPU speedup. I need to look into how seamless it would be to swap minisom for xpysom in the current implementation, but I haven't gotten around to that yet. My guess is that it will be very straightforward to make that switch.

eacharles commented 1 year ago

Should we set up a meeting with the relevant people?

janewman-pitt-edu commented 1 year ago

Yes, it's hard to come up with a realistic situation where sample/cosmic variance doesn't completely dominate over shot noise. That would only happen if you had galaxies widely distributed across the sky AND at high redshift (at low redshift the variance is still substantial, e.g. as seen with the SDSS Great Wall). Something like that may come from the Subaru/PFS in-kind program, but even then the sky areas sampled are likely not that huge.

