LSSTDESC / rail_base

Base classes for RAIL
MIT License

New configuration to specify point estimate calculations in `CatEstimator` stages #11

Closed: hdante closed this issue 1 year ago

hdante commented 1 year ago

Hello, here's a suggestion that was discussed with Alex Malz: add configuration support in CatEstimator to disable generation of the z mode auxiliary table when executing an estimation procedure. Specific estimators should inspect the configuration and skip creating the table if requested.

drewoldag commented 1 year ago

Some additional discussion with Alex provided more clarity around the full scope of the work here. The primary goal here is to avoid unnecessary computation and storage costs when running CatEstimator stages.

In rail_base.estimator.CatEstimator we should introduce a new config_option, calculated_point_estimates = []. We'll need to provide some instruction for devs who create child classes of this so that it's clear what values can be provided. We anticipate that calculated_point_estimates will be a list of strings, such as "mode", "median", "mean", and at some point in the future "rbpe". Specifically, the values should be the point estimate method names provided by the qp.Ensemble class.

Child classes of CatEstimator would respect the values in calculated_point_estimates and only execute the qp.Ensemble point estimate methods requested. And of course, store the Ensemble along with the output of the requested methods.

Looking at rail_flexzboost as an example (https://github.com/LSSTDESC/rail_flexzboost/blob/v0.1.2/src/rail/estimation/algos/flexzboost.py#L257), we can see that zmode is calculated by default on behalf of the user, and stored in the output hdf5 file on line 262. The change described in this issue would require that the user configure the stage to calculate and store those values explicitly as in the following snippet:

pzflex_qp_flexzboost = FZBoost.make_stage(
    name='fzboost_flexzboost',
    ...
    calculated_point_estimates=['mode'],
)

sschmidt23 commented 1 year ago

When we do this, it might make sense for this parameter to live in SHARED_PARAMS, and also to have the methods calculate the point estimates by default. That is, users would have to turn this off with an overriding config param rather than turn it on, if that makes sense.
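One way to express the default-on suggestion above in code (names here are illustrative only, not the actual SHARED_PARAMS mechanics):

```python
# Illustrative only: a default-on configuration that users override to
# disable point estimates, per the suggestion above.
DEFAULT_POINT_ESTIMATES = ["mode", "median", "mean"]


def resolve_config(user_overrides):
    """Merge user overrides onto default-on settings."""
    config = {"calculated_point_estimates": list(DEFAULT_POINT_ESTIMATES)}
    config.update(user_overrides)
    return config


# Out of the box, all point estimates are computed...
assert resolve_config({})["calculated_point_estimates"] == ["mode", "median", "mean"]
# ...and a user must opt out explicitly.
assert resolve_config({"calculated_point_estimates": []})["calculated_point_estimates"] == []
```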

drewoldag commented 1 year ago

@sschmidt23 I agree that it makes sense to add this as a key on SHARED_PARAMS. In the long run, I can imagine that it might make sense to split SHARED_PARAMS into separate inform and estimator versions, and at that stage you're not very far from implementing dataclass-like infrastructure. So in the long run I might advocate for something like that, but, to be clear, not as part of this issue.

As to whether the default should be no calculations or all calculations, I don't have a strong opinion, and the rail_base part is just as easy to implement either way.

My initial opinion would be to not calculate by default, since the user can always make a second pass to calculate the point estimates, and if they don't need them, they don't spend the extra time. However, if the general use case is that the point estimates will always be used, then it makes sense for them to be on by default.

And either way, the implementation required in the subclasses of CatEstimator is going to be the long straw.

drewoldag commented 1 year ago

If @sschmidt23, @aimalz, and @hdante can settle on the point of whether it should be "all on" or "all off" by default, that would be helpful 😄

drewoldag commented 1 year ago

A question regarding SHARED_PARAMS - I'm looking at trainZ.TrainZ and I'm unfamiliar with the syntax here: https://github.com/LSSTDESC/rail_base/blob/main/src/rail/estimation/algos/train_z.py#L65-L67

Copied for ease:

config_options.update(zmin=SHARED_PARAMS, ...)

Doesn't this just assign a reference to the SHARED_PARAMS dictionary to the key "zmin"? I assumed the intention was something along the lines of:

config_options.update(zmin=SHARED_PARAMS['zmin'], ...)

But like I said, perhaps there's some syntactic sugar I'm unfamiliar with.
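The distinction in question can be checked with plain dictionaries. The snippet below only demonstrates standard Python semantics; the real SHARED_PARAMS object may implement special resolution behavior that would explain the syntax.

```python
# Plain-Python illustration of the two update() calls discussed above.
SHARED_PARAMS = {"zmin": 0.0, "zmax": 3.0}

config_a = {}
config_a.update(zmin=SHARED_PARAMS)          # stores a reference to the whole dict
config_b = {}
config_b.update(zmin=SHARED_PARAMS["zmin"])  # stores just the value for that key

assert config_a["zmin"] is SHARED_PARAMS     # the entire dict, not a float
assert config_b["zmin"] == 0.0
```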

aimalz commented 1 year ago

If @sschmidt23, @aimalz, and @hdante can settle on the point of whether it should be "all on" or "all off" by default, that would be helpful 😄

Bearing in mind that this choice has no impact on DESC's pipeline efforts, it seems wisest to default to off, due to the potentially large computational expense and because there are several possibilities for how to derive a point estimate that could be problematic depending on properties of the data and/or analysis situation -- it's not in any user's best interest to mindlessly use canned point estimates without understanding their limitations.

Because folks (should) know that the point estimates are ill-suited to Rubin data, they'd really only be used for toy/placeholder stages of an analysis under development or for quick and dirty diagnostic purposes, both cases where the set of data would be small compared to the full-scale analyses for which point estimates would be inappropriate and costly. This holds for the ongoing efforts of the brokers, Rubin-wide in-kind teams, and commissioning team to put qp.Ensemble data products into practice.

On the other hand, the pipelines we're developing in DESC will initially have to explore multiple point estimates at that diagnostic level, so hardcoding any wouldn't save us any work on having to overwrite the default configs.

eacharles commented 1 year ago

We should figure out what the current effective configuration for the estimators is so that we can be minimally disruptive.

sschmidt23 commented 1 year ago

Continuity with the current config was one reason I wanted "all on" by default, as we currently do include point estimate calculations by default. Also, for several algorithms the point estimate calculation is not intensive; it adds only a small amount of computation. Third, I think many users do like having the point estimates, so including them by default seems like a good idea.

I don't have super strong thoughts, but default including them seemed obvious to me unless there was an argument against it.

drewoldag commented 1 year ago

@eacharles my cursory investigation of a handful of estimators suggests that the effective configuration covers the spectrum from one point estimate to all available point estimates.

Minimally disruptive does seem important! Again, I don't have a strong opinion, but if we do make a disruptive change, let's do it now, prior to v1 :)