Input data naming in estimation codes

sschmidt23 commented 3 years ago

Currently the estimation codes assume that the ingested data has names "mag_{band}_lsst and magerr{band}_lsst for band in [u,g,r,i,z,y], and redshift as 'redshift' in the training data. We should change this to a more generic naming for added flexibility.

johannct commented 3 years ago

it would live in the config then? any idea what kind of generic naming could be used? Another point at stake is what to do if an algo is able to use ancillary information in the templating or learning, like the errors to start with?

sschmidt23 commented 3 years ago

The errors are already included, but yes, algorithms that want to use additional information is one concern, as well as photometry names that may have different names, e.g. including a "dered" or other extension in the name.

aimalz commented 3 years ago

I agree that we'll have to standardize the naming/ordering of columns in all data we provide for each set of test conditions as well as give estimators the opportunity to establish the correspondence between those names and whatever variable names their code uses internally, while also minimizing the number of files that have to be updated if we have to make changes to the schema. Maybe we could keep hardcoded column names/orders out of the subclass modules by taking advantage of the dual global vs. per-estimator config structure? (That still wouldn't address if the meaning of the inputs changed, but I'm having trouble imagining an example that we couldn't preprocess our way around.)

There would be a bigger problem if the codes were expecting inherently different information, which we anticipate as the included codes grow in complexity. (I think @jmccull had put some thought into this early on.) Though we don't yet have a way to forward-model mock data consistent with SED templates (rather than the much simpler photometry necessary for machine learning codes), @sylvielsstfr is implementing the first (hybrid) template-fitting estimator subclass that will certainly encounter this in #49, and @sschmidt23 has mentioned wanting to wrap another soon. As those develop, I'm sure we'll gain a better understanding of the needs for the provision of shared ancillary data across codes that may be expecting diverse formats, let alone different content entirely.

johannct commented 3 years ago

add LePhare++ to the list.....

jfcrenshaw commented 2 years ago

This issue is also relevant to the Creation Module, where we need to decide on a default naming scheme for redshift and the LSST bandpasses in Creator, FlowEngine, and LSSTErrorModel.

My main request is that parameter errors are named ..._err, as this is what pzflow looks for when looking for errors. For example, when looking for errors for a column named u, pzflow looks for a column named u_err.

yanzastro commented 2 years ago

I created a new branch for this issue flexible_colname. I added two new config parameters into the Estimator class to specify column names for magnitudes and magnitude errors. The format should be: prefix_{}_suffix, then "{}" will be replaced by 'ugrizy' and save to two dictionaries that look like:

{'u': 'prefix_u_suffix', ...};

I also updated other estimators in accordance with this change in the superclass, except pzflow and knnpz which have their own way to specify column names. Still need to work out a unified way for all the subclasses... The updates have been tested with bpz_lite but not other estimators yet.

sschmidt23 commented 1 year ago

I just checked all of the estimators, we have flexible ingestion of column names via a config parameter for all of the "real" estimators, though I see mag_i_lsst is still hardcoded for the dummy "randomPZ" estimator that just generates a random Gaussian, as well as redshift being hardcoded for trainZ. I'll create a quick PR to remove the hardcoded names and replace with config parameters so we can close this very old issue.

LSSTDESC / rail_attic

Input data naming in estimation codes #47