Added option for z_truth to be inside an hdf5 group

joselotl commented 6 months ago

Problem & Solution Description (including issue #)

If the spec-z file has truth-z inside an h5 group (for example ['photometry']['z_true']) the evaluator needs to use hdf5_groupname to access this row.

Code Quality

[ ] My code follows the code style of this project
[ ] I have written unit tests or justified all instances of #pragma: no cover; in the case of a bugfix, a new test that breaks as a result of the bug has been added
[ ] My code contains relevant comments and necessary documentation for future maintainers; the change is reflected in applicable demos/tutorials (with output cleared!) and added/updated docstrings use the NumPy docstring format
[ ] Any breaking changes, described above, are accompanied by backwards compatibility and deprecation warnings

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 96.78%. Comparing base (0b731ec) to head (f374463).

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #94 +/- ## ======================================= Coverage 96.78% 96.78% ======================================= Files 31 31 Lines 1584 1585 +1 ======================================= + Hits 1533 1534 +1 Misses 51 51 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

joselotl commented 6 months ago

I did a small change to the evaluator.py so that we would use catalogs where z_truth is inside an hdf5 group like ['photometry']

sschmidt23 commented 6 months ago

I think what most of the estimator codes do is read in the training data as a whole and then grab the redshift from that data in a separate line. But, this doesn't cover the case where the redshift column might be in a different hdf5 group, should we be worried about that at all? The other thing is, with the code in this PR, if there happened to be a column with name self.config.redshift_col in the top level hdf5, then the try statement would grab that even if the intended redshift_col was in the sub-group. I think maybe the better thing to do would be what we do in several of the estimator codes, e.g. this from rail_flexzboost:

        if self.config.hdf5_groupname:
            training_data = self.get_data('input')[self.config.hdf5_groupname]
        else:  # pragma: no cover
            training_data = self.get_data('input')
        speczs = np.array(training_data[self.config['redshift_col']])

joselotl commented 6 months ago

@sschmidt23 If I try to use self.config.hdf5_groupname I get an error when it was not previously defined. Or if I add hdf5_groupname=SHARED_PARAMS['hdf5_groupname'] to the config_options, the test asigns the value "photometry" when I want it to have a default value of None

sschmidt23 commented 6 months ago

I just pushed a change where to get the test to pass I set an hdf5_groupname="" in the test's make_stage setup. If you want a default value of hdf5_groupname of None or "", then maybe we could define a config parameter with a different name from hdf5_groupname, maybe truth_groupname or something like that to distinguish it from the hdf5_groupname parameter that we use for the estimators, and then we can set that config parameter to have whatever default we want. I guess it's really a question of whether we think that the truth data is always going to be read from files that follow the same patterns of having and hdf5_groupname that is the same as the data from the estimators. If so, hdf5_groupname is probably fine, if we think it will occasionally be different, then maybe a unique config param is needed.

sschmidt23 commented 6 months ago

Ok, I'm fine with doing it this way, I'll approve the PR, if you think about things further and think we need a "truth_groupname" we can add that in a future PR.

LSSTDESC / rail_base