HopkinsIDD / flepiMoP

The Flexible Epidemic Modeling Pipeline
https://flepimop.org
GNU General Public License v3.0

[Bug]: `gempyor.statistics` Missing Documentation/Unit Tests #300

Open TimothyWillard opened 3 weeks ago

TimothyWillard commented 3 weeks ago

Label

documentation, enhancement, gempyor

Priority Label

medium priority

Describe the bug/issue

The gempyor.statistics module has some light documentation, but it is largely lacking, and unit tests are non-existent as far as I can tell. We will also want to "repair" the module as issues are discovered (remove dead code, consolidate duplicated code, move TODO comments to GitHub issues, etc.). Similar to GH-246.

To Reproduce

The existing documentation is given by:

(venv) twillard@epid-iss-MBP ~/D/G/H/f/f/gempyor_pkg (main)> python
Python 3.11.9 (main, Apr  2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from gempyor.statistics import *
>>> print(Statistic.__doc__)

    A statistic is a function that takes two time series and returns a scalar value.
    It applies resample, scale, and regularization to the data before computing the statistic's log-loss.
    Configuration:
    - sim_var: the variable in the simulation data
    - data_var: the variable in the ground truth data
    - resample: resample the data before computing the statistic
        - freq: the frequency to resample the data to
        - aggregator: the aggregation function to use
        - skipna: whether to skip NA values
    - regularize: apply a regularization term to the data before computing the statistic

    # SkipNA is False by default, which results in NA values broadcasting when resampling (e.g. an NA within a sum makes the whole sum an NA)
    # If True, then NAs are replaced with 0 (for sum), 1 (for product), ...
    # When in doubt, plot stat.plot_transformed() to see the effect of the resampling
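
For context, a minimal sketch (with made-up daily data, not from gempyor) of how skipna changes an xarray resample-then-aggregate, which is what the docstring's broadcasting note describes:

    import numpy as np
    import pandas as pd
    import xarray as xr

    dates = pd.date_range("2024-01-01", periods=14, freq="D")
    da = xr.DataArray(np.ones(14), coords={"date": dates}, dims="date")
    da[3] = np.nan  # one missing day in the first week

    # skipna=False: the NaN broadcasts, so the whole weekly sum is NaN
    print(da.resample(date="W").sum(skipna=False).values)
    # skipna=True: the NaN is skipped (treated as 0 for a sum)
    print(da.resample(date="W").sum(skipna=True).values)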

And missing unit tests:

(venv) twillard@epid-iss-MBP ~/D/G/H/f/f/gempyor_pkg (main)> grep -rn tests/ -e "statistic"
(venv) twillard@epid-iss-MBP ~/D/G/H/f/f/gempyor_pkg (main) [0|1]>
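
As a seed for the missing suite, a minimal pytest sketch; it exercises the log-loss formula from Statistic.llik's dist_map directly with numpy, since constructing a Statistic requires a confuse config that I'm leaving aside here (the file name is hypothetical):

    # tests/test_statistics.py (hypothetical)
    import numpy as np

    def test_absolute_error_log_loss():
        x = np.array([1.0, 2.0, 3.0])
        y = np.array([1.5, 2.0, 2.0])
        # same formula as the "absolute_error" entry in dist_map:
        # |x - y| sums to 1.5, so the log-loss is -log(1.5)
        assert np.isclose(-np.log(np.nansum(np.abs(x - y))), -np.log(1.5))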

Environment, if relevant

main branch, 8b1ec08933dd57033c361bad46760e65b6ee90c5

TimothyWillard commented 3 weeks ago

The following are TODO comments found in this module; it's unclear to me what some of them mean.

1) I'm not sure what this one is for; maybe resample_skipna shouldn't be False by default?

            self.resample_skipna = False  # TODO
            if (
                resample_config["aggregator"].exists()
                and resample_config["skipna"].exists()
            ):
                self.resample_skipna = resample_config["skipna"].get()

2) This one is a bit clearer: it sounds like the suggested solution is adding a "set_zeros_to" field to the configuration that is parsed and recognized here, but I'm not sure what setting this option should do. I think this one is clear and involved enough to be split out into a separate issue.

        self.zero_to_one = False
        # TODO: this should be set_zeros_to and only do it for the probability
        if statistic_config["zero_to_one"].exists():
            self.zero_to_one = statistic_config["zero_to_one"].get()

Also there is this snippet in Statistic.llik that might be related:

        if self.zero_to_one:
            # so confusing, wish I had not used xarray to do model_data[model_data==0]=1
            model_data = model_data.where(model_data != 0, 1)
            gt_data = gt_data.where(gt_data != 0, 1)
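
For what it's worth, xarray's where(cond, other) keeps values where the condition holds and substitutes other elsewhere, so the lines above are indeed the xarray spelling of model_data[model_data == 0] = 1. A tiny sketch with made-up values:

    import xarray as xr

    da = xr.DataArray([0.0, 2.0, 0.0, 3.0])
    # keep entries where the condition is True, replace the rest with 1
    print(da.where(da != 0, 1).values)  # [1. 2. 1. 3.]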

3) I think this is referring to some sort of parameterization issue, but I'm not sure. It can probably be added as a task to GH-290.

            "norm_cov": lambda x, loc, scale: scipy.stats.norm.logpdf(
                x, loc=loc, scale=scale * loc.where(loc > 5, 5)
            ),  # TODO: check, that it's really the loc

4) I'm not certain what this means; maybe the Statistic.llik method is supposed to make guarantees about the subpopulation order in the output data?

        likelihood = xr.DataArray(likelihood, coords=gt_data.coords, dims=gt_data.dims)

        # TODO: check the order of the arguments
        return likelihood
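
My guess at what this TODO means (an assumption, not confirmed anywhere in the source): the DataArray constructor attaches gt_data's labels to the raw likelihood array positionally, so if likelihood were computed with its axes in a different order than gt_data.dims, the values would be silently mislabeled rather than raising. A small sketch:

    import numpy as np
    import xarray as xr

    gt = xr.DataArray(
        np.zeros((2, 2)),
        coords={"date": [0, 1], "subpop": ["a", "b"]},
        dims=("date", "subpop"),
    )
    vals = np.array([[1.0, 2.0], [3.0, 4.0]])
    # labels are attached positionally; if `vals` had been computed
    # transposed (subpop, date), nothing here would error
    lik = xr.DataArray(vals, coords=gt.coords, dims=gt.dims)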

@jcblemai, the git blame suggests you may have written these comments; do you have any thoughts on them, and on what should and shouldn't be split out into new issues or added to existing ones?

TimothyWillard commented 3 weeks ago

I think the RMSE entry in the dist_map variable of Statistic.llik is wrong. np.sqrt is applied elementwise before np.nansum, so sqrt((x - y) ** 2) is just |x - y| and the expression reduces to absolute error:

            "rmse": lambda x, y: -np.log(np.nansum(np.sqrt((x - y) ** 2))),
            "absolute_error": lambda x, y: -np.log(np.nansum(np.abs(x - y))),

And testing empirically confirms this:

>>> import numpy as np
>>> x = np.random.randn(10)
>>> y = np.random.randn(10)
>>> -np.log(np.nansum(np.sqrt((x - y) ** 2)))
np.float64(-2.4563520535667993)
>>> -np.log(np.nansum(np.abs(x - y)))
np.float64(-2.4563520535667993)
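
For comparison, a sketch of what a true RMSE entry might look like, with the square root moved outside the aggregation (using nanmean rather than nansum is my assumption about the intent):

    import numpy as np

    # reduce first, then take the root: a genuine root-mean-square error
    rmse_llik = lambda x, y: -np.log(np.sqrt(np.nanmean((x - y) ** 2)))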

Is RMSE used in practice for statistics? Would there be consequences for fixing this bug?

@saraloo, @jcblemai

jcblemai commented 4 days ago

Thanks @TimothyWillard for going through this. This module was written as part of a 3-day marathon to get a Python inference working from scratch, so a second pair of eyes is more than welcome. This code is not used in production yet (well, it is more and more) but will be very soon.

  1. I'm not sure what this one is for; maybe resample_skipna shouldn't be False by default?

It should certainly not be True by default, but I think we might want Statistic to fail if it is not specified and there is an NA. The ideal behavior for this part of the code is a bit unclear.
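
A minimal sketch of that fail-loudly behavior, as a hypothetical helper (nothing like this exists in the module yet):

    import xarray as xr

    def check_resample_nas(data: xr.DataArray, skipna_specified: bool) -> None:
        # refuse to silently pick a skipna behavior when NAs are present
        # and the config never said how to treat them
        if not skipna_specified and bool(data.isnull().any()):
            raise ValueError(
                "data contains NA values but 'resample.skipna' was not set; "
                "set it explicitly to choose a behavior"
            )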

  2. This one is a bit clearer: it sounds like the suggested solution is adding a "set_zeros_to" field to the configuration that is parsed and recognized here, but I'm not sure what setting this option should do. I think this one is clear and involved enough to be split out into a separate issue.
        self.zero_to_one = False
        # TODO: this should be set_zeros_to and only do it for the probability
        if statistic_config["zero_to_one"].exists():
            self.zero_to_one = statistic_config["zero_to_one"].get()

When the time series one fits to (the target/ground truth) contains zeros, then for some likelihoods this imposes a very strong belief that the value is exactly zero and not, say, one. For this, it is common practice (though not necessarily best practice) to change the zero to something like 0.5 or 1. In the R inference there is a "set zero to one" flag, but I would like to leave the choice of replacement value up to the user. Agree on a separate issue.
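
A sketch of what that could look like, with a hypothetical set_zeros_to value threaded through (the helper name and the None-means-off convention are my inventions):

    import xarray as xr

    def apply_set_zeros_to(gt_data: xr.DataArray, set_zeros_to: float | None) -> xr.DataArray:
        # replace exact zeros in the ground truth with the user-chosen
        # value; leave the data untouched when the option is not set
        if set_zeros_to is None:
            return gt_data
        return gt_data.where(gt_data != 0, set_zeros_to)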

  3. I think this is referring to some sort of parameterization issue, but I'm not sure. It can probably be added as a task to Convert as_random_distribution To Use scipy.stats Distributions #290.
            "norm_cov": lambda x, loc, scale: scipy.stats.norm.logpdf(
                x, loc=loc, scale=scale * loc.where(loc > 5, 5)
            ),  # TODO: check, that it's really the loc

This is to mimic a likelihood function which I really don't like but which is our most used at the moment and in the R inference. Basically, the sd scales with the value. I would remove it, but I need it to benchmark the new Python inference against the old R inference. We could also keep it and make the 5 a parameter. (Oh, sorry, re-reading I saw the comment. Yes, a potential parameterization issue.)
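
A sketch of that parameterization, with the floor of 5 pulled out as an argument (the factory name and signature are my invention; loc is expected to be an xarray DataArray, as in the original lambda):

    import scipy.stats

    def make_norm_cov(scale_floor: float = 5.0):
        # sd scales with loc, floored at `scale_floor` so small values
        # don't collapse the sd toward zero
        def norm_cov(x, loc, scale):
            return scipy.stats.norm.logpdf(
                x, loc=loc, scale=scale * loc.where(loc > scale_floor, scale_floor)
            )
        return norm_cov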

  4. I'm not certain what this means; maybe the Statistic.llik method is supposed to make guarantees about the subpopulation order in the output data?
        likelihood = xr.DataArray(likelihood, coords=gt_data.coords, dims=gt_data.dims)

        # TODO: check the order of the arguments
        return likelihood

Yes, that's about the dates. Not necessarily adding a check, but at least making sure the code is correct.

Is RMSE used in practice for statistics? Would there be consequences for fixing this bug?

Very good find on the bug. The Python inference has been used a few times in production already, I think only with the Poisson llik for now. But this is important and a priority.