HopkinsIDD / flepiMoP

The Flexible Epidemic Modeling Pipeline
https://flepimop.org
GNU General Public License v3.0
9 stars 4 forks source link

Convert `as_random_distribution` To Use `scipy.stats` Distributions #290

Open TimothyWillard opened 1 month ago

TimothyWillard commented 1 month ago

Is your feature request related to a problem? Please describe.

The gempyor.utils.as_random_distribution/random_distribution_sampler functions use a mix of np.random rngs and scipy.stats rngs. This makes testing these and functions that rely on these functions more convoluted then it should be as demonstrated by gempyor.testing. sample_fits_distribution.

See https://github.com/HopkinsIDD/flepiMoP/pull/277#discussion_r1707145163, https://github.com/HopkinsIDD/flepiMoP/pull/277#discussion_r1707178115.

Is your feature request related to a new application, scenario round, pathogen? Please describe.

The application is mostly for internal use in gempyor, it is unlikely that these functions will be used much externally but would assist in efforts to unit test and stabilize gempyor.

Describe the solution you'd like

Consolidate the logic in gempyor.utils.as_random_distribution/random_distribution_sampler and gempyor.testing. sample_fits_distribution together and to use distributions from scipy.stats instead of from np.random.

TimothyWillard commented 1 month ago

Related to GH-300, but would be nice to also consolidate the custom distribution handling in the llik method of gempyor.statistics.Statistics into this solution as well. This would allow for (more) standardized naming + parameterizations of distributions used throughout flepiMoP's config files. The relevant lines of code in question are:

        dist_map = {
            "pois": scipy.stats.poisson.logpmf,
            "norm": lambda x, loc, scale: scipy.stats.norm.logpdf(
                x, loc=loc, scale=self.params.get("scale", scale)
            ),  # wrong:
            "norm_cov": lambda x, loc, scale: scipy.stats.norm.logpdf(
                x, loc=loc, scale=scale * loc.where(loc > 5, 5)
            ),  # TODO: check, that it's really the loc
            "nbinom": lambda x, n, p: scipy.stats.nbinom.logpmf(
                x, n=self.params.get("n"), p=model_data
            ),
            "rmse": lambda x, y: -np.log(np.nansum(np.sqrt((x - y) ** 2))),
            "absolute_error": lambda x, y: -np.log(np.nansum(np.abs(x - y))),
        }
        if self.dist not in dist_map:
            raise ValueError(f"Invalid distribution specified: {self.dist}")
        if self.dist in ["pois", "nbinom"]:
            model_data = model_data.astype(int)
            gt_data = gt_data.astype(int)

Not really sure what to do about the MSE/MAE options, those aren't likelihoods they're losses so they don't correspond to an underlying probability distribution.