Open TimothyWillard opened 3 weeks ago
The following are TODO comments found in this module; it's unclear to me what some of them mean.
1) I'm not sure what this one is for, maybe resample_skipna shouldn't be False by default?
self.resample_skipna = False  # TODO
if (
    resample_config["aggregator"].exists()
    and resample_config["skipna"].exists()
):
    self.resample_skipna = resample_config["skipna"].get()
2) This one is a bit more clear, sounds like the suggested solution here is adding a "set_zeros_to" field to the configuration that is parsed and recognized here, but I'm not sure what setting this option should do. I think this one is clear and involved enough to be split out into a separate issue.
self.zero_to_one = False
# TODO: this should be set_zeros_to and only do it for the probability
if statistic_config["zero_to_one"].exists():
    self.zero_to_one = statistic_config["zero_to_one"].get()
Also there is this snippet in Statistic.llik that might be related:
if self.zero_to_one:
    # so confusing, wish I had not used xarray to do model_data[model_data==0]=1
    model_data = model_data.where(model_data != 0, 1)
    gt_data = gt_data.where(gt_data != 0, 1)
3) I think this is referring to some sort of parameterization issue, but I'm not sure. Probably can be added as a task to GH-290.
"norm_cov": lambda x, loc, scale: scipy.stats.norm.logpdf(
    x, loc=loc, scale=scale * loc.where(loc > 5, 5)
), # TODO: check, that it's really the loc
4) I'm not certain what this means, maybe the Statistic.llik method is supposed to make guarantees about the subpopulation order in the outputted data?
likelihood = xr.DataArray(likelihood, coords=gt_data.coords, dims=gt_data.dims)
# TODO: check the order of the arguments
return likelihood
@jcblemai the git blame suggests that you may have written these comments. Do you have any particular thoughts on them, and on what should and shouldn't be split out into new issues or added to existing issues?
I think the RMSE entry in the dist_map variable of Statistic.llik is wrong. np.nansum is called after np.sqrt, so I think it's the same as absolute error:
"rmse": lambda x, y: -np.log(np.nansum(np.sqrt((x - y) ** 2))),
"absolute_error": lambda x, y: -np.log(np.nansum(np.abs(x - y))),
And testing empirically confirms this:
>>> import numpy as np
>>> x = np.random.randn(10)
>>> y = np.random.randn(10)
>>> -np.log(np.nansum(np.sqrt((x - y) ** 2)))
np.float64(-2.4563520535667993)
>>> -np.log(np.nansum(np.abs(x - y)))
np.float64(-2.4563520535667993)
Is RMSE used in practice for statistics? Would there be consequences for fixing this bug?
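For comparison, here is a minimal sketch of the difference, written in plain Python so it runs without dependencies. The helper names `abs_error_sum` and `rmse` are hypothetical and not part of gempyor, and the textbook definition used here (square root of the mean squared error, skipping NaN pairs) is an assumption about what the entry was meant to compute, not a confirmed fix:

```python
import math


def _valid_pairs(x, y):
    # Drop pairs containing NaN, loosely mirroring what np.nansum ignores.
    return [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]


def abs_error_sum(x, y):
    # sqrt(d ** 2) == |d| element-wise, so summing afterwards gives the
    # summed absolute error, which is what the current "rmse" entry computes.
    return sum(math.sqrt((a - b) ** 2) for a, b in _valid_pairs(x, y))


def rmse(x, y):
    # Textbook RMSE: average the squared differences first, then take the root.
    # Assumes at least one non-NaN pair.
    pairs = _valid_pairs(x, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))


x = [1.0, 2.0, 4.0]
y = [2.0, 2.0, 1.0]
print(abs_error_sum(x, y))  # 4.0
print(rmse(x, y))
```

With numpy arrays the second form would presumably be something like np.sqrt(np.nanmean((x - y) ** 2)); whether the result should still be wrapped in -np.log(...) as in dist_map is a separate question.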
@saraloo, @jcblemai
Thanks @TimothyWillard for going through this. This module was written as part of a three-day marathon to get a Python inference working from scratch, so a second pair of eyes is more than welcome. This code is not used yet in production (well, it is more and more) but will be very soon.
- I'm not sure what this one is for, maybe resample_skipna shouldn't be False by default?
It should certainly not be True by default, but I think we might want Statistic to fail if it is not specified and there is an NA. The ideal behavior for this part of the code is a bit unclear.
- This one is a bit more clear, sounds like the suggested solution here is adding a "set_zeros_to" field to the configuration that is parsed and recognized here, but I'm not sure what setting this option should do. I think this one is clear and involved enough to be split out into a separate issue.
self.zero_to_one = False
# TODO: this should be set_zeros_to and only do it for the probability
if statistic_config["zero_to_one"].exists():
    self.zero_to_one = statistic_config["zero_to_one"].get()
When the time series one fits to (the target/ground truth) contains zeros, then for some likelihoods this imposes a very strong belief that the value is exactly zero and not one. For this reason, it is common practice (though not necessarily best practice) to change the zeros to something like 0.5 or 1. In the R inference there is a "set zero to one" flag, but I would like to leave the replacement value up to the user. Agree on a separate issue.
- I think this is referring to some sort of parameterization issue, but I'm not sure. Probably can be added as a task to Convert as_random_distribution To Use scipy.stats Distributions #290.
"norm_cov": lambda x, loc, scale: scipy.stats.norm.logpdf(
    x, loc=loc, scale=scale * loc.where(loc > 5, 5)
),  # TODO: check, that it's really the loc
This is to mimic a likelihood function which I really don't like, but it is our most used at the moment and is in the R inference. Basically, the sd scales with the number. I would remove it, but I need it to benchmark the new Python inference against the old R inference. We could also keep it and make the 5 a parameter. (Ooh, sorry, re-reading I saw the comment. Yes, potential parametrization issue.)
- I'm not certain what this means, maybe the Statistic.llik method is supposed to make guarantees about the subpopulation order in the outputted data?
likelihood = xr.DataArray(likelihood, coords=gt_data.coords, dims=gt_data.dims)
# TODO: check the order of the arguments
return likelihood
Yes, that's on the date. Not necessarily adding a check but at least making sure the code is correct
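To make the zero-replacement idea from the second point concrete, here is a rough sketch of what a user-chosen replacement value could look like. The function name `set_zeros` and the `set_zeros_to` argument are hypothetical (nothing like this exists in the module yet), and plain lists stand in for the xarray data; on a DataArray the same thing would presumably be `data.where(data != 0, value)` as in the existing snippet. The "only do it for the probability" part of the TODO is not addressed here:

```python
def set_zeros(values, set_zeros_to=None):
    """Return a copy of values with exact zeros replaced by set_zeros_to.

    set_zeros_to=None (hypothetical default) leaves the data untouched,
    which corresponds to zero_to_one being False today.
    """
    if set_zeros_to is None:
        return list(values)
    return [set_zeros_to if v == 0 else v for v in values]


gt = [0, 3, 0, 7]
print(set_zeros(gt, set_zeros_to=0.5))  # [0.5, 3, 0.5, 7]
print(set_zeros(gt, set_zeros_to=1))    # [1, 3, 1, 7], the current zero_to_one behavior
print(set_zeros(gt))                    # [0, 3, 0, 7]
```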
Is RMSE used in practice for statistics? Would there be consequences for fixing this bug?
Very good find for the bug. The Python inference has been used a few times in production already, I think only with the Poisson llik for now. But this is important and a priority.
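On the norm_cov point above, one way to "make the 5 a parameter" could look roughly like the sketch below. It is only an illustration under assumptions: `norm_cov_logpdf`, `cov`, and `floor` are hypothetical names, scalars stand in for the xarray data, and the normal log-density is written out by hand instead of calling scipy.stats.norm.logpdf so that the example has no dependencies:

```python
import math


def norm_cov_logpdf(x, loc, cov, floor=5.0):
    # The standard deviation scales with loc, but loc is floored first so
    # the sd never drops below cov * floor. max(loc, floor) matches
    # loc.where(loc > 5, 5) for non-NaN values when floor is 5.
    scale = cov * max(loc, floor)
    z = (x - loc) / scale
    # Log-density of a normal distribution with mean loc and sd scale.
    return -0.5 * z * z - math.log(scale) - 0.5 * math.log(2.0 * math.pi)


print(norm_cov_logpdf(12.0, 10.0, 0.1))            # sd = 0.1 * 10
print(norm_cov_logpdf(3.0, 2.0, 0.1))              # sd = 0.1 * 5 (floored)
print(norm_cov_logpdf(3.0, 2.0, 0.1, floor=1.0))   # sd = 0.1 * 2
```

Keeping the default at 5 would leave the benchmark against the R inference unchanged while letting the floor be set from the configuration later.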
Label
documentation, enhancement, gempyor
Priority Label
medium priority
Describe the bug/issue
The gempyor.statistics module has light documentation, but it is largely lacking and unit tests are non-existent as far as I can tell. Will also want to "repair" the module as issues are discovered (remove dead code, consolidate duplicated code, move TODO comments to GitHub issues, etc.). Similar to GH-246.
To Reproduce
The existing documentation is given by:
And missing unit tests:
Environment, if relevant
main branch, 8b1ec08933dd57033c361bad46760e65b6ee90c5