alexpghayes / distributions3

Probability Distributions as S3 Objects
https://alexpghayes.github.io/distributions3/
Other
100 stars 16 forks source link

Support continuous ranked probability score (CRPS)? #88

Closed zeileis closed 2 years ago

zeileis commented 2 years ago

The continuous ranked probability score (CRPS) is a popular and strictly proper scoring rule for probabilistic models and forecasts which has some advantages over the so-called log-score (equivalent to the log-density aka log-likelihood). For example, it is bounded as the precision increases (i.e., variance decreases).

It would be nice to have crps() methods for the distribution objects in the package analogously to the log_pdf() methods. Implementation would be relatively straightforward using the nice scoringRules package (see also their JSS paper). For example, for the normal distribution:

crps.Normal <- function(y, x, drop = TRUE, elementwise = NULL, ...) {
  stopifnot(requireNamespace("scoringRules"))
  FUN <- function(at, d) scoringRules::crps_norm(y = at, mean = d$mu, sd = d$sigma)
  apply_dpqr(d = y, FUN = FUN, at = x, type = "crps", drop = drop, elementwise = elementwise)
}

It would be enough to include scoringRules in the "Suggests" as the S3 methods can be registered conditionally. scoringRules does not support all of the distributions in distributions3 but most of them. Should I go ahead and prepare a PR for this? (If you would prefer not to have it in distributions3 I could also put this into the topmodels package.)

alexpghayes commented 2 years ago

I think it makes most sense to put this in topmodels or some other package more focused on scoring rules/model assessment. So far we don't really have anything for that in distributions3.

zeileis commented 2 years ago

OK, fair enough! I think the crucial question will be whether we will get package maintainers (for packages like mgcv, gamlss, etc.) to provide prodist() methods for their models and - if necessary - write new distributions objects.

I have just done so for all my count regression functionality in package pscl and the successor countreg. In order to make the infrastructure work easily for both packages, I have contributed it to distributions3. But for our crch package I have set up new families CensoredNormal(), CensoredLogistic(), CensoredStudentsT(), TruncatedNormal(), TruncatedLogistic(), and TruncatedStudentsT(). As the corresponding d/p/q/r functions have been in crch anyway, it made most sense to create the distributions classes and methods there as well.

In short: I think that the infrastructure is modular enough for hosting it across different packages. And, if necessary, migrating them to another joint package in the future would also be possible.

I'm closing the issue here now. If anyone is interested in it, check out the topmodels package (currently just on R-Forge but hopefully also on CRAN soon).

zeileis commented 1 year ago

I have one update regarding this issue which might be interesting enough to warrant a reconsideration of a crps() method for distributions3.

So far, we just had support for computing the CRPS for those distributions that had explicit crps_*() functions in the scoringRules package.

But now my colleague @RetoStauffer has written a general crps.distribution() method that computes the CRPS for any distribution object with cdf() and quantile() methods. It does so by numeric integration in dedicated efficient C code. It is slower than the closed-form computations from scoringRules but fast enough to be applicable to millions of observations/distributions.

Of course, one can still argue that we should put this into topmodels because distributions3 is not primarily about evaluation. However, it would be nice if any package that decides to produce distributions3 objects with cdf() and quantile() methods would get a crps() method on top for free. Thus, no additional dependencies etc. would need to be added in those packages. An example for this would be gamlss.dist whose authors are currently adding distributions3 support for all their distributional families.

When I say "for free", this is from the perspective of such new packages. From the perspective of the distributions3 package, there is a price (albeit IMO not huge): An additional "Suggests" dependency (scoringRules), the need for compilation of the package (due to the C code), and an additional contributor (RS).

What do you think Alex @alexpghayes : Does this merit reconsideration for distributions3? (Disclaimer: "No" is a fair answer, of course.)

alexpghayes commented 1 year ago

This is a very cool development, but I think it is probably still best in topmodels. My primary concern here is that we use distributions3 in some very intro level courses IIUC, and adding compiled code can create some real headaches for novice R users. I know it's a bummer to have this in topmodels rather than distributions3, but we can link heavily to the topmodels functionality in the documentation, and possibly even include topmodels in Suggests or Enhances and re-conditionally export the crps() method (if that's a thing?).

zeileis commented 1 year ago

OK, that is a fair concern! We'll put it into topmodels then. And once that package is on CRAN we can think about how to best advertise the connections between the two.