Closed zeileis closed 2 years ago
I think it makes most sense to put this in topmodels
or some other package more focused on scoring rules/model assessment. So far we don't really have anything for that in distributions3
.
OK, fair enough! I think the crucial question will be whether we will get package maintainers (for packages like mgcv
, gamlss
, etc.) to provide prodist()
methods for their models and - if necessary - write new distributions
objects.
I have just done so for all my count regression functionality in package pscl
and the successor countreg
. In order to make the infrastructure work easily for both packages, I have contributed it to distributions3
. But for our crch
package I have set up new families CensoredNormal()
, CensoredLogistic()
, CensoredStudentsT()
, TruncatedNormal()
, TruncatedLogistic()
, and TruncatedStudentsT()
. As the corresponding d
/p
/q
/r
functions have been in crch
anyway, it made most sense to create the distributions classes and methods there as well.
In short: I think that the infrastructure is modular enough for hosting it across different packages. And, if necessary, migrating them to another joint package in the future would also be possible.
I'm closing the issue here now. If anyone is interested in it, check out the topmodels
package (currently just on R-Forge but hopefully also on CRAN soon).
I have one update regarding this issue which might be interesting enough to warrant a reconsideration of a crps()
method for distributions3
.
So far, we just had support for computing the CRPS for those distributions that had explicit crps_*()
functions in the scoringRules
package.
But now my colleague @RetoStauffer has written a general crps.distribution()
method that computes the CRPS for any distribution
object with cdf()
and quantile()
methods. It does so by numeric integration in dedicated efficient C code. It is slower than the closed-form computations from scoringRules
but fast enough to be applicable to millions of observations/distributions.
Of course, one can still argue that we should put this into topmodels
because distributions3
is not primarily about evaluation. However, it would be nice if any package that decides to produce distributions3
objects with cdf()
and quantile()
methods would get a crps()
method on top for free. Thus, no additional dependencies etc. would need to be added in those packages. An example for this would be gamlss.dist
whose authors are currently adding distributions3
support for all their distributional families.
When I say "for free", this is from the perspective of such new packages. From the perspective of the distributions3
package, there is a price (albeit IMO not huge): An additional "Suggests" dependency (scoringRules
), the need for compilation of the package (due to the C code), and an additional contributor (RS).
What do you think Alex @alexpghayes : Does this merit reconsideration for distributions3
?
(Disclaimer: "No" is a fair answer, of course.)
This is a very cool development, but I think it is probably still best in topmodels
. My primary concern here is that we use distributions3
in some very intro level courses IIUC, and adding compiled code can create some real headaches for novice R users. I know it's a bummer to have this in topmodels
rather than distributions3
, but we can link heavily to the topmodels
functionality in the documentation, and possibly even include topmodels
in Suggests
or Enhances
and re-conditionally export the crps()
method (if that's a thing?).
OK, that is a fair concern! We'll put it into topmodels
then. And once that package is on CRAN we can think about how to best advertise the connections between the two.
The continuous ranked probability score (CRPS) is a popular and strictly proper scoring rule for probabilistic models and forecasts which has some advantages over the so-called log-score (equivalent to the log-density aka log-likelihood). For example, it is bounded as the precision increases (i.e., variance decreases).
It would be nice to have
crps()
methods for thedistribution
objects in the package analogously to thelog_pdf()
methods. Implementation would be relatively straightforward using the nice scoringRules package (see also their JSS paper). For example, for the normal distribution:It would be enough to include
scoringRules
in the "Suggests" as the S3 methods can be registered conditionally.scoringRules
does not support all of the distributions indistributions3
but most of them. Should I go ahead and prepare a PR for this? (If you would prefer not to have it indistributions3
I could also put this into thetopmodels
package.)