Interface for distributions3 infrastructure, enabling topmodels infrastructure

zeileis commented 5 months ago

Niki @freezenik, in this PR I have done the following:

Added a prodist() method for extracting/predicting probability distribution objects in the framework of the distributions3 package. The idea of the package is that you can work with (vectors of) distributions in an object-oriented way, e.g., for computing moments, probabilities, quantiles, or random numbers etc.
The accompanying manual page shows how to call prodist() directly and then work with the resulting distributions3 objects "by hand". See also https://www.zeileis.org/news/user2022/ for some more background and motivation.
The more interesting part, however, is that it "unlocks" all of the infrastructure in the topmodels package, providing probabilistic forecasts, residuals, scoring rules, goodness-of-fit graphics, etc.
Because the prodist.gamlss2 documentation needed to link to the documentation of predict.gamlss2 I have also added a manual page for this.
Finally, a Suggests dependency is added for distributions3 and the prodist() method is registered in the NAMESPACE. No dependency is needed for topmodels at the moment, but maybe a vignette could be useful in the future.

zeileis commented 5 months ago

Additional remark: I think that the default for the predict() method is a bit unintuitive. While it is consistent with the method for glm objects, I've always felt that this is confusing in practice and in teaching. Personally, I would either use type = "parameter" because that describes the full predicted probability distribution. Alternatively, type = "response" might feel intuitive for those who mainly look for a mean regression model.

zeileis commented 5 months ago

Illustration: After merging the PR, you can work with gamlss2 in combination with the topmodels package from R-Forge (see: https://topmodels.R-Forge.R-project.org/).

Model:

library("gamlss2")
m <- gamlss2(rent ~ s(area) + s(yearc) + location + bath + cheating | s(area) + s(yearc) + location + bath + cheating,
  data = rent99, family = GA)
## GAMLSS-RS iteration  3: Global Deviance = 38257.0504 eps = 0.000005 
logLik(m)
## 'log Lik.' -19128.53 (df=24.69217)

Scoring rules:

library("topmodels")
proscore(m, type = "logLik", aggregate = sum)
##      logLik
## 1 -19128.53
proscore(m, type = c("logLik", "CRPS", "MAE", "MSE")) |>
  head(n = 3)
##      logLik     CRPS      MAE         MSE
## 1 -5.543098 46.49766 85.41233 7295.265874
## 2 -5.408971 23.87493 33.87696 1147.648170
## 3 -5.342137 19.37924  1.71472    2.940265

(In principle, these also work with a newdata argument. But currently extracting the newresponse() does not work for gamlss2, yet. Will add that later.)

Probabilistic forecasting: Full distribution objects, moments, probabilities, quantiles.

nd <- data.frame(area = 5:8 * 10, yearc = 1990, location = "1", bath = "1", cheating = "1")
procast(m, newdata = nd)
##                                                distribution
## 1 GAMLSS GA distribution (mu = 495.6917, sigma = 0.2213075)
## 2 GAMLSS GA distribution (mu = 575.9095, sigma = 0.2238684)
## 3 GAMLSS GA distribution (mu = 634.3635, sigma = 0.2264513)
## 4 GAMLSS GA distribution (mu = 679.2961, sigma = 0.2290589)
procast(m, newdata = nd, type = "variance")
##   variance
## 1 12034.15
## 2 16622.41
## 3 20636.02
## 4 24210.99
procast(m, newdata = nd, type = "cdf", at = 750, lower.tail = FALSE)
##   probability
## 1  0.01871957
## 2  0.09558729
## 3  0.20180883
## 4  0.30315932
procast(m, newdata = nd, type = "quantile", at = c(0.05, 0.5, 0.95))
##     q_0.05    q_0.5   q_0.95
## 1 330.0660 487.6231 688.8469
## 2 381.4687 566.3177 803.0775
## 3 417.9578 623.5536 887.6522
## 4 445.1584 667.4533 953.8415

Goodness-of-fit graphics: Hanging rootogram, PIT histogram, residual Q-Q plot, worm plot (detrended Q-Q plot). Either via base graphics or via ggplot2.

rootogram(m)
pithist(m)
qqrplot(m)
wormplot(m)

topmodels goodness-of-fit graphics for gamlss2

library("ggplot2")
rootogram(m)
pithist(m)
qqrplot(m)
wormplot(m)

topmodels goodness-of-fit graphics for gamlss2 via ggplot2

zeileis commented 5 months ago

Illustration: The manual page ?prodist.gamlss2 also shows how to work with the distributions3 objects "by hand" (rather than via the high-level interfaces in topmodels).

Packages, code, and data:

library("distributions3")
data("cars", package = "datasets")

Fit heteroscedastic normal GAMLSS model: Stopping distance (ft) explained by speed (mph).

m <- gamlss2(dist ~ s(speed) | s(speed), data = cars, family = NO)
## GAMLSS-RS iteration  4: Global Deviance = 405.6976 eps = 0.000000

Obtain predicted distributions for three levels of speed:

d <- prodist(m, newdata = data.frame(speed = c(10, 20, 30)))
print(d)
##                                                    1 
## "GAMLSS NO distribution (mu = 23.04, sigma = 10.06)" 
##                                                    2 
## "GAMLSS NO distribution (mu = 59.04, sigma = 18.51)" 
##                                                    3 
## "GAMLSS NO distribution (mu = 96.35, sigma = 33.95)"

Obtain quantiles (works the same for any distribution object d !):

quantile(d, 0.5)
##        1        2        3 
## 23.03912 59.03607 96.34894 
quantile(d, c(0.05, 0.5, 0.95), elementwise = FALSE)
##      q_0.05    q_0.5    q_0.95
## 1  6.486965 23.03912  39.59128
## 2 28.589634 59.03607  89.48250
## 3 40.504859 96.34894 152.19302
quantile(d, c(0.05, 0.5, 0.95), elementwise = TRUE)
##          1          2          3 
##   6.486965  59.036068 152.193022

Visualization:

plot(dist ~ speed, data = cars)
nd <- data.frame(speed = 0:240/4)
nd$dist <- prodist(m, newdata = nd)
nd$fit <- quantile(nd$dist, c(0.05, 0.5, 0.95))
matplot(nd$speed, nd$fit, type = "l", lty = 1, col = "slategray", add = TRUE)

prodist example for cars cara

Moments:

mean(d)
##        1        2        3 
## 23.03912 59.03607 96.34894 
variance(d)
##         1         2         3 
##  101.2639  342.6244 1152.6562

Simulate random numbers:

random(d, 5)
##        r_1       r_2      r_3        r_4       r_5
## 1 24.91776  25.49919 25.40398 25.5270449  16.98507
## 2 76.66499  86.89972 24.54569 84.5080020  52.88821
## 3 99.13546 116.57334 84.41239  0.9694538 108.17023

Density and distribution:

pdf(d, 50 * -2:2)
##         d_-100        d_-50          d_0        d_50        d_100
## 1 1.365773e-34 1.440745e-13 0.0028836926 0.001095127 7.891025e-15
## 2 2.012488e-18 6.289571e-10 0.0001332378 0.019131663 1.862073e-03
## 3 6.414069e-10 1.084305e-06 0.0002095207 0.004627637 1.168285e-02
cdf(d, 50 * -2:2)
##         p_-100        p_-50          p_0       p_50     p_100
## 1 1.116688e-34 1.961558e-13 0.0110254770 0.99631019 1.0000000
## 2 4.279173e-18 1.923747e-09 0.0007128557 0.31271502 0.9865531
## 3 3.661609e-09 8.139892e-06 0.0022705722 0.08609824 0.5428196

Poisson example:

data("FIFA2018", package = "distributions3")
m2 <- gamlss2(goals ~ s(difference), data = FIFA2018, family = PO)
## GAMLSS-RS iteration  2: Global Deviance = 355.3922 eps = 0.000000    
d2 <- prodist(m2, newdata = data.frame(difference = 0))
print(d2)
##                                     1 
## "GAMLSS PO distribution (mu = 1.237)" 
quantile(d2, c(0.05, 0.5, 0.95))
## [1] 0 1 3

Note that log_pdf() can replicate logLik() value:

sum(log_pdf(prodist(m2), FIFA2018$goals))
## [1] -177.6961
logLik(m2)
## 'log Lik.' -177.6961 (df=2.005144)

zeileis commented 5 months ago

Bug: This has nothing to do with my PR. But working with the example above, I noticed that currently the summary() method does not work for m2, maybe because there is just one model parameter?

summary(m2)
## Error in if (any(j <- (ps == "p") & (nx == i))) { : 
##   missing value where TRUE/FALSE needed

freezenik commented 5 months ago

Thanks you @zeileis! If fixed the summary problem now.

freezenik commented 5 months ago

Also great that you coded the prodist! I merge the PR now!

zeileis commented 5 months ago

Thanks for the PR merge and the quick summary() fix!

We should still revisit the following items at some point (but not necessarily now):

Default type in predict.gamlss2().
Dispatch of distribution methods: Either to the functions in gamlss.dist through their name (current implementation, via <gamlss2_object>$family$family) or to the functions directly stored in <gamlss2_object>$family. I'm not sure what the relative (dis)advantages are.
Appropriate method of newresponse.gamlss2() in topmodels or improving defaults that they work.

Maybe it's also easier to first discuss these in person.

freezenik commented 5 months ago

Great, thanks @zeileis for the input! I think this is also in line with @mstasinopoulos to predict the parameters of the distribution per default. This is the new default now.

I also added a newresponse.gamlss2() in prodist.R, hope that works?

Considering the distribution methods, I really not sure yet what to do ...

zeileis commented 5 months ago

Thanks, Niki!

For now the newresponse() is ok but I will likely make changes to this in topmodels in the next weeks. Maybe a dedicated method is then not needed.

Regarding the distribution methods: Before we do anything, it's better to discuss, I think. Given that the current solution works for the families in gamlss.dist this is not so urgent. Hopefully I'll be in the office tomorrow.

gamlss-dev / gamlss2

Interface for distributions3 infrastructure, enabling topmodels infrastructure #1