gamlss-dev / gamlss2

gamlss2: GAMLSS Infrastructure for Flexible Distributional Regression
https://gamlss-dev.github.io/gamlss2/
5 stars 1 forks source link

Interface for distributions3 infrastructure, enabling topmodels infrastructure #1

Closed zeileis closed 5 months ago

zeileis commented 5 months ago

Niki @freezenik, in this PR I have done the following:

zeileis commented 5 months ago

Additional remark: I think that the default for the predict() method is a bit unintuitive. While it is consistent with the method for glm objects, I've always felt that this is confusing in practice and in teaching. Personally, I would either use type = "parameter" because that describes the full predicted probability distribution. Alternatively, type = "response" might feel intuitive for those who mainly look for a mean regression model.

zeileis commented 5 months ago

Illustration: After merging the PR, you can work with gamlss2 in combination with the topmodels package from R-Forge (see: https://topmodels.R-Forge.R-project.org/).

Model:

library("gamlss2")
m <- gamlss2(rent ~ s(area) + s(yearc) + location + bath + cheating | s(area) + s(yearc) + location + bath + cheating,
  data = rent99, family = GA)
## GAMLSS-RS iteration  3: Global Deviance = 38257.0504 eps = 0.000005 
logLik(m)
## 'log Lik.' -19128.53 (df=24.69217)

Scoring rules:

library("topmodels")
proscore(m, type = "logLik", aggregate = sum)
##      logLik
## 1 -19128.53
proscore(m, type = c("logLik", "CRPS", "MAE", "MSE")) |>
  head(n = 3)
##      logLik     CRPS      MAE         MSE
## 1 -5.543098 46.49766 85.41233 7295.265874
## 2 -5.408971 23.87493 33.87696 1147.648170
## 3 -5.342137 19.37924  1.71472    2.940265

(In principle, these also work with a newdata argument. But currently extracting the newresponse() does not work for gamlss2, yet. Will add that later.)

Probabilistic forecasting: Full distribution objects, moments, probabilities, quantiles.

nd <- data.frame(area = 5:8 * 10, yearc = 1990, location = "1", bath = "1", cheating = "1")
procast(m, newdata = nd)
##                                                distribution
## 1 GAMLSS GA distribution (mu = 495.6917, sigma = 0.2213075)
## 2 GAMLSS GA distribution (mu = 575.9095, sigma = 0.2238684)
## 3 GAMLSS GA distribution (mu = 634.3635, sigma = 0.2264513)
## 4 GAMLSS GA distribution (mu = 679.2961, sigma = 0.2290589)
procast(m, newdata = nd, type = "variance")
##   variance
## 1 12034.15
## 2 16622.41
## 3 20636.02
## 4 24210.99
procast(m, newdata = nd, type = "cdf", at = 750, lower.tail = FALSE)
##   probability
## 1  0.01871957
## 2  0.09558729
## 3  0.20180883
## 4  0.30315932
procast(m, newdata = nd, type = "quantile", at = c(0.05, 0.5, 0.95))
##     q_0.05    q_0.5   q_0.95
## 1 330.0660 487.6231 688.8469
## 2 381.4687 566.3177 803.0775
## 3 417.9578 623.5536 887.6522
## 4 445.1584 667.4533 953.8415

Goodness-of-fit graphics: Hanging rootogram, PIT histogram, residual Q-Q plot, worm plot (detrended Q-Q plot). Either via base graphics or via ggplot2.

rootogram(m)
pithist(m)
qqrplot(m)
wormplot(m)

topmodels goodness-of-fit graphics for gamlss2

library("ggplot2")
rootogram(m)
pithist(m)
qqrplot(m)
wormplot(m)

topmodels goodness-of-fit graphics for gamlss2 via ggplot2

zeileis commented 5 months ago

Illustration: The manual page ?prodist.gamlss2 also shows how to work with the distributions3 objects "by hand" (rather than via the high-level interfaces in topmodels).

Packages, code, and data:

library("distributions3")
data("cars", package = "datasets")

Fit heteroscedastic normal GAMLSS model: Stopping distance (ft) explained by speed (mph).

m <- gamlss2(dist ~ s(speed) | s(speed), data = cars, family = NO)
## GAMLSS-RS iteration  4: Global Deviance = 405.6976 eps = 0.000000    

Obtain predicted distributions for three levels of speed:

d <- prodist(m, newdata = data.frame(speed = c(10, 20, 30)))
print(d)
##                                                    1 
## "GAMLSS NO distribution (mu = 23.04, sigma = 10.06)" 
##                                                    2 
## "GAMLSS NO distribution (mu = 59.04, sigma = 18.51)" 
##                                                    3 
## "GAMLSS NO distribution (mu = 96.35, sigma = 33.95)" 

Obtain quantiles (works the same for any distribution object d !):

quantile(d, 0.5)
##        1        2        3 
## 23.03912 59.03607 96.34894 
quantile(d, c(0.05, 0.5, 0.95), elementwise = FALSE)
##      q_0.05    q_0.5    q_0.95
## 1  6.486965 23.03912  39.59128
## 2 28.589634 59.03607  89.48250
## 3 40.504859 96.34894 152.19302
quantile(d, c(0.05, 0.5, 0.95), elementwise = TRUE)
##          1          2          3 
##   6.486965  59.036068 152.193022 

Visualization:

plot(dist ~ speed, data = cars)
nd <- data.frame(speed = 0:240/4)
nd$dist <- prodist(m, newdata = nd)
nd$fit <- quantile(nd$dist, c(0.05, 0.5, 0.95))
matplot(nd$speed, nd$fit, type = "l", lty = 1, col = "slategray", add = TRUE)

prodist example for cars cara

Moments:

mean(d)
##        1        2        3 
## 23.03912 59.03607 96.34894 
variance(d)
##         1         2         3 
##  101.2639  342.6244 1152.6562 

Simulate random numbers:

random(d, 5)
##        r_1       r_2      r_3        r_4       r_5
## 1 24.91776  25.49919 25.40398 25.5270449  16.98507
## 2 76.66499  86.89972 24.54569 84.5080020  52.88821
## 3 99.13546 116.57334 84.41239  0.9694538 108.17023

Density and distribution:

pdf(d, 50 * -2:2)
##         d_-100        d_-50          d_0        d_50        d_100
## 1 1.365773e-34 1.440745e-13 0.0028836926 0.001095127 7.891025e-15
## 2 2.012488e-18 6.289571e-10 0.0001332378 0.019131663 1.862073e-03
## 3 6.414069e-10 1.084305e-06 0.0002095207 0.004627637 1.168285e-02
cdf(d, 50 * -2:2)
##         p_-100        p_-50          p_0       p_50     p_100
## 1 1.116688e-34 1.961558e-13 0.0110254770 0.99631019 1.0000000
## 2 4.279173e-18 1.923747e-09 0.0007128557 0.31271502 0.9865531
## 3 3.661609e-09 8.139892e-06 0.0022705722 0.08609824 0.5428196

Poisson example:

data("FIFA2018", package = "distributions3")
m2 <- gamlss2(goals ~ s(difference), data = FIFA2018, family = PO)
## GAMLSS-RS iteration  2: Global Deviance = 355.3922 eps = 0.000000    
d2 <- prodist(m2, newdata = data.frame(difference = 0))
print(d2)
##                                     1 
## "GAMLSS PO distribution (mu = 1.237)" 
quantile(d2, c(0.05, 0.5, 0.95))
## [1] 0 1 3

Note that log_pdf() can replicate logLik() value:

sum(log_pdf(prodist(m2), FIFA2018$goals))
## [1] -177.6961
logLik(m2)
## 'log Lik.' -177.6961 (df=2.005144)
zeileis commented 5 months ago

Bug: This has nothing to do with my PR. But working with the example above, I noticed that currently the summary() method does not work for m2, maybe because there is just one model parameter?

summary(m2)
## Error in if (any(j <- (ps == "p") & (nx == i))) { : 
##   missing value where TRUE/FALSE needed
freezenik commented 5 months ago

Thanks you @zeileis! If fixed the summary problem now.

freezenik commented 5 months ago

Also great that you coded the prodist! I merge the PR now!

zeileis commented 5 months ago

Thanks for the PR merge and the quick summary() fix!

We should still revisit the following items at some point (but not necessarily now):

Maybe it's also easier to first discuss these in person.

freezenik commented 5 months ago

Great, thanks @zeileis for the input! I think this is also in line with @mstasinopoulos to predict the parameters of the distribution per default. This is the new default now.

I also added a newresponse.gamlss2() in prodist.R, hope that works?

Considering the distribution methods, I really not sure yet what to do ...

zeileis commented 5 months ago

Thanks, Niki!

For now the newresponse() is ok but I will likely make changes to this in topmodels in the next weeks. Maybe a dedicated method is then not needed.

Regarding the distribution methods: Before we do anything, it's better to discuss, I think. Given that the current solution works for the families in gamlss.dist this is not so urgent. Hopefully I'll be in the office tomorrow.