wrong details contained for `mgcv::gam` output from `model_parameters`

IndrajeetPatil commented 3 years ago

edf (estimated degrees of freedom) here are contained in the Coefficient column for Smooth Terms.

Maybe we can rename these columns to df1 and df2, and order them as Parameter, F, df1, df2, and p.value?

set.seed(2) 
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.

dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
#> Gu & Wahba 4 term additive model

b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

summary(b)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> y ~ s(x0) + s(x1) + s(x2) + s(x3)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  7.83328    0.09878    79.3   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>         edf Ref.df      F  p-value    
#> s(x0) 2.500  3.115  6.921 0.000131 ***
#> s(x1) 2.401  2.984 81.858  < 2e-16 ***
#> s(x2) 7.698  8.564 88.158  < 2e-16 ***
#> s(x3) 1.000  1.000  4.343 0.037818 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.715   Deviance explained = 72.5%
#> GCV = 4.0505  Scale est. = 3.9027    n = 400

parameters::model_parameters(b)
#> # Fixed Effects
#> 
#> Parameter   | Coefficient |   SE |       95% CI | t(385.40) |      p
#> --------------------------------------------------------------------
#> (Intercept) |        7.83 | 0.10 | [7.64, 8.03] |     79.30 | < .001
#> 
#> # Smooth Terms
#> 
#> Parameter        | Coefficient |     F |   df |      p
#> ------------------------------------------------------
#> Smooth term (x0) |        2.50 |  6.92 | 3.12 | < .001
#> Smooth term (x1) |        2.40 | 81.86 | 2.98 | < .001
#> Smooth term (x2) |        7.70 | 88.16 | 8.56 | < .001
#> Smooth term (x3) |        1.00 |  4.34 | 1.00 | 0.038

^{Created on 2020-11-27 by the reprex package (v0.3.0)}

Session info

``` r devtools::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.0.3 (2020-10-10) #> os macOS Mojave 10.14.6 #> system x86_64, darwin17.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Berlin #> date 2020-11-27 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2) #> bayestestR 0.7.5.1 2020-11-27 [1] Github (easystats/bayestestR@ba68c88) #> callr 3.5.1 2020-10-13 [1] CRAN (R 4.0.2) #> cli 2.2.0 2020-11-20 [1] CRAN (R 4.0.3) #> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2) #> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2) #> devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.2) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1) #> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2) #> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2) #> insight 0.11.0.1 2020-11-27 [1] Github (easystats/insight@7639faf) #> knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2) #> lattice 0.20-41 2020-04-02 [2] CRAN (R 4.0.3) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3) #> Matrix 1.2-18 2019-11-27 [2] CRAN (R 4.0.3) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2) #> mgcv * 1.8-33 2020-08-27 [2] CRAN (R 4.0.3) #> nlme * 3.1-149 2020-08-23 [2] CRAN (R 4.0.3) #> parameters 0.9.0.1 2020-11-27 [1] Github (easystats/parameters@abc447c) #> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2) #> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2) #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2) #> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2) #> ps 1.4.0 2020-10-07 [1] CRAN (R 4.0.2) #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2) #> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2) #> rlang 0.4.9 2020-11-26 [1] CRAN (R 4.0.3) #> rmarkdown 2.5 2020-10-21 [1] CRAN (R 4.0.3) #> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2) #> testthat 3.0.0 2020-10-31 [1] CRAN (R 4.0.2) #> usethis 1.6.3 2020-09-17 [1] CRAN (R 4.0.2) #> withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2) #> xfun 0.19 2020-10-30 [1] CRAN (R 4.0.2) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2) #> #> [1] /Users/patil/Library/R/4.0/library #> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library ```

strengejacke commented 3 years ago

@mattansb You're the expert in df-names, what do you think?

mattansb commented 3 years ago

I'm not sure what the problem is that we're trying to address here; @IndrajeetPatil can you elaborate?

strengejacke commented 3 years ago

The smooth part in the summary has columns edf and ref.df. the EDF (estimated df) is named Coefficient, while Res df are names df.

@IndrajeetPatil was suggesting naming those df1 and df2.

However, reading https://r.789695.n4.nabble.com/ref-df-in-mgcv-gam-td4756194.html we maybe can rename EDF to df_estimated and omit Res. df? Or what would you suggest?

IndrajeetPatil commented 3 years ago

I would retain both of them as it would be weird to have an F-statistic with just one degree of freedom. In fact, that's what caught my attention - the fact that we had F-statistic here but just one df column.

mattansb commented 3 years ago

From what I can tell, the ref.df isn't the usual denominator df of the F test. Instead:

efd = nominator df
residual df = denominator df

set.seed(2) 
library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.

dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
#> Gu & Wahba 4 term additive model

b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

s.table <- summary(b)$s.table

p1 <- pf(s.table[,3], df1 = s.table[,1], df2 = s.table[,2], lower.tail = F)
cbind(round(p1, 5),
      round(s.table[,4], 5))
#>          [,1]    [,2]
#> s(x0) 0.07012 0.00013
#> s(x1) 0.00239 0.00000
#> s(x2) 0.00000 0.00000
#> s(x3) 0.28482 0.03782

Not the same p-values.

However:

p1 <- pf(s.table[,3], df1 = s.table[,1], df2 = df.residual(b), lower.tail = F)
cbind(round(p1, 5),
      round(s.table[,4], 5))
#>          [,1]    [,2]
#> s(x0) 0.00041 0.00013
#> s(x1) 0.00000 0.00000
#> s(x2) 0.00000 0.00000
#> s(x3) 0.03782 0.03782

^{Created on 2020-11-28 by the reprex package (v0.3.0)}

IndrajeetPatil commented 3 years ago

Hmm, I think then we can use this existing naming schema: df_num and df_denom

as.data.frame(parameters::model_parameters(oneway.test(extra ~ group, data = sleep)))
#>          F df_num df_denom          p
#> 1 3.462627      1 17.77647 0.07939414
#>                                                     Method
#> 1 One-way analysis of means (not assuming equal variances)

^{Created on 2020-11-28 by the reprex package (v0.3.0)}

mattansb commented 3 years ago

Hmmm.... except for one way anova they should IMO df and df_error (also for t.test - should be df_error, if we want to keep it consistent...)

mattansb commented 3 years ago

In the general linear model framework, there two types of dfs:

Model df: the number of parameters minus 1 (or some approx of this, such as edf).
The error term's df (or some approx of this, such as in LMMs).

For t, there is only really the latter (the model df is always 1). For F you need both.

IMO we should be consistently calling them df and df_error across the easyverse.

IndrajeetPatil commented 3 years ago

I like it!

I think the num.df and denom.df choice might have been influenced by broom:

broom::tidy(oneway.test(extra ~ group, data = sleep))
#> Multiple parameters; naming those columns num.df, den.df
#> # A tibble: 1 x 5
#>   num.df den.df statistic p.value method                                        
#>    <dbl>  <dbl>     <dbl>   <dbl> <chr>                                         
#> 1      1   17.8      3.46  0.0794 One-way analysis of means (not assuming equal~

^{Created on 2020-11-28 by the reprex package (v0.3.0.9001)}

But we don't have to abide by their naming schema, especially with a strong statistical reasoning you so elegantly outlined to back up our choices! :)

mattansb commented 3 years ago

I am pretty elegant, aren't I?

(I think broom is coming from a purely statistical standpoint, whereas I see us gearing towards a broader audience ...)

strengejacke commented 3 years ago

So would you recommend reporting edf and df.residual, and omit res.df?

mattansb commented 3 years ago

I think so, yes.

strengejacke commented 3 years ago

@IndrajeetPatil can you please check if it now also works for models of class scam and gamlss?

IndrajeetPatil commented 3 years ago

Huh, interestingly, the same issue is also true for scam objects. Maybe we should be using the same method as for mgcv::gam?

# setup
set.seed(123)
library(scam)
#> Loading required package: mgcv
#> Loading required package: nlme
#> This is mgcv 1.8-33. For overview type 'help("mgcv-package")'.
#> This is scam 1.2-8.

# data
n <- 200
x1 <- runif(n) * 6 - 3
f1 <- 3 * exp(-x1^2) # unconstrained term
f1 <- (f1 - min(f1)) / (max(f1) - min(f1)) # function scaled to have range [0,1]
x2 <- runif(n) * 4 - 1
f2 <- exp(4 * x2) / (1 + exp(4 * x2)) # monotone increasing smooth
f2 <- (f2 - min(f2)) / (max(f2) - min(f2)) # function scaled to have range [0,1]
f <- f1 + f2
y <- f + rnorm(n) * 0.1
dat <- data.frame(x1 = x1, x2 = x2, y = y)

# model
b <-
  scam(
    y ~ s(x1, k = 15, bs = "cr", m = 2) + s(x2, k = 25, bs = "mpi", m = 2),
    family = gaussian(link = "identity"),
    data = dat,
    not.exp = FALSE
  )

summary(b)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> y ~ s(x1, k = 15, bs = "cr", m = 2) + s(x2, k = 25, bs = "mpi", 
#>     m = 2)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   0.2562     0.0703   3.645 0.000347 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>         edf Ref.df     F p-value    
#> s(x1) 8.931 10.615 223.4  <2e-16 ***
#> s(x2) 5.620  6.785 380.1  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.9649   Deviance explained = 96.8%
#> GCV score = 0.010902  Scale est. = 0.010054  n = 200

parameters::model_parameters(b)
#> # Fixed Effects
#> 
#> Parameter   | Coefficient |   SE |       95% CI | Statistic |     df |      p
#> -----------------------------------------------------------------------------
#> (Intercept) |        0.26 | 0.07 | [0.12, 0.39] |      3.65 | 184.45 | < .001
#> 
#> # Smooth Terms
#> 
#> Parameter        | Coefficient | Statistic |     df |      p
#> ------------------------------------------------------------
#> Smooth term (x1) |        8.93 |    223.43 | 184.45 | < .001
#> Smooth term (x2) |        5.62 |    380.05 | 184.45 | < .001

^{Created on 2020-11-29 by the reprex package (v0.3.0)}

IndrajeetPatil commented 3 years ago

Things look good with gamlss.

# setup
set.seed(123)
library(gamlss)
#> Loading required package: splines
#> Loading required package: gamlss.data
#> 
#> Attaching package: 'gamlss.data'
#> The following object is masked from 'package:datasets':
#> 
#>     sleep
#> Loading required package: gamlss.dist
#> Loading required package: MASS
#> Loading required package: nlme
#> Loading required package: parallel
#>  **********   GAMLSS Version 5.2-0  **********
#> For more on GAMLSS look at https://www.gamlss.com/
#> Type gamlssNews() to see new features/changes/bug fixes.

# model
g <-
  gamlss::gamlss(
    formula = y ~ pb(x),
    sigma.fo = ~ pb(x),
    family = BCT,
    data = abdom,
    method = mixed(1, 20)
  )
#> GAMLSS-RS iteration 1: Global Deviance = 4771.925 
#> GAMLSS-CG iteration 1: Global Deviance = 4771.013 
#> GAMLSS-CG iteration 2: Global Deviance = 4770.994 
#> GAMLSS-CG iteration 3: Global Deviance = 4770.994

summary(g)
#> ******************************************************************
#> Family:  c("BCT", "Box-Cox t") 
#> 
#> Call:  gamlss::gamlss(formula = y ~ pb(x), sigma.formula = ~pb(x),  
#>     family = BCT, data = abdom, method = mixed(1, 20)) 
#> 
#> Fitting method: mixed(1, 20) 
#> 
#> ------------------------------------------------------------------
#> Mu link function:  identity
#> Mu Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -64.44299    1.33397  -48.31   <2e-16 ***
#> pb(x)        10.69464    0.05787  184.80   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> ------------------------------------------------------------------
#> Sigma link function:  log
#> Sigma Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -2.65041    0.10859 -24.407  < 2e-16 ***
#> pb(x)       -0.01002    0.00380  -2.638  0.00855 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> ------------------------------------------------------------------
#> Nu link function:  identity 
#> Nu Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  -0.1072     0.6296   -0.17    0.865
#> 
#> ------------------------------------------------------------------
#> Tau link function:  log 
#> Tau Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.4948     0.4261   5.855 7.86e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> ------------------------------------------------------------------
#> NOTE: Additive smoothing terms exist in the formulas: 
#>  i) Std. Error for smoothers are for the linear effect only. 
#> ii) Std. Error for the linear terms maybe are not accurate. 
#> ------------------------------------------------------------------
#> No. of observations in the fit:  610 
#> Degrees of Freedom for the fit:  11.7603
#>       Residual Deg. of Freedom:  598.2397 
#>                       at cycle:  3 
#>  
#> Global Deviance:     4770.994 
#>             AIC:     4794.515 
#>             SBC:     4846.419 
#> ******************************************************************

parameters::model_parameters(g)
#> # Fixed Effects
#> 
#> Parameter   | Coefficient |   SE |           95% CI | t(598.24) |      p
#> ------------------------------------------------------------------------
#> (Intercept) |      -64.44 | 1.33 | [-67.06, -61.83] |    -48.31 | < .001
#> x           |       10.69 | 0.06 | [ 10.58,  10.81] |    184.80 | < .001
#> 
#> # Sigma
#> 
#> Parameter   | Coefficient |       SE |         95% CI | t(598.24) |      p
#> --------------------------------------------------------------------------
#> (Intercept) |       -2.65 |     0.11 | [-2.86, -2.44] |    -24.41 | < .001
#> x           |       -0.01 | 3.80e-03 | [-0.02,  0.00] |     -2.64 | 0.009 
#> 
#> # Nu
#> 
#> Parameter   | Coefficient |   SE |        95% CI | t(598.24) |     p
#> --------------------------------------------------------------------
#> (Intercept) |       -0.11 | 0.63 | [-1.34, 1.13] |     -0.17 | 0.865
#> 
#> # Tau
#> 
#> Parameter   | Coefficient |   SE |       95% CI | t(598.24) |      p
#> --------------------------------------------------------------------
#> (Intercept) |        2.49 | 0.43 | [1.66, 3.33] |      5.86 | < .001

^{Created on 2020-11-29 by the reprex package (v0.3.0)}

easystats / parameters

wrong details contained for `mgcv::gam` output from `model_parameters` #353