gamlss-dev / gamlss2

gamlss2: GAMLSS Infrastructure for Flexible Distributional Regression
https://gamlss-dev.github.io/gamlss2/

node stack overflow & feedback #5

Closed tmspvn closed 1 month ago

tmspvn commented 1 month ago

Dear developers,

Last April I posted about anonymizing a gamlss model, and it was suggested that I move to gamlss2. The suggestion was good, and since I promised feedback, here it is:

The size of the gamlss2 object with `light = TRUE` is still too big for exporting several models. To reduce it, I set the `$family` field to the family name instead of the family object, which accounts for roughly a quarter of the light object's size (0.747 MB of 3.064 MB in the example below):

```r
library(gamlss2)
library(glue)

data("abdom", package = "gamlss.data")

# heavy
heavy <- gamlss2(y ~ pb(x) | x, data = abdom, family = BCT)
print(glue("{round(object.size(heavy)/(1024*1024),3)} MB"))
#> 3.493 MB

# light
light <- gamlss2(y ~ pb(x) | x, data = abdom, family = BCT, light = TRUE)
print(glue("{round(object.size(light)/(1024*1024),3)} MB"))
#> 3.064 MB

# ultra light
ultra <- gamlss2(y ~ pb(x) | x, data = abdom, family = BCT, light = TRUE)
print(glue("ultra$family size: {round(object.size(ultra$family)/(1024*1024),3)} MB"))
#> 0.747 MB

# remove the family object but keep its name
ultra$family <- ultra$family$family
print(glue("After: {round(object.size(ultra)/(1024*1024),3)} MB"))
#> 2.317 MB

# reset to original state:
ultra$family <- gamlss2:::complete_family(ultra$family) # or eval() if there's no difference
```
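
For exporting many models at once, the same trick could be wrapped in a small helper. This is only a sketch: `strip_family()` is a hypothetical name, not a gamlss2 function, and the mock object merely stands in for a fitted model so that the snippet runs without gamlss2. It assumes the family is stored as a list whose `family` element holds the name, as in the example above.

```r
# Hypothetical helper, not part of gamlss2: replace the family object
# by its name so the model takes less space when exported.
strip_family <- function(model) {
  # Only strip if the family is still a full (list-like) object;
  # a model that was already stripped is returned unchanged.
  if (is.list(model$family)) {
    model$family <- model$family$family
  }
  model
}

# Mock stand-in for a fitted model (assumption, see lead-in).
mock <- structure(
  list(
    family = list(family = "BCT", d = function(y, par) NULL),
    coefficients = 1:3
  ),
  class = "mock_model"
)

# Apply the helper to a whole list of models before saving them.
models <- list(a = mock, b = mock)
stripped <- lapply(models, strip_family)
stripped$a$family
```

With real gamlss2 fits, each stripped model would be restored with the `gamlss2:::complete_family()` call shown above before it is used again.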

Furthermore, I've been consistently getting the following error:

```r
library(gamlss2)

set.seed(1)
data("abdom", package = "gamlss.data")
abdom$z <- rbinom(610, 1, 0.5)
f <- y ~ pb(x) * z | x
b <- gamlss2(f, data = abdom, family = BCT)
#> Error in (function (condition) : node stack overflow
```

This seems related to the interaction between the smooth term and a variable. Using gamlss::pb, gamlss2::pb or s() seems to make no difference. I can't figure out how to fix the problem, and I hit this error with 3 different datasets. My understanding was that "Interactions with nonparametric smooth terms are not fully supported, but will not produce errors; they will simply produce the usual parametric interaction" (p.48).

**sessionInfo()**

```
> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

Random number generation:
 RNG:     L'Ecuyer-CMRG
 Normal:  Inversion
 Sample:  Rejection

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fr_CH.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=fr_CH.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=fr_CH.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_CH.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Zurich
tzcode source: system (glibc)

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] glue_1.7.0        splitTools_1.0.1  doFuture_1.0.1    future_1.33.2     foreach_1.5.2     pbmcapply_1.5.1
 [7] progressr_0.14.0  itertools_0.1-3   iterators_1.0.14  ANTsR_0.5.7.5     ANTsRCore_0.7.5   gamlss2_0.1-0
[13] mgcv_1.9-1        gamlss_5.4-22     nlme_3.1-165      gamlss.dist_6.1-1 gamlss.data_6.0-6

loaded via a namespace (and not attached):
 [1] crayon_1.5.3        cli_3.6.2           rlang_1.1.4         Formula_1.2-5       pkgload_1.3.4       future.apply_1.11.2
 [7] listenv_0.9.1       grid_4.4.1          MASS_7.3-61         compiler_4.4.1      ITKR_0.6.0.0.2      codetools_0.2-20
[13] Rcpp_1.0.12         rstudioapi_0.16.0   RcppEigen_0.3.4.0.0 lattice_0.22-6      digest_0.6.35       parallelly_1.37.1
[19] magrittr_2.0.3      Matrix_1.7-0        tools_4.4.1         globals_0.16.3      survival_3.7-0
```

freezenik commented 1 month ago

Thank you for spotting this; I have now fixed your second issue. However, looking at the object sizes, I do not see why the size of the family object is an issue. For example:

```
R> set.seed(123)
R> n <- 100000
R> x <- runif(n, -3, 3)
R> y <- 10 + sin(x) + rnorm(n, sd = 0.3)
R> b1 <- gamlss2(y ~ s(x), family = BCT)
GAMLSS-RS iteration 8: Global Deviance = 44472.3394 eps = 0.000005
R> b2 <- gamlss2(y ~ s(x), family = BCT, light = TRUE)
Start estimation ...
GAMLSS-RS iteration 8: Global Deviance = 44472.3394 eps = 0.000005
R> format(object.size(b1), units = "Mb")
[1] "62.8 Mb"
R> format(object.size(b1$family), units = "Mb")
[1] "0.7 Mb"
R> format(object.size(b2), units = "Mb")
[1] "3.2 Mb"
R> format(object.size(b2$family), units = "Mb")
[1] "0.7 Mb"
```
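
To see which parts of a fitted object take up the space, the per-component sizes can be tabulated with base R. The following is a rough sketch: `component_sizes()` is a hypothetical helper, not part of gamlss2, and the toy list only stands in for a fitted model so that the snippet runs on its own.

```r
# Hypothetical helper (not part of gamlss2): size of each top-level
# component of a list-like object, largest first.
component_sizes <- function(obj) {
  sizes <- vapply(
    unclass(obj),
    function(el) as.numeric(object.size(el)),
    numeric(1)
  )
  sizes <- sort(sizes, decreasing = TRUE)
  data.frame(
    component = names(sizes),
    MB = round(sizes / 1024^2, 3),        # same 1024^2 convention as above
    share = round(sizes / sum(sizes), 3), # share of the summed component sizes
    row.names = NULL
  )
}

# Toy stand-in; with gamlss2 installed, pass a fitted model such as b2 instead.
toy <- list(small = 1:10, big = numeric(1e5))
res <- component_sizes(toy)
print(res)
```

The component sizes need not add up exactly to `object.size()` of the whole object, so the shares are best read as approximate.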

tmspvn commented 1 month ago

Thanks a lot for addressing the problems quickly.

I don't think it's an issue per se, but if you have a large number of models, 20% less memory is a significant saving. Anyway, it's good to have it posted for anyone who ends up in a similar situation.
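
As a back-of-the-envelope check, the saving can be computed from the sizes reported in the first comment (3.064 MB for the light object, 0.747 MB for its family). The number of models below is a purely hypothetical figure for illustration.

```r
# Sizes in MB as reported in the first comment of this thread.
light_mb  <- 3.064
family_mb <- 0.747

# Relative saving per model when the family object is replaced by its name.
saving <- family_mb / light_mb
round(saving, 3)

# Hypothetical batch size, chosen only for illustration.
n_models <- 1000
total_saved_mb <- n_models * family_mb
total_saved_mb
```

For the abdom example this comes to a bit under a quarter of each light object.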