CausalInference / gfoRmula

The gfoRmula package implements the parametric g-formula in R. The parametric g-formula (Robins, 1986) uses longitudinal data with time-varying treatments and confounders to estimate the risk or mean of an outcome under hypothetical treatment strategies specified by the user.
150 stars 62 forks source link

Restricted cubic splines #39

Open Lachlan-Cribb opened 2 days ago

Lachlan-Cribb commented 2 days ago

Thanks for the great package!

The paper in Patterns and the vignettes describe modelling variables with restricted cubic splines, using the rcspline.eval() function from Hmisc. I've noticed that rcspline.eval() isn't returning what I would expect. For example, if you fit a model including a spline with three knots, only a single parameter is estimated, rather than the expected 2.

Here is an example putting a spline on L2 in the model for A:

library(gfoRmula)
library(Hmisc)

id <- "id"
time_points <- 7
time_name <- "t0"
covnames <- c("L1", "L2", "A")
outcome_name <- "Y"
outcome_type <- "survival"
covtypes <- c("binary", "bounded normal", "binary")
histories <- c(lagged, lagavg)
histvars <- list(c("A", "L1", "L2"), c("L1", "L2"))
covparams <- list(covmodels = c(
  L1 ~ lag1_A + lag_cumavg1_L1 + lag_cumavg1_L2 +
    L3 + t0,
  L2 ~ lag1_A + L1 + lag_cumavg1_L1 +
    lag_cumavg1_L2 + L3 + t0,
  A ~ lag1_A + L1 + rcspline.eval(L2, knots = c(-1,0,1)) + lag_cumavg1_L1 +
    lag_cumavg1_L2 + L3 + t0
))
ymodel <- Y ~ A + L1 + L2 + L3 + lag1_A + lag1_L1 + lag1_L2 + t0
intervention1.A <- list(static, rep(0, time_points))
intervention2.A <- list(static, rep(1, time_points))
int_descript <- c("Never treat", "Always treat")
nsimul <- 10000

gform_basic <- gformula(
  obs_data = basicdata_nocomp, id = id,
  time_points = time_points,
  time_name = time_name, covnames = covnames,
  outcome_name = outcome_name,
  outcome_type = outcome_type, covtypes = covtypes,
  covparams = covparams, ymodel = ymodel,
  intervention1.A = intervention1.A,
  intervention2.A = intervention2.A,
  int_descript = int_descript,
  histories = histories, histvars = histvars,
  basecovs = c("L3"), nsimul = nsimul,
  seed = 1234,
  model_fits = TRUE
)

This executes without problem, but looking at the model fit, there is only a single parameter for L2.

> summary(gform_basic$fits$A)

Call:
stats::glm(formula = stats::as.formula(paste(covmodels[j])), 
    family = eval(parse(text = famtext)), data = obs_data, y = TRUE)

Coefficients:
                                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)                            -0.0439262  0.4220692  -0.104    0.917    
lag1_A                                  0.5131417  0.4082792   1.257    0.209    
L1                                      0.4418390  0.0473139   9.338   <2e-16 ***
rcspline.eval(L2, knots = c(-1, 0, 1))  0.3878431  0.2707082   1.433    0.152    
lag_cumavg1_L1                          0.0312255  0.0633624   0.493    0.622    
lag_cumavg1_L2                          0.0252796  0.0509059   0.497    0.619    
L3                                      0.0004867  0.0118709   0.041    0.967    
t0                                      0.0106127  0.0166874   0.636    0.525    
---

On the other hand, using the rcs() function from the rms package seems to work, giving the expected two parameters:

covparams <- list(covmodels = c(
  L1 ~ lag1_A + lag_cumavg1_L1 + lag_cumavg1_L2 +
    L3 + t0,
  L2 ~ lag1_A + L1 + lag_cumavg1_L1 +
    lag_cumavg1_L2 + L3 + t0,
  A ~ lag1_A + L1 + rcs(L2, c(-1,0,1)) + lag_cumavg1_L1 +
    lag_cumavg1_L2 + L3 + t0
))
gform_basic <- gformula(
  obs_data = basicdata_nocomp, id = id,
  time_points = time_points,
  time_name = time_name, covnames = covnames,
  outcome_name = outcome_name,
  outcome_type = outcome_type, covtypes = covtypes,
  covparams = covparams, ymodel = ymodel,
  intervention1.A = intervention1.A,
  intervention2.A = intervention2.A,
  int_descript = int_descript,
  histories = histories, histvars = histvars,
  basecovs = c("L3"), nsimul = nsimul,
  seed = 1234,
  model_fits = TRUE
)

> summary(gform_basic$fits$A)

Call:
stats::glm(formula = stats::as.formula(paste(covmodels[j])), 
    family = eval(parse(text = famtext)), data = obs_data, y = TRUE)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -0.0448496  0.4220782  -0.106    0.915    
lag1_A                   0.4160439  0.4733874   0.879    0.379    
L1                       0.4418991  0.0473143   9.340   <2e-16 ***
rcs(L2, c(-1, 0, 1))L2  -0.0982649  0.2425120  -0.405    0.685    
rcs(L2, c(-1, 0, 1))L2'  0.4540280  0.3161731   1.436    0.151    
lag_cumavg1_L1           0.0312177  0.0633616   0.493    0.622    
lag_cumavg1_L2           0.0249991  0.0509117   0.491    0.623    
L3                       0.0004871  0.0118710   0.041    0.967    
t0                       0.0105864  0.0166879   0.634    0.526    
---

Session info:

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Pop!_OS 22.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

time zone: Australia/Melbourne
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rms_6.8-1      Hmisc_5.1-3    gfoRmula_1.1.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.5       xfun_0.44          ggplot2_3.5.1      htmlwidgets_1.6.4 
 [5] lattice_0.22-6     vctrs_0.6.5        tools_4.4.1        generics_0.1.3    
 [9] sandwich_3.1-0     tibble_3.2.1       fansi_1.0.6        highr_0.11        
[13] cluster_2.1.6      pkgconfig_2.0.3    R.oo_1.26.0        Matrix_1.7-0      
[17] data.table_1.15.4  checkmate_2.3.1    lifecycle_1.0.4    R.cache_0.16.0    
[21] compiler_4.4.1     stringr_1.5.1      MatrixModels_0.5-3 munsell_0.5.1     
[25] codetools_0.2-20   SparseM_1.83       quantreg_5.98      htmltools_0.5.8.1 
[29] htmlTable_2.4.2    Formula_1.2-5      pillar_1.9.0       MASS_7.3-61       
[33] R.utils_2.12.3     rpart_4.1.23       multcomp_1.4-25    nlme_3.1-165      
[37] styler_1.10.3      tidyselect_1.2.1   digest_0.6.35      mvtnorm_1.2-5     
[41] polspline_1.1.25   stringi_1.8.4      dplyr_1.1.4        purrr_1.0.2       
[45] splines_4.4.1      fastmap_1.2.0      grid_4.4.1         colorspace_2.1-0  
[49] cli_3.6.2          magrittr_2.0.3     base64enc_0.1-3    survival_3.7-0    
[53] utf8_1.2.4         TH.data_1.1-2      foreign_0.8-86     withr_3.0.0       
[57] scales_1.3.0       backports_1.5.0    rmarkdown_2.27     nnet_7.3-19       
[61] gridExtra_2.3      zoo_1.8-12         R.methodsS3_1.8.2  evaluate_0.24.0   
[65] knitr_1.47         rlang_1.1.4        glue_1.7.0         renv_1.0.7        
[69] rstudioapi_0.16.0  R6_2.5.1  
rwlogan commented 2 days ago

HI, I found in the beta testing that with Hmisc and the rcspline function you might need to use Hmisc::rcspline.eval instead. I am not sure why. Roger


From: Lachlan Cribb @.> Sent: Monday, October 14, 2024 4:15 AM To: CausalInference/gfoRmula @.> Cc: Subscribed @.***> Subject: [CausalInference/gfoRmula] Restricted cubic splines (Issue #39)

Thanks for the great package!

The paper in Patterns and the vignettes describe modelling variables with restricted cubic splines, using the rcspline.eval() function from Hmisc. I've noticed that rcspline.eval() isn't returning what I would expect. For example, if you fit a model including a spline with three knots, only a single parameter is estimated, rather than the expected 2.

Here is an example putting a spline on L2 in the model for A:

library(gfoRmula) library(Hmisc)

id <- "id" time_points <- 7 time_name <- "t0" covnames <- c("L1", "L2", "A") outcome_name <- "Y" outcome_type <- "survival" covtypes <- c("binary", "bounded normal", "binary") histories <- c(lagged, lagavg) histvars <- list(c("A", "L1", "L2"), c("L1", "L2")) covparams <- list(covmodels = c( L1 ~ lag1_A + lag_cumavg1_L1 + lag_cumavg1_L2 + L3 + t0, L2 ~ lag1_A + L1 + lag_cumavg1_L1 + lag_cumavg1_L2 + L3 + t0, A ~ lag1_A + L1 + rcspline.eval(L2, knots = c(-1,0,1)) + lag_cumavg1_L1 + lag_cumavg1_L2 + L3 + t0 )) ymodel <- Y ~ A + L1 + L2 + L3 + lag1_A + lag1_L1 + lag1_L2 + t0 intervention1.A <- list(static, rep(0, time_points)) intervention2.A <- list(static, rep(1, time_points)) int_descript <- c("Never treat", "Always treat") nsimul <- 10000

gform_basic <- gformula( obs_data = basicdata_nocomp, id = id, time_points = time_points, time_name = time_name, covnames = covnames, outcome_name = outcome_name, outcome_type = outcome_type, covtypes = covtypes, covparams = covparams, ymodel = ymodel, intervention1.A = intervention1.A, intervention2.A = intervention2.A, int_descript = int_descript, histories = histories, histvars = histvars, basecovs = c("L3"), nsimul = nsimul, seed = 1234, model_fits = TRUE )

This executes without problem, but looking at the model fit, there is only a single parameter for L2.

summary(gform_basic$fits$A)

Call: stats::glm(formula = stats::as.formula(paste(covmodels[j])), family = eval(parse(text = famtext)), data = obs_data, y = TRUE)

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.0439262 0.4220692 -0.104 0.917 lag1_A 0.5131417 0.4082792 1.257 0.209 L1 0.4418390 0.0473139 9.338 <2e-16 *** rcspline.eval(L2, knots = c(-1, 0, 1)) 0.3878431 0.2707082 1.433 0.152 lag_cumavg1_L1 0.0312255 0.0633624 0.493 0.622 lag_cumavg1_L2 0.0252796 0.0509059 0.497 0.619 L3 0.0004867 0.0118709 0.041 0.967 t0 0.0106127 0.0166874 0.636 0.525

On the other hand, using the rcs() function from the rms package seems to work, giving the expected two parameters:

covparams <- list(covmodels = c( L1 ~ lag1_A + lag_cumavg1_L1 + lag_cumavg1_L2 + L3 + t0, L2 ~ lag1_A + L1 + lag_cumavg1_L1 + lag_cumavg1_L2 + L3 + t0, A ~ lag1_A + L1 + rcs(L2, c(-1,0,1)) + lag_cumavg1_L1 + lag_cumavg1_L2 + L3 + t0 )) gform_basic <- gformula( obs_data = basicdata_nocomp, id = id, time_points = time_points, time_name = time_name, covnames = covnames, outcome_name = outcome_name, outcome_type = outcome_type, covtypes = covtypes, covparams = covparams, ymodel = ymodel, intervention1.A = intervention1.A, intervention2.A = intervention2.A, int_descript = int_descript, histories = histories, histvars = histvars, basecovs = c("L3"), nsimul = nsimul, seed = 1234, model_fits = TRUE )

summary(gform_basic$fits$A)

Call: stats::glm(formula = stats::as.formula(paste(covmodels[j])), family = eval(parse(text = famtext)), data = obs_data, y = TRUE)

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -0.0448496 0.4220782 -0.106 0.915 lag1_A 0.4160439 0.4733874 0.879 0.379 L1 0.4418991 0.0473143 9.340 <2e-16 *** rcs(L2, c(-1, 0, 1))L2 -0.0982649 0.2425120 -0.405 0.685 rcs(L2, c(-1, 0, 1))L2' 0.4540280 0.3161731 1.436 0.151 lag_cumavg1_L1 0.0312177 0.0633616 0.493 0.622 lag_cumavg1_L2 0.0249991 0.0509117 0.491 0.623 L3 0.0004871 0.0118710 0.041 0.967 t0 0.0105864 0.0166879 0.634 0.526

Session info:

sessionInfo() R version 4.4.1 (2024-06-14) Platform: x86_64-pc-linux-gnu Running under: Pop!_OS 22.04 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8 [4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

time zone: Australia/Melbourne tzcode source: system (glibc)

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] rms_6.8-1 Hmisc_5.1-3 gfoRmula_1.1.0

loaded via a namespace (and not attached): [1] gtable_0.3.5 xfun_0.44 ggplot2_3.5.1 htmlwidgets_1.6.4 [5] lattice_0.22-6 vctrs_0.6.5 tools_4.4.1 generics_0.1.3 [9] sandwich_3.1-0 tibble_3.2.1 fansi_1.0.6 highr_0.11 [13] cluster_2.1.6 pkgconfig_2.0.3 R.oo_1.26.0 Matrix_1.7-0 [17] data.table_1.15.4 checkmate_2.3.1 lifecycle_1.0.4 R.cache_0.16.0 [21] compiler_4.4.1 stringr_1.5.1 MatrixModels_0.5-3 munsell_0.5.1 [25] codetools_0.2-20 SparseM_1.83 quantreg_5.98 htmltools_0.5.8.1 [29] htmlTable_2.4.2 Formula_1.2-5 pillar_1.9.0 MASS_7.3-61 [33] R.utils_2.12.3 rpart_4.1.23 multcomp_1.4-25 nlme_3.1-165 [37] styler_1.10.3 tidyselect_1.2.1 digest_0.6.35 mvtnorm_1.2-5 [41] polspline_1.1.25 stringi_1.8.4 dplyr_1.1.4 purrr_1.0.2 [45] splines_4.4.1 fastmap_1.2.0 grid_4.4.1 colorspace_2.1-0 [49] cli_3.6.2 magrittr_2.0.3 base64enc_0.1-3 survival_3.7-0 [53] utf8_1.2.4 TH.data_1.1-2 foreign_0.8-86 withr_3.0.0 [57] scales_1.3.0 backports_1.5.0 rmarkdown_2.27 nnet_7.3-19 [61] gridExtra_2.3 zoo_1.8-12 R.methodsS3_1.8.2 evaluate_0.24.0 [65] knitr_1.47 rlang_1.1.4 glue_1.7.0 renv_1.0.7 [69] rstudioapi_0.16.0 R6_2.5.1

— Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_CausalInference_gfoRmula_issues_39&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=LdWqYgjY9ckccBYvpiAnjE30O4qK4wn7lb-i85xLduo&m=AVxWJXzXPm6uIXaqYcYJfKm3-qkMWrI2sDPXq-aqOS0KQWspppqW8-Edt2IYvIKQ&s=DcQTQxiX47K_pO8NGAtkaXpTIZ-w2kuGJBsZmW0pBGU&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AEZDUJI3FTZTRL2LJHHY2LTZ3N4TNAVCNFSM6AAAAABP4NZ2W2VHI2DSMVQWIX3LMV43ASLTON2WKOZSGU4DKMJYGM2DANI&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=LdWqYgjY9ckccBYvpiAnjE30O4qK4wn7lb-i85xLduo&m=AVxWJXzXPm6uIXaqYcYJfKm3-qkMWrI2sDPXq-aqOS0KQWspppqW8-Edt2IYvIKQ&s=JWCZBaV2e0wfj_NurnIYbWDYBVTt9G9DJ6VCQBFNFtA&e=. You are receiving this because you are subscribed to this thread.Message ID: @.***>

Lachlan-Cribb commented 2 days ago

Hi Roger,

I tried with Hmisc::rcspline.eval and it's the same issue, only a single parameter is estimated. This seems to be an issue with Hmisc. Executing the following line returns a vector rather than a matrix as its documentation suggests:

> str(Hmisc::rcspline.eval(basicdata_nocomp$L2, knots = c(-1,-0.5,0,1)))
 num [1:13170, 1:2] 1.72 1.04e-04 2.51e-07 1.51 0.00 ...
 - attr(*, "knots")= num [1:4] -1 -0.5 0 1

If it is a Hmisc issue, might be something to be aware of because people may be using rcspline.eval() inside gformula without errors/warnings and not getting the results they should.

rwlogan commented 2 days ago

Hi, Can you run the non-bootstrap R code in the benchmark package? It is in the gfoRmula-benchmark repository, https://github.com/CausalInference/gfoRmula-benchmark .

I ran this code with the latest version of R for windows in Rstudio and got the correct output for the splines. (Added the model_fits=TRUE option. )

Roger

summary(gform_test$fits$lncd4_v)

Call: stats::glm(formula = stats::as.formula(paste(covmodels[j])), family = eval(parse(text = famtext)), data = obs_data, y = TRUE)

Coefficients:

Hmisc::rcspline.eval(lag1_lnrna_v, knots = c(6.215, 7.601, 9.211, 10.597, 11.513))1 1.339e-01 1.106e-02 12.112 < 2e-16 Hmisc::rcspline.eval(lag1_lnrna_v, knots = c(6.215, 7.601, 9.211, 10.597, 11.513))2 -3.125e-01 3.933e-02 -7.946 1.94e-15 Hmisc::rcspline.eval(lag1_lnrna_v, knots = c(6.215, 7.601, 9.211, 10.597, 11.513))3 2.961e-01 7.394e-02 4.004 6.22e-05 ***

Hmisc::rcspline.eval(lag1_lncd4_v, knots = c(3.912, 4.605, 5.298, 5.858, 6.215))1 5.480e-02 3.964e-02 1.382 0.166900 Hmisc::rcspline.eval(lag1_lncd4_v, knots = c(3.912, 4.605, 5.298, 5.858, 6.215))2 -2.539e-02 1.282e-01 -0.198 0.843059 Hmisc::rcspline.eval(lag1_lncd4_v, knots = c(3.912, 4.605, 5.298, 5.858, 6.215))3 -1.824e-01 1.924e-01 -0.948 0.342961 -5.849e-03 7.945e-04 -7.362 1.82e-13 Hmisc::rcspline.eval(lnrna_v, knots = c(6.215, 7.601, 9.211, 10.597, 11.513))1 -3.236e-01 1.138e-02 -28.442 < 2e-16 Hmisc::rcspline.eval(lnrna_v, knots = c(6.215, 7.601, 9.211, 10.597, 11.513))2 8.323e-01 4.061e-02 20.497 < 2e-16 Hmisc::rcspline.eval(lnrna_v, knots = c(6.215, 7.601, 9.211, 10.597, 11.513))3 -9.240e-01 7.737e-02 -11.943 < 2e-16


From: Lachlan Cribb @.> Sent: Monday, October 14, 2024 4:58 AM To: CausalInference/gfoRmula @.> Cc: Logan, Roger @.>; Comment @.> Subject: Re: [CausalInference/gfoRmula] Restricted cubic splines (Issue #39)

Hi Roger,

I tried with Hmisc::rcspline.eval and it's the same issue, only a single parameter is estimated. This seems to be an issue with Hmisc. Executing the following line returns a vector rather than a matrix as its documentation suggests:

str(Hmisc::rcspline.eval(basicdata_nocomp$L2, knots = c(-1,-0.5,0,1))) num [1:13170, 1:2] 1.72 1.04e-04 2.51e-07 1.51 0.00 ...

  • attr(*, "knots")= num [1:4] -1 -0.5 0 1

If it is a Hmisc issue, might be something to be aware of because people may be using rcspline.eval() inside gformula without errors/warnings and not getting the results they should.

— Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_CausalInference_gfoRmula_issues_39-23issuecomment-2D2410506061&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=LdWqYgjY9ckccBYvpiAnjE30O4qK4wn7lb-i85xLduo&m=HkcQyxfpz2XC8IxCCkGM1s7AXRoXEJa1htyC-fub4gMm6yL9OUCEVNMUnqR9cNqy&s=RjnSqQJHddYOSA_WQ-XvoSHblUB65sXia7bM0PunVYg&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_AEZDUJJDMMICH6C5OPTWJALZ3OBUJAVCNFSM6AAAAABP4NZ2W2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMJQGUYDMMBWGE&d=DwMCaQ&c=WO-RGvefibhHBZq3fL85hQ&r=LdWqYgjY9ckccBYvpiAnjE30O4qK4wn7lb-i85xLduo&m=HkcQyxfpz2XC8IxCCkGM1s7AXRoXEJa1htyC-fub4gMm6yL9OUCEVNMUnqR9cNqy&s=W1nSOrv0kDMRGmah5V_Yn6T5Mkk2ppDcgtILS3dIyYk&e=. You are receiving this because you commented.

Lachlan-Cribb commented 1 day ago

I have been confused. The rcspline.eval function works fine. I use the rcs() function from the rms package routinely, and there are always k-1 parameters estimated, where k is the number of knots. With rcspline.eval(), there are only k-2 parameters. So I thought there must be a problem.

The difference is because when rcs interanlly calls rcspline.eval, it has the inclx argument set to TRUE (by default it is false). So a matrix with k-1 columns is created. I don't know enough about restricted cubic splines to know what role the inclx argument plays.