const-ae / glmGamPoi

Fit Gamma-Poisson Generalized Linear Models Reliably
102 stars 14 forks source link

test_de failing with on_disk models #33

Open NathanSiemers opened 2 years ago

NathanSiemers commented 2 years ago

Hello Constantin,

Thank you for this code, I've been using and learning more about it for several months. I'm working with some large data sets and using glmGamPoi to perform modeling and hypothesis testing.

I've been able to create full models from HDF5Array input data sets and on_disk = TRUE. I did need to adjust the chunk size params to get this to occur in finite time. However, the on-disk version of the model output cannot be queried with test_de. Performing such test gives the error:

Error in combine_size_factors_and_offset(offset, size_factors, Y, verbose = verbose) : length(offset) == 1 || length(offset) == n_samples is not TRUE

(seems to be coming from estimate_size_factors.R)

If I run the same model with a realized input matrix and on_disk = FALSE, everything works as expected. I wonder if test_de can't determine the right dimensions or related parameters for on_disk models and is instead throwing this error?

It could be a general bug, it could be a problem that only shows up when one alters default H5 temp file paths, etc? But I've looked, and I think the on-disk model objects have hard-coded locations for data contained within them.... Hmm.

Attached is the log file (docx). You can get a full tgz of the code and input data at the link below. I've tried to make this an MRE, but even very truncated the singlecellexperiment file is a little big.

Thanks, Nathan Siemers mre.glmgampoi.bug.docx

tgz file: https://drive.google.com/file/d/1fYN4F7TtFpwgC6yqYZDP_rDHBiV4XU5H/view?usp=sharing

NathanSiemers commented 2 years ago

btw, removing the offset parameter in the glm_gp call doesn't change anything.

const-ae commented 2 years ago

Hey, sorry for the delay. I was at a workshop this week. I did take a briefly this afternoon while waiting for my flight and tried the code below to reproduce the issue, but my version finished successfully.

library(glmGamPoi)
model_matrix <- cbind(1, rnorm(5))
true_Beta <- cbind(rnorm(n = 30), rnorm(n = 30, mean = 3))
sf <- exp(rnorm(n = 5, mean = 0.7))
model_matrix
#>      [,1]       [,2]
#> [1,]    1  1.1480341
#> [2,]    1 -0.6869605
#> [3,]    1 -2.1089572
#> [4,]    1  0.1170987
#> [5,]    1  1.0913436
Y <- matrix(rnbinom(n = 30 * 5, mu = sf * exp(true_Beta %*% t(model_matrix)), size = 1/2.4),
            nrow = 30, ncol = 5)
options(DelayedArray.block.size = 1e9)
DelayedArray::setAutoBlockSize(1e9)
#> automatic block size set to 1e+09 bytes (was 1e+08)
DelayedArray::setAutoBlockShape('last-dim-grows-first')
#> automatic block shape set to "last-dim-grows-first" (was "hypercube")
Y_h5 <- HDF5Array::writeHDF5Array(Y)

lmf <- glm_gp(
  Y_h5,
  model_matrix,
  on_disk = TRUE,
  verbose = TRUE,
  offset = 0
)
#> Calculate Size Factors (normed_sum)
#> Make initial dispersion estimate
#> Make initial beta estimate
#> Estimate beta
#> Estimate dispersion
#> Fit dispersion trend
#> Shrink dispersion estimates
#> Estimate beta again
compare.treat = test_de( lmf, c(0,1), verbose = TRUE )
#> Fit reduced model
#> Calculate quasi likelihood ratio
#> Prepare results

Created on 2022-04-01 by the reprex package (v2.0.1)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.1.1 (2021-08-10) #> os macOS Big Sur 10.16 #> system x86_64, darwin17.0 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Berlin #> date 2022-04-01 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0) #> beachmat 2.8.1 2021-08-12 [1] Bioconductor #> Biobase 2.52.0 2021-05-19 [1] Bioconductor #> BiocGenerics 0.38.0 2021-05-19 [1] Bioconductor #> bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.0) #> cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0) #> DelayedArray 0.18.0 2021-05-19 [1] Bioconductor #> DelayedMatrixStats 1.14.3 2021-08-26 [1] Bioconductor #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0) #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0) #> GenomeInfoDb 1.28.4 2021-09-05 [1] Bioconductor #> GenomeInfoDbData 1.2.6 2021-09-09 [1] Bioconductor #> GenomicRanges 1.44.0 2021-05-19 [1] Bioconductor #> glmGamPoi * 1.4.0 2021-05-19 [1] Bioconductor #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0) #> HDF5Array 1.20.0 2021-05-19 [1] Bioconductor #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0) #> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.0) #> IRanges 2.26.0 2021-05-19 [1] Bioconductor #> knitr 1.34 2021-09-09 [1] CRAN (R 4.1.1) #> lattice 0.20-44 2021-05-02 [1] CRAN (R 4.1.1) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0) #> Matrix 1.3-4 2021-06-01 [1] CRAN (R 4.1.1) #> MatrixGenerics 1.5.3 2021-09-09 [1] Bioconductor #> matrixStats 0.60.1 2021-08-23 [1] CRAN (R 4.1.0) #> pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0) #> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0) #> RCurl 1.98-1.4 2021-08-17 [1] CRAN (R 4.1.0) #> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0) #> rhdf5 2.36.0 2021-05-19 [1] Bioconductor #> rhdf5filters 1.4.0 2021-05-19 [1] Bioconductor #> Rhdf5lib 1.14.2 2021-07-06 [1] Bioconductor #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0) #> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.1) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0) #> S4Vectors 0.30.0 2021-05-19 [1] Bioconductor #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0) #> sparseMatrixStats 1.4.2 2021-08-08 [1] Bioconductor #> stringi 1.7.4 2021-08-25 [1] CRAN (R 4.1.0) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0) #> styler 1.5.1 2021-07-13 [1] CRAN (R 4.1.0) #> SummarizedExperiment 1.22.0 2021-05-19 [1] Bioconductor #> tibble 3.1.4 2021-08-25 [1] CRAN (R 4.1.0) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0) #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0) #> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0) #> xfun 0.26 2021-09-14 [1] CRAN (R 4.1.1) #> XVector 0.32.0 2021-05-19 [1] Bioconductor #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0) #> zlibbioc 1.38.0 2021-05-19 [1] Bioconductor #> #> [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library ```

I will have to take a more detailed look what's the difference between your code/data and my version next week. But thanks for making such a detailed bug report, I am sure we will able to figure out what is going wrong here :)

Best, Constantin

seanken commented 1 year ago

Just wondering if there has been any updates on this issue? I have been having a similiar issue as well. Worth noting: when I pass a matrix (sparse or not) to glm_gp with on_disk=T I have the same issue (think it is suppose to make it a HDF5Array internally, yes?), while if I first make the HDF5Array object then feed it in I do not have the issue, so that work around works for me (in case others need such a work around).

const-ae commented 1 year ago

Hi Sean,

thanks for sharing your experience and the work-around. If you have a reproducible example (https://reprex.tidyverse.org/), I will try to take a look and figure out why the internal conversion fails.

Until then, I endorse your suggestion to manually create a HDF5Array object and call glmGamPoi with that :)