Progress Bar for GLMM - Githubissues

DarioS commented 1 month ago

Is it feasible to add a progress bar if a mixed model is used? It has been running for the past five hours without a progress message.

Running GLMM model - this may take a few minutes
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3410790 dario     20   0   39.0g  21.2g  99840 R  4250   2.8    6d+18h rsession

It also appears to consume all CPUs on a server, causing CPU starvation due to RStudio Server ignoring a user's ~/.bashrc file with:

export OPENBLAS_NUM_THREADS=1

I wonder how to choose between different options for glmm.solver in case one of them is less resource-intensive. I tried "Fisher".

MikeDMorgan commented 1 month ago

Resource use is controlled by a combination of OpenMP and BiocParallel - the latter for parallelising over the nhoods, and OpenMP for multi-threading in the Rcpp code. A default for glmm.solver will be added. In terms of choosing between solvers, Fisher is Quasi-Newton, so should have fast convergence when close to the (local) optimum. The HE is good for solving with a pre-defined covariance matrix, such as a genetic relationship matrix. HE-NNLS is implemented for the cases where variance parameters are negative, which is caused by "true" values being very close to zero. The solver will give a message and highlight that it is switching to the constrained optimisation.

GLMMs tend to scale poorly - while I have implemented a range of measures to cut this down, i.e. parallelisation and multi-threading, however, the bottlenecks are several dense matrix multiplications which are ~ $O(n^3)$.

I'm not sure why your Rstudio is ignoring environmental variables, as this is the recommendation: https://cran.r-project.org/doc/manuals/r-release/R-exts.html#OpenMP-support - you may need to seek Rstudio support to resolve the issue.

DarioS commented 2 weeks ago

Specifying sample ID as a fixed effect isn't a quick workaround to avoid enriched neighbourhoods comprised of one sample.

design <- data.frame(colData(MILOdata))[, c("Sample", "Smoker")]
design <- distinct(design)
rownames(design) <- design$Sample
testNhoods(MILOdata, design = ~ Smoker + Sample, design.df = design, fdr.weighting = "graph-overlap")
Using TMM normalisation
Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset,  : 
  Design matrix not of full rank. The following coefficients not estimable:
 SampleOSCC_26-P

yulijia commented 2 weeks ago

> library(BiocParallel)
> da_results <- testNhoods(hn_milo, design = ~ Condition+(1|patient_ID),
+                          design.df = hn_design,
+                          fdr.weighting="graph-overlap")
Random effects found
Using TMM normalisation
Running GLMM model - this may take a few minutes
Error in testNhoods(hn_milo, design = ~Condition + (1 | patient_ID), design.df = hn_design,  : 
  Lowest traceback returned: 5: handle_error(e)
4: h(simpleError(msg, call))
3: .handleSimpleError(function (e) 
   {
       annotated_condition <- handle_error(e)
       stop(annotated_condition)
   }, "argument is of length zero", base::quote(if (!glmm.control$solver %in% 
       c("HE", "Fisher", "HE-NNLS")) {
       stop(glmm.control$solver, " not recognised - must be HE, HE-NNLS or Fisher")
   }))
2: fitGLMM(X = Xmodel, Z = Zmodel, y = Y[i, ], offsets = off.sets, 
       random.levels = randlevels, REML = reml, dispersion = disper[i], 
       geno.only = genonly, Kin = kinship, glmm.control = glmm.contr)
1: FUN(...)
In addition: There were 16 warnings (use warnings() to see them)

Hi @MikeDMorgan, I performed a GLMM analysis. However, it threw an error that I couldn't identify. Is there any way to check where the error is coming from in this situation?

DarioS commented 2 weeks ago

Parameter glmm.solver is mandatory if a mixed model fitted, but it had no default. That one was already resolved by #328.

yulijia commented 2 weeks ago

Here is another problem: after performing the GLMM, my DA results include NA values. However, plotNhoodGraphDA and groupNhoods cannot handle NA values.

da_results logFC logCPM SE tvalue PValue patient_ID variance Converged Dispersion Logliklihood Nhood 1 -22.8268334 8.004828 10552.160078 -0.002163238 0.9982740 3.540525e+01 FALSE 0.01097306 -82.278581 1 2 -22.4599347 10.410420 10536.597689 -0.002131612 0.9982992 9.792410e+00 FALSE 0.01154928 -84.030955 2 3 -21.0766982 9.975377 10408.624273 -0.002024926 0.9983843 9.374782e-01 FALSE 0.01201461 -129.165084 3 4 NA 10.662702 NA NA NA NA FALSE NA NA 4 5 -22.1270556 9.093788 10354.004528 -0.002137053 0.9982949 7.603262e-01 FALSE 0.01194637 -64.592597 5 6 -22.3102589 8.740509 10457.424887 -0.002133437 0.9982978 3.258474e+01 FALSE 0.01143842 -110.076552 6 7 -21.4542960 13.159483 10246.905729 -0.002093734 0.9983294 2.088114e+01 FALSE 0.01157411 -81.276525 7

plotNhoodGraphDA(hn_milo, da_results, layout="UMAP",alpha=0.1) Error in [<-.data.frame(*tmp*, signif_res$SpatialFDR > alpha, res_column, : missing values are not allowed in subscripted assignments of data frames

da_results <- groupNhoods(hn_milo, da_results,compute.new =T,da.fdr =0.1)# max.lfc.delta = 2, Found NA DA neighbourhoods at FDR 10% Error in ..subscript.2ary(x, l[[1L]], l[[2L]], drop = drop[1L]) : NA subscripts in x[i,j] not supported for 'x' inheriting from sparseMatrix In addition: Warning message: In groupNhoods(hn_milo, da_results, compute.new = T, da.fdr = 0.1) : NA values found in SpatialFDR vector

MikeDMorgan commented 1 week ago

@yulijia You can use subset.nhoods to plot the nhoods with non-NA values. These usually arise because there is model separation - you have a variable that perfectly separates zero from non-zero counts in one or more nhoods. You can check for these using the checkSeparation function, with the goal being to adjust either your NN-graph or nhood refinement such that you don't get this separation (or you exclude these from your DA testing).

On a separate note - your issue is unrelated to the OP, so you should open a new issue not append yours to the end of this one.

MarioniLab / miloR

Progress Bar for GLMM #331