GabrielHoffman / variancePartition

Quantify and interpret divers of variation in multilevel gene expression experiments
http://gabrielhoffman.github.io/variancePartition/
60 stars 14 forks source link

Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length #64

Closed evaesquinas closed 2 years ago

evaesquinas commented 2 years ago

Dear Dr. Hoffman,

I am trying to run variancePartition and fitExtractVarPartModel() to get the fraction of variance explained by each variable added in the formula. In addition, I have previously adjusted my data by 1 variable using removeBatchEffect from limma. In order to see the fraction of variance by this variable after the adjustment, I run fitExtractVarPartModel... but I got this error:

Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
Calls: fitExtractVarPartModel ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
In addition: Warning message:
In matrix(unlist(varPart), nrow = length(varPart), byrow = TRUE) :
  data length [3828] is not a sub-multiple or multiple of the number of rows [1915]
Execution halted

If I run the function again with this variable + another variable, I get the same error but not the warning message.

Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
Calls: fitExtractVarPartModel ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Execution halted

On the other hand, if I run fitExtractVarPartModel with a DIFFERENT VARIABLE (and NOT adding the variable that I adjusted for), the run works and I don't get any error.

Note 1: The columns (samples) from exprObj (log2(normalized counts + 1)) follows the same order as you can find in the column where the samples are in data. The data has the samples as rownames and exprObj is a matrix. Although I saw that the example that it appears in the vignette has negative values and I have confirmed that my data works when I have a different variable in my formula, it is important to say (just in case) that the adjusted normalized counts contain negative values for some genes.

Note 2: I have tried to run variancePartition after having adjusted by the same variable using ComBat-Seq and I have run the same models and it works. I don't get any errors/warnings.

Do you know what could be the reason of the error?

Any help would be very appreciated.

Thanks very much in advance

My R session:

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] tidyr_1.2.0              biomaRt_2.48.3        dplyr_1.0.8    
 [4] variancePartition_1.22.0 Biobase_2.52.0         BiocGenerics_0.38.0      
 [7] scales_1.1.1             BiocParallel_1.26.0       limma_3.48.3
[10] ggplot2_3.3.5

loaded via a namespace (and not attached):
 [1] nlme_3.1-152           bitops_1.0-7           matrixStats_0.59.0
 [4] pbkrtest_0.5.1         bit64_4.0.5            filelock_1.0.2
 [7] doParallel_1.0.17      progress_1.2.2         httr_1.4.2
[10] GenomeInfoDb_1.28.0    tools_4.1.2            backports_1.4.1
[13] utf8_1.2.2             R6_2.5.1               KernSmooth_2.23-20
[16] DBI_1.1.2              colorspace_2.0-2       withr_2.4.3
[19] tidyselect_1.1.1       prettyunits_1.1.1      curl_4.3.2
[22] bit_4.0.4              compiler_4.1.2         cli_3.1.1
[25] xml2_1.3.3             labeling_0.4.2         caTools_1.18.2
[28] readr_2.1.2            rappdirs_0.3.3         stringr_1.4.0
[31] digest_0.6.29          minqa_1.2.4            colorRamps_2.3
[34] XVector_0.32.0         pkgconfig_2.0.3        lme4_1.1-28
[37] dbplyr_2.1.1           fastmap_1.1.0          rlang_1.0.1
[40] RSQLite_2.2.7          farver_2.1.0           generics_0.1.2
[43] gtools_3.9.2           RCurl_1.98-1.3         magrittr_2.0.2
[46] GenomeInfoDbData_1.2.6 Matrix_1.3-2           Rcpp_1.0.8
[49] munsell_0.5.0          S4Vectors_0.30.0       fansi_1.0.2
[52] lifecycle_1.0.1        stringi_1.7.6          MASS_7.3-54
[55] zlibbioc_1.38.0        gplots_3.1.1           plyr_1.8.6
[58] BiocFileCache_2.0.0    grid_4.1.2             blob_1.2.2
[61] crayon_1.4.2           lattice_0.20-44        Biostrings_2.60.1
[64] splines_4.1.2          hms_1.1.1              KEGGREST_1.32.0
[67] pillar_1.7.0           boot_1.3-28            reshape2_1.4.4
[70] codetools_0.2-18       stats4_4.1.2           XML_3.99-0.6
[73] glue_1.6.1             vctrs_0.3.8            png_0.1-7
[76] nloptr_2.0.0           tzdb_0.2.0             foreach_1.5.2
[79] gtable_0.3.0           purrr_0.3.4            assertthat_0.2.1
[82] cachem_1.0.6           broom_0.7.12           tibble_3.1.6
[85] iterators_1.0.14       AnnotationDbi_1.54.1   memoise_2.0.0
[88] IRanges_2.26.0         ellipsis_0.3.2
GabrielHoffman commented 2 years ago

Can you send a reproducible example of the first issue


From: evaesquinas @.> Sent: Monday, August 8, 2022 7:01:56 AM To: GabrielHoffman/variancePartition @.> Cc: Subscribed @.***> Subject: [GabrielHoffman/variancePartition] Error in .rowNamesDF<-(x, value = value) : invalid 'row.names' length (Issue #64)

USE CAUTION: External Message.

Dear Dr. Hoffman,

I am trying to run variancePartition and fitExtractVarPartModel() to get the fraction of variance explained by each variable added in the formula. In addition, I have previously adjusted my data by 1 variable using removeBatchEffect from limma. In order to see the fraction of variance by this variable after the adjustment, I run fitExtractVarPartModel... but I got this error:

Error in .rowNamesDF<-(x, value = value) : invalid 'row.names' length Calls: fitExtractVarPartModel ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<- In addition: Warning message: In matrix(unlist(varPart), nrow = length(varPart), byrow = TRUE) : data length [3828] is not a sub-multiple or multiple of the number of rows [1915] Execution halted

If I run the function again with this variable + another variable, I get the same error but not the warning message.

Error in .rowNamesDF<-(x, value = value) : invalid 'row.names' length Calls: fitExtractVarPartModel ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<- Execution halted

On the other hand, if I run fitExtractVarPartModel with a DIFFERENT VARIABLE (and NOT adding the variable that I adjusted for), the run works and I don't get any error.

Note 1: The columns (samples) from exprObj (log2(normalized counts + 1)) follows the same order as you can find in the column where the samples are in data. The data has the samples as rownames and exprObj is a matrix.

Note 2: I have tried to run variancePartitition after having adjusted by the same variable using ComBat-Seq and I have run the same models and it works. I don't get any errors/warnings.

Do you know what could be the reason of the error?

Any help would be very appreciated.

Thanks very much in advance

My R session:

R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.2 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale: [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base

other attached packages: [1] tidyr_1.2.0 biomaRt_2.48.3 targetidwp4_0.1.0 [4] dplyr_1.0.8 variancePartition_1.22.0 Biobase_2.52.0 [7] BiocGenerics_0.38.0 scales_1.1.1 BiocParallel_1.26.0 [10] limma_3.48.3 ggplot2_3.3.5

loaded via a namespace (and not attached): [1] nlme_3.1-152 bitops_1.0-7 matrixStats_0.59.0 [4] pbkrtest_0.5.1 bit64_4.0.5 filelock_1.0.2 [7] doParallel_1.0.17 progress_1.2.2 httr_1.4.2 [10] GenomeInfoDb_1.28.0 tools_4.1.2 backports_1.4.1 [13] utf8_1.2.2 R6_2.5.1 KernSmooth_2.23-20 [16] DBI_1.1.2 colorspace_2.0-2 withr_2.4.3 [19] tidyselect_1.1.1 prettyunits_1.1.1 curl_4.3.2 [22] bit_4.0.4 compiler_4.1.2 cli_3.1.1 [25] xml2_1.3.3 labeling_0.4.2 caTools_1.18.2 [28] readr_2.1.2 rappdirs_0.3.3 stringr_1.4.0 [31] digest_0.6.29 minqa_1.2.4 colorRamps_2.3 [34] XVector_0.32.0 pkgconfig_2.0.3 lme4_1.1-28 [37] dbplyr_2.1.1 fastmap_1.1.0 rlang_1.0.1 [40] RSQLite_2.2.7 farver_2.1.0 generics_0.1.2 [43] gtools_3.9.2 RCurl_1.98-1.3 magrittr_2.0.2 [46] GenomeInfoDbData_1.2.6 Matrix_1.3-2 Rcpp_1.0.8 [49] munsell_0.5.0 S4Vectors_0.30.0 fansi_1.0.2 [52] lifecycle_1.0.1 stringi_1.7.6 MASS_7.3-54 [55] zlibbioc_1.38.0 gplots_3.1.1 plyr_1.8.6 [58] BiocFileCache_2.0.0 grid_4.1.2 blob_1.2.2 [61] crayon_1.4.2 lattice_0.20-44 Biostrings_2.60.1 [64] splines_4.1.2 hms_1.1.1 KEGGREST_1.32.0 [67] pillar_1.7.0 boot_1.3-28 reshape2_1.4.4 [70] codetools_0.2-18 stats4_4.1.2 XML_3.99-0.6 [73] glue_1.6.1 vctrs_0.3.8 png_0.1-7 [76] nloptr_2.0.0 tzdb_0.2.0 foreach_1.5.2 [79] gtable_0.3.0 purrr_0.3.4 assertthat_0.2.1 [82] cachem_1.0.6 broom_0.7.12 tibble_3.1.6 [85] iterators_1.0.14 AnnotationDbi_1.54.1 memoise_2.0.0 [88] IRanges_2.26.0 ellipsis_0.3.2

— Reply to this email directly, view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_GabrielHoffman_variancePartition_issues_64&d=DwMCaQ&c=shNJtf5dKgNcPZ6Yh64b-ALLUrcfR-4CCQkZVKC8w3o&r=KdYcmw5SdXylMrTGSuNVkNJulowod64k0PTDC5BHZkk&m=L1LgY5AaIlEMttYRgvt4_pXoERNruA0PYbVDcsRJ5MKUQ23f5S-Anbdqfab2TrvL&s=_ojU5X_AfWOILOP0UpS-gF7t4i3cw8piQx1tYcIMlXU&e=, or unsubscribehttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_notifications_unsubscribe-2Dauth_ADFAXTOW4XCXJBBJRPCSSVLVYDSKJANCNFSM554WESUQ&d=DwMCaQ&c=shNJtf5dKgNcPZ6Yh64b-ALLUrcfR-4CCQkZVKC8w3o&r=KdYcmw5SdXylMrTGSuNVkNJulowod64k0PTDC5BHZkk&m=L1LgY5AaIlEMttYRgvt4_pXoERNruA0PYbVDcsRJ5MKUQ23f5S-Anbdqfab2TrvL&s=Vu_grmyGaL1xbzLajXJqTPoyqoA6vy4sfU5TxPeWokk&e=. You are receiving this because you are subscribed to this thread.Message ID: @.***>

evaesquinas commented 2 years ago

First of all, thanks very much for your reply. Due I am working with sensitive data, I have produced a similar example that gives me the error. Note that I tried to reduce the example as much as I could to get the error and to avoid pasting a huge example. Find attached the data, since it is quite big and I cannot upload it as a comment.

pheno.txt cts.txt

library(variancePartition)
library(dplyr)
library(tidyr)
cts <- read.csv("./cts.txt", sep="")
pheno <- read.csv("./pheno.txt", sep="")

categorical_variables <- c("Sample", "Var1", "Var2")
pheno[,categorical_variables] <- lapply(pheno[,categorical_variables],as.factor)
str(pheno)

cts.mat <- as.matrix(cts)

Model 1

set.seed(11344)
model1 = ~ (1|Var1)
varPart <- fitExtractVarPartModel(cts.mat, model1, pheno)
set.seed(NULL)

It works without any problem.

Model 2

set.seed(11344)
model2 = ~ (1|Var2) 
varPart <- fitExtractVarPartModel(cts.mat, model2, pheno)
set.seed(NULL)

I get this error:

Dividing work into 100 chunks...
Total:4 s
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
Calls: fitExtractVarPartModel ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
In addition: Warning message:
In matrix(unlist(varPart), nrow = length(varPart), byrow = TRUE) :
  data length [30] is not a sub-multiple or multiple of the number of rows [16]
Execution halted

Model 3

model3 = ~ (1|Var1) + (1|Var2)
set.seed(11344)
varPart <- fitExtractVarPartModel(cts.mat, model3, pheno)
set.seed(NULL)

I get this error:

Error in if (inherits(possibleError, "error") && grep("the fixed-effects model matrix is column rank deficient",  :
  missing value where TRUE/FALSE needed
Calls: fitExtractVarPartModel -> fitExtractVarPartModel -> .fitExtractVarPartModel
Execution halted

Note that those errors were obtained using slurm (I just launched my R script in the cluster that I am using, writing this piece of code at the beginning of my R script #!/usr/bin/Rscript --vanilla).

If I run each model in my personal computer (same version of the package) I get slightly different errors.

Model 1:

Dividing work into 100 chunks...

Total:4 s
There were 50 or more warnings (use warnings() to see the first 50)
>   set.seed(NULL)
> warnings()
Warning messages:
1: In serialize(data, node$con, xdr = FALSE) :
  'package:stats' may not be available when loading
2: In serialize(data, node$con, xdr = FALSE) :
  'package:stats' may not be available when loading
(the rest of the warnings are the same)

Model 2: I don't get the previous error In matrix(unlist(varPart), nrow = length(varPart), byrow = TRUE) : data length [30] is not a sub-multiple or multiple of the number of rows [16]

Dividing work into 100 chunks...

Total:1 s
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
In addition: Warning messages:
1: In serialize(data, node$con, xdr = FALSE) :
  'package:stats' may not be available when loading
(9 more warnings as the first one)

Model 3 (I don't get the same previous error)

Dividing work into 100 chunks...

Total:1 s
Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
In addition: There were 12 warnings (use warnings() to see them)
>   set.seed(NULL)
> warnings()
Warning messages:
1: In optwrap(optimizer, devfun, getStart(start, rho$pp),  ... :
  convergence code -4 from nloptwrap: NLOPT_ROUNDOFF_LIMITED: Roundoff errors led to a breakdown of the optimization algorithm. In this case, the returned minimum may still be useful. (e.g. this error occurs in NEWUOA if one tries to achieve a tolerance too close to machine precision.)
2: In optwrap(optimizer, devfun, getStart(start, rho$pp),  ... :
  convergence code -4 from nloptwrap: NLOPT_ROUNDOFF_LIMITED: Roundoff errors led to a breakdown of the optimization algorithm. In this case, the returned minimum may still be useful. (e.g. this error occurs in NEWUOA if one tries to achieve a tolerance too close to machine precision.)
3: In serialize(data, node$con, xdr = FALSE) :
  'package:stats' may not be available when loading

(remaining warnings are the same as the 3rd one)

Note that I use seeds just in case to have the same outputs (although it doesn't usually help and I am getting different errors...)

I hope this can help you a bit more to understand my problem.

Thanks very much in advance

GabrielHoffman commented 2 years ago

First, some errors are fixed others give more informative messages in the latest version v1.27.3 (https://diseaseneurogenomics.github.io/variancePartition/). I have made substantial fixes since v.1.22.0.

Second, lets look at model 2, but focus on the 5th gene:

set.seed(11344)
model2 = ~ (1|Var2) 
varPart <- fitExtractVarPartModel(cts.mat[5,,drop=FALSE], model2, pheno)

# Warning messages:
# 1: In optwrap(optimizer, devfun, getStart(start, rho$pp), lower = rho$lower,  :
#  convergence code -4 from nloptwrap: NLOPT_ROUNDOFF_LIMITED: Roundoff errors led
# to a breakdown of the optimization algorithm. In this case, the returned minimum 
# may still be useful.  (e.g. this error occurs in NEWUOA if one tries to achieve 
# a tolerance too close to machine precision.)

This is more an issue with the data than with the software. Looking at the expression of gene 5 versus Var2:

df = data.frame(Var2 = pheno$Var2, expr = cts.mat[5,])
df[order(df$Var2),]

We see that there is no variation within category. This means that there is no residual variance, and the model fit fails.

> df[order(df$Var2),]
     Var2          expr
S116    1 -1.981364e-03
S117    1 -1.981364e-03
S118    1 -1.981364e-03
S1      2  4.489186e-05
S2      2  4.489186e-05
S3      3  9.136233e-06
S4      3  9.136233e-06
S5      3  9.136233e-06
S6      3  9.136233e-06
S7      3  9.136233e-06
S8      3  9.136233e-06
S9      3  9.136233e-06
S13     3  9.136233e-06
S17     3  9.136233e-06
S18     3  9.136233e-06
S19     3  9.136233e-06
S20     3  9.136233e-06
S21     3  9.136233e-06
S22     3  9.136233e-06
S10     4  2.412923e-05
S11     4  2.412923e-05
S12     4  2.412923e-05
S14     4  2.412923e-05
S15     4  2.412923e-05
S16     4  2.412923e-05
S24     4  2.412923e-05
S25     4  2.412923e-05
S26     4  2.412923e-05
S28     5  9.884335e-06
S31     5  9.884335e-06
S32     5  9.884335e-06
S34     5  9.884335e-06
S36     5  9.884335e-06
S41     5  9.884335e-06
S42     5  9.884335e-06
S46     5  9.884335e-06
S47     5  9.884335e-06
S48     5  9.884335e-06
S51     5  9.884335e-06
S52     5  9.884335e-06
S54     5  9.884335e-06
S29     6  9.815644e-06
S30     6  9.815644e-06
S33     6  9.815644e-06
S35     6  9.815644e-06
S37     6  9.815644e-06
S38     6  9.815644e-06
S39     6  9.815644e-06
S40     6  9.815644e-06
S43     6  9.815644e-06
S44     6  9.815644e-06
S45     6  9.815644e-06
S49     7 -8.161759e-05
S50     7 -8.161759e-05
S56     7 -8.161759e-05
S57     7 -8.161759e-05
S58     7 -8.161759e-05
S59     7 -8.161759e-05
S60     7 -8.161759e-05
S61     7 -8.161759e-05
S62     7 -8.161759e-05
S63     7 -8.161759e-05
S64     7 -8.161759e-05
S71     7 -8.161759e-05
S55     8  2.274638e-05
S65     8  2.274638e-05
S66     8  2.274638e-05
S67     8  2.274638e-05
S68     8  2.274638e-05
S69     8  2.274638e-05
S70     8  2.274638e-05
S74     8  2.274638e-05
S23     9  9.736041e-06
S27     9  9.736041e-06
S53     9  9.736041e-06
S72     9  9.736041e-06
S73     9  9.736041e-06
S75     9  9.736041e-06
S76     9  9.736041e-06
S77     9  9.736041e-06
S78     9  9.736041e-06
S80     9  9.736041e-06
S83     9  9.736041e-06
S85     9  9.736041e-06
S93     9  9.736041e-06
S97     9  9.736041e-06
S102    9  9.736041e-06
S79    10  4.058674e-05
S81    10  4.058674e-05
S82    10  4.058674e-05
S86    10  4.058674e-05
S89    10  4.058674e-05
S91    10  4.058674e-05
S92    10  4.058674e-05
S94    10  4.058674e-05
S95    10  4.058674e-05
S96    10  4.058674e-05
S98    10  4.058674e-05
S101   10  4.058674e-05
S84    11  2.747669e-05
S87    11  2.747669e-05
S88    11  2.747669e-05
S90    11  2.747669e-05
S99    11  2.747669e-05
S100   11  2.747669e-05
S103   12  2.413696e-05
S104   12  2.413696e-05
S105   12  2.413696e-05
S106   12  2.413696e-05
S107   12  2.413696e-05
S109   12  2.413696e-05
S110   12  2.413696e-05
S111   12  2.413696e-05
S113   12  2.413696e-05
S114   12  2.413696e-05
S115   12  2.413696e-05
S108   13  2.782057e-05
S112   13  2.782057e-05

In real data, this would be extremely unlikely since it means there is no statistical noise in the data. So a regression model is

Gabriel

evaesquinas commented 2 years ago

Dear Dr. Hoffman,

Thanks very much for your help. I updated to v.1.26 (the stable release in Bioconductor) and the errors are quite different. In fact, when I run model 1, it appears something that I haven't seen: Error in eval_f(x, ...): Downdated VtV is not positive definite as well as the warning that you posted. In fact, I have seen here that the origin of this error is from glmer, where there is complete separation in the model.

I would need to ask you some questions more if you don't mind:

Q1: In relation to the following warning.. If the model doesn't fail (using all the genes, not taking only one as you did) but I get this warning, the conclusion will be the same, right? That there is no variation within category. Or should I understand it in another way?

# Warning messages:
# 1: In optwrap(optimizer, devfun, getStart(start, rho$pp), lower = rho$lower,  :
#  convergence code -4 from nloptwrap: NLOPT_ROUNDOFF_LIMITED: Roundoff errors led
# to a breakdown of the optimization algorithm. In this case, the returned minimum 
# may still be useful.  (e.g. this error occurs in NEWUOA if one tries to achieve 
# a tolerance too close to machine precision.)

Q2: What I don't understand is why the model fails when it has all the genes and this doesn't happen (although it gives you a warning) when you have 1 gene (as you tried). How is this possible? Is it because when it finds a gene without any variance, the model stops and it gives you the error?

Q3: Just out of curiosity, how did you choose gene 5? (because if I choose gene 4, gene 6... I don't get the same warning and therefore, there is variance within category). The reason of my question is that I am trying to find a similar case in my real data to see the same effect, but I don't know how to find a gene without variation within category.

Thanks in advance again

GabrielHoffman commented 2 years ago

Q1: This tells you that at least one gene failed. But it unfortunately doesn't tell you which one

Q2: This failure for one gene causes a different error when it tries to assemble the results for multiple genes together. That's why you see the error about the number of rows.

This is not ideal, but in practice it almost never comes up. Further filtering your data should remove the problematic genes. With your simulated data, it means the simulations weren't very realistic because there was no random noise. I don't know about your real data, but it likely has to do with including genes that should be excluded for some simple reason

Q3: I did the analysis one gene at a time and found when it failed

for(i in 1:nrow(cts.mat)){
    message(i)
    varPart <- fitExtractVarPartModel(cts.mat[i,,drop=FALSE], model2, pheno, quiet=TRUE)
}

I bet that the same issue is going to affect most of your problematic genes.

evaesquinas commented 2 years ago

Dear Dr. Hoffman,

Thank very much again for your reply and your patience answering me.

I tried the for loop to do the analysis one gene at a time and it failed in the 15th gene (not the 5th).

I am trying to understand what you replied before: We see that there is no variation within category. This means that there is no residual variance, and the model fit fails.

I thought I understood it, but I have compared the df of gene 1, gene 5 and gene 15 and they follow the same pattern. I don't see any variation within category in any of the genes, but gene 1 didn't fail. Maybe I misunderstood what you said.

Q2: It may be obvious, but what do you exactly mean with "there is no variation within category"? Because I am checking the expression of each gene and they have the same expression in each level of Var2.

(E.g. gene 1) Screenshot from 2022-08-11 10-13-25

Here I attach the results for gene 1, gene 5 and gene 15.

Gene 5

df = data.frame(Var2 = pheno$Var2, expr = cts.mat[5,])
df[order(df$Var2),]
     Var2          expr
S116    1 -1.981364e-03
S117    1 -1.981364e-03
S118    1 -1.981364e-03
S1      2  4.489186e-05
S2      2  4.489186e-05
S3      3  9.136233e-06
S4      3  9.136233e-06
S5      3  9.136233e-06
S6      3  9.136233e-06
S7      3  9.136233e-06
S8      3  9.136233e-06
S9      3  9.136233e-06
S13     3  9.136233e-06
S17     3  9.136233e-06
S18     3  9.136233e-06
S19     3  9.136233e-06
S20     3  9.136233e-06
S21     3  9.136233e-06
S22     3  9.136233e-06
S10     4  2.412923e-05
S11     4  2.412923e-05
S12     4  2.412923e-05
S14     4  2.412923e-05
S15     4  2.412923e-05
S16     4  2.412923e-05
S24     4  2.412923e-05
S25     4  2.412923e-05
S26     4  2.412923e-05
S28     5  9.884335e-06
S31     5  9.884335e-06
S32     5  9.884335e-06
S34     5  9.884335e-06
S36     5  9.884335e-06
S41     5  9.884335e-06
S42     5  9.884335e-06
S46     5  9.884335e-06
S47     5  9.884335e-06
S48     5  9.884335e-06
S51     5  9.884335e-06
S52     5  9.884335e-06
S54     5  9.884335e-06
S29     6  9.815644e-06
S30     6  9.815644e-06
S33     6  9.815644e-06
S35     6  9.815644e-06
S37     6  9.815644e-06
S38     6  9.815644e-06
S39     6  9.815644e-06
S40     6  9.815644e-06
S43     6  9.815644e-06
S44     6  9.815644e-06
S45     6  9.815644e-06
S49     7 -8.161759e-05
S50     7 -8.161759e-05
S56     7 -8.161759e-05
S57     7 -8.161759e-05
S58     7 -8.161759e-05
S59     7 -8.161759e-05
S60     7 -8.161759e-05
S61     7 -8.161759e-05
S62     7 -8.161759e-05
S63     7 -8.161759e-05
S64     7 -8.161759e-05
S71     7 -8.161759e-05
S55     8  2.274638e-05
S65     8  2.274638e-05
S66     8  2.274638e-05
S67     8  2.274638e-05
S68     8  2.274638e-05
S69     8  2.274638e-05
S70     8  2.274638e-05
S74     8  2.274638e-05
S23     9  9.736041e-06
S27     9  9.736041e-06
S53     9  9.736041e-06
S72     9  9.736041e-06
S73     9  9.736041e-06
S75     9  9.736041e-06
S76     9  9.736041e-06
S77     9  9.736041e-06
S78     9  9.736041e-06
S80     9  9.736041e-06
S83     9  9.736041e-06
S85     9  9.736041e-06
S93     9  9.736041e-06
S97     9  9.736041e-06
S102    9  9.736041e-06
S79    10  4.058674e-05
S81    10  4.058674e-05
S82    10  4.058674e-05
S86    10  4.058674e-05
S89    10  4.058674e-05
S91    10  4.058674e-05
S92    10  4.058674e-05
S94    10  4.058674e-05
S95    10  4.058674e-05
S96    10  4.058674e-05
S98    10  4.058674e-05
S101   10  4.058674e-05
S84    11  2.747669e-05
S87    11  2.747669e-05
S88    11  2.747669e-05
S90    11  2.747669e-05
S99    11  2.747669e-05
S100   11  2.747669e-05
S103   12  2.413696e-05
S104   12  2.413696e-05
S105   12  2.413696e-05
S106   12  2.413696e-05
S107   12  2.413696e-05
S109   12  2.413696e-05
S110   12  2.413696e-05
S111   12  2.413696e-05
S113   12  2.413696e-05
S114   12  2.413696e-05
S115   12  2.413696e-05
S108   13  2.782057e-05
S112   13  2.782057e-05

Gene 15

df2 = data.frame(Var2 = pheno$Var2, expr = cts.mat[15,])
df2[order(df2$Var2),]
     Var2          expr
S116    1  1.320086e-05
S117    1  1.320086e-05
S118    1  1.320086e-05
S1      2  3.198804e-05
S2      2  3.198804e-05
S3      3  2.656940e-05
S4      3  2.656940e-05
S5      3  2.656940e-05
S6      3  2.656940e-05
S7      3  2.656940e-05
S8      3  2.656940e-05
S9      3  2.656940e-05
S13     3  2.656940e-05
S17     3  2.656940e-05
S18     3  2.656940e-05
S19     3  2.656940e-05
S20     3  2.656940e-05
S21     3  2.656940e-05
S22     3  2.656940e-05
S10     4  2.049358e-05
S11     4  2.049358e-05
S12     4  2.049358e-05
S14     4  2.049358e-05
S15     4  2.049358e-05
S16     4  2.049358e-05
S24     4  2.049358e-05
S25     4  2.049358e-05
S26     4  2.049358e-05
S28     5 -3.622413e-05
S31     5 -3.622413e-05
S32     5 -3.622413e-05
S34     5 -3.622413e-05
S36     5 -3.622413e-05
S41     5 -3.622413e-05
S42     5 -3.622413e-05
S46     5 -3.622413e-05
S47     5 -3.622413e-05
S48     5 -3.622413e-05
S51     5 -3.622413e-05
S52     5 -3.622413e-05
S54     5 -3.622413e-05
S29     6  2.443886e-05
S30     6  2.443886e-05
S33     6  2.443886e-05
S35     6  2.443886e-05
S37     6  2.443886e-05
S38     6  2.443886e-05
S39     6  2.443886e-05
S40     6  2.443886e-05
S43     6  2.443886e-05
S44     6  2.443886e-05
S45     6  2.443886e-05
S49     7  1.521231e-05
S50     7  1.521231e-05
S56     7  1.521231e-05
S57     7  1.521231e-05
S58     7  1.521231e-05
S59     7  1.521231e-05
S60     7  1.521231e-05
S61     7  1.521231e-05
S62     7  1.521231e-05
S63     7  1.521231e-05
S64     7  1.521231e-05
S71     7  1.521231e-05
S55     8 -1.395212e-04
S65     8 -1.395212e-04
S66     8 -1.395212e-04
S67     8 -1.395212e-04
S68     8 -1.395212e-04
S69     8 -1.395212e-04
S70     8 -1.395212e-04
S74     8 -1.395212e-04
S23     9  2.100696e-05
S27     9  2.100696e-05
S53     9  2.100696e-05
S72     9  2.100696e-05
S73     9  2.100696e-05
S75     9  2.100696e-05
S76     9  2.100696e-05
S77     9  2.100696e-05
S78     9  2.100696e-05
S80     9  2.100696e-05
S83     9  2.100696e-05
S85     9  2.100696e-05
S93     9  2.100696e-05
S97     9  2.100696e-05
S102    9  2.100696e-05
S79    10  2.221810e-05
S81    10  2.221810e-05
S82    10  2.221810e-05
S86    10  2.221810e-05
S89    10  2.221810e-05
S91    10  2.221810e-05
S92    10  2.221810e-05
S94    10  2.221810e-05
S95    10  2.221810e-05
S96    10  2.221810e-05
S98    10  2.221810e-05
S101   10  2.221810e-05
S84    11  1.946738e-05
S87    11  1.946738e-05
S88    11  1.946738e-05
S90    11  1.946738e-05
S99    11  1.946738e-05
S100   11  1.946738e-05
S103   12 -9.118168e-06
S104   12 -9.118168e-06
S105   12 -9.118168e-06
S106   12 -9.118168e-06
S107   12 -9.118168e-06
S109   12 -9.118168e-06
S110   12 -9.118168e-06
S111   12 -9.118168e-06
S113   12 -9.118168e-06
S114   12 -9.118168e-06
S115   12 -9.118168e-06
S108   13 -1.636078e-05
S112   13 -1.636078e-05

Gene 1

df3 = data.frame(Var2 = pheno$Var2, expr = cts.mat[1,])
df3[order(df3$Var2),]
     Var2          expr
S116    1  2.866586e-05
S117    1  2.866586e-05
S118    1  2.866586e-05
S1      2 -5.470136e-05
S2      2 -5.470136e-05
S3      3 -3.663836e-04
S4      3 -3.663836e-04
S5      3 -3.663836e-04
S6      3 -3.663836e-04
S7      3 -3.663836e-04
S8      3 -3.663836e-04
S9      3 -3.663836e-04
S13     3 -3.663836e-04
S17     3 -3.663836e-04
S18     3 -3.663836e-04
S19     3 -3.663836e-04
S20     3 -3.663836e-04
S21     3 -3.663836e-04
S22     3 -3.663836e-04
S10     4 -4.320430e-05
S11     4 -4.320430e-05
S12     4 -4.320430e-05
S14     4 -4.320430e-05
S15     4 -4.320430e-05
S16     4 -4.320430e-05
S24     4 -4.320430e-05
S25     4 -4.320430e-05
S26     4 -4.320430e-05
S28     5 -1.727127e-05
S31     5 -1.727127e-05
S32     5 -1.727127e-05
S34     5 -1.727127e-05
S36     5 -1.727127e-05
S41     5 -1.727127e-05
S42     5 -1.727127e-05
S46     5 -1.727127e-05
S47     5 -1.727127e-05
S48     5 -1.727127e-05
S51     5 -1.727127e-05
S52     5 -1.727127e-05
S54     5 -1.727127e-05
S29     6 -1.375190e-04
S30     6 -1.375190e-04
S33     6 -1.375190e-04
S35     6 -1.375190e-04
S37     6 -1.375190e-04
S38     6 -1.375190e-04
S39     6 -1.375190e-04
S40     6 -1.375190e-04
S43     6 -1.375190e-04
S44     6 -1.375190e-04
S45     6 -1.375190e-04
S49     7 -1.271191e-04
S50     7 -1.271191e-04
S56     7 -1.271191e-04
S57     7 -1.271191e-04
S58     7 -1.271191e-04
S59     7 -1.271191e-04
S60     7 -1.271191e-04
S61     7 -1.271191e-04
S62     7 -1.271191e-04
S63     7 -1.271191e-04
S64     7 -1.271191e-04
S71     7 -1.271191e-04
S55     8 -6.060821e-05
S65     8 -6.060821e-05
S66     8 -6.060821e-05
S67     8 -6.060821e-05
S68     8 -6.060821e-05
S69     8 -6.060821e-05
S70     8 -6.060821e-05
S74     8 -6.060821e-05
S23     9 -1.168303e-05
S27     9 -1.168303e-05
S53     9 -1.168303e-05
S72     9 -1.168303e-05
S73     9 -1.168303e-05
S75     9 -1.168303e-05
S76     9 -1.168303e-05
S77     9 -1.168303e-05
S78     9 -1.168303e-05
S80     9 -1.168303e-05
S83     9 -1.168303e-05
S85     9 -1.168303e-05
S93     9 -1.168303e-05
S97     9 -1.168303e-05
S102    9 -1.168303e-05
S79    10  2.549976e-05
S81    10  2.549976e-05
S82    10  2.549976e-05
S86    10  2.549976e-05
S89    10  2.549976e-05
S91    10  2.549976e-05
S92    10  2.549976e-05
S94    10  2.549976e-05
S95    10  2.549976e-05
S96    10  2.549976e-05
S98    10  2.549976e-05
S101   10  2.549976e-05
S84    11 -2.167725e-04
S87    11 -2.167725e-04
S88    11 -2.167725e-04
S90    11 -2.167725e-04
S99    11 -2.167725e-04
S100   11 -2.167725e-04
S103   12 -4.168729e-04
S104   12 -4.168729e-04
S105   12 -4.168729e-04
S106   12 -4.168729e-04
S107   12 -4.168729e-04
S109   12 -4.168729e-04
S110   12 -4.168729e-04
S111   12 -4.168729e-04
S113   12 -4.168729e-04
S114   12 -4.168729e-04
S115   12 -4.168729e-04
S108   13  8.109313e-05
S112   13  8.109313e-05

Maybe I am overthinking with this example since is not real, but I would like to understand all these errors in case I find them in real data and therefore, I will know what I should do next.

Thanks very much again and sorry for disturbing.

GabrielHoffman commented 2 years ago

You will all most never see these errors in real data...

Q2: 'no variation within category'. For gene 1, all the expression values for category 1 are -1.981364e-03, and all the values for category 2 are 4.489186e-05. This means that there is no variance within category. This almost never comes up in real data because there is always statistical and measurement noise. This is just an issue with your simulations.

S116    1 -1.981364e-03
S117    1 -1.981364e-03
S118    1 -1.981364e-03
S1      2  4.489186e-05
S2      2  4.489186e-05
S3      3  9.136233e-06
S4      3  9.136233e-06
S5      3  9.136233e-06

Q1: dream() and the underlying lmer() functions use iterative numerical methods to estimate parameters. These methods can suffer from rounding errors when values are very small. In this case, there is no variation within categories, so the residual variance is zero, or at least a very, very small number. The exact value of the residual variance is different for each gene due to rounding. So a value of 0 vs 1e-16 can cause different internal checks to fail so you get different warnings or errors.

Overall, I'm not concerned since it is extremely rare to encounter these issue in real data. But I am looking into improving the error messages

Gabriel

evaesquinas commented 2 years ago

Thanks very much for your help and your explanations, I really appreciate it. After all these doubts, I think I understand better what it is going on with my real data, thanks.