adw96 / breakaway

Species richness with high diversity
68 stars 18 forks source link

Error with matrix when running betta() #108

Closed eepackard closed 3 years ago

eepackard commented 3 years ago

Hello,

I am receiving an error when testing my hypothesis with betta(). A similar issue was discussed in #85 but I wasn't clear about how it was resolved.

Here is what I was trying to test: bm <- breakaway::betta(chats = betta.df.min$Estimate, ses = betta.df.min$Error, X = model.matrix(~Severity*Sample_Date, data = betta.df.min)) Error in solve.default(t(X_effective) %*% X_effective) : system is computationally singular: reciprocal condition number = 9.07297e-20 My model.matrix is quite large as Severity is a factor with 6 levels and Sample_Date is a factor with 4 levels. Is it just that my model is overparameterized? I tried to run with only Severity but it provides a different error instead:

bm <- breakaway::betta(chats = betta.df.min$Estimate, ses = betta.df.min$Error, X = model.matrix(~Severity, data = betta.df.min)) Error in solve.default(R) : Lapack routine dgesv: system is exactly singular: U[24,24] = 0

Will the function still work if the assumption of normality is violated? And if normality is violated is there an appropriate way to transform the estimated richness and its uncertainty?

Any help would be appreciated. This is my first time utilizing GitHub to discuss a coding issue. :) Please let me know if there is additional information needed for troubleshooting.

Session info

devtools::install_github("adw96/breakaway")
Skipping install of 'breakaway' from a github remote, the SHA1 (34231bee) has not changed since last install.
  Use `force = TRUE` to force installation
> library(breakaway)
> library(phyloseq)
> library(magrittr)

Attaching package: ‘magrittr’

The following object is masked from ‘package:tidyr’:

    extract

> library(tidyverse)
── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.1 ──
✓ tibble  3.1.3     ✓ stringr 1.4.0
✓ purrr   0.3.4     ✓ forcats 0.5.1
Error: package or namespace load failed for ‘tidyverse’:
 .onAttach failed in attachNamespace() for 'tidyverse', details:
  call: value[[3L]](cond)
  error: Package ‘tibble’ version 3.1.2 cannot be unloaded:
 Error in unloadNamespace(package) : namespace ‘tibble’ is imported by ‘ggpubr’, ‘tidyr’, ‘rio’, ‘microbiome’, ‘broom’, ‘breakaway’, ‘dbplyr’, ‘rstatix’, ‘dplyr’, ‘haven’, ‘modelr’, ‘readr’, ‘ggplot2’ so cannot be unloaded
> R.version
               _                           
platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          1.0                         
year           2021                        
month          05                          
day            18                          
svn rev        80317                       
language       R                           
version.string R version 4.1.0 (2021-05-18)
nickname       Camp Pontanezen             
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magrittr_2.0.1    visreg_2.7.0      reshape2_1.4.4    microbiome_1.14.0 ANCOMBC_1.2.0     car_3.0-11        carData_3.0-4    
 [8] ggpubr_0.4.0      geosphere_1.5-10  vegan_2.5-7       lattice_0.20-44   permute_0.9-5     dplyr_1.0.7       plyr_1.8.6       
[15] breakaway_4.7.3   tidyr_1.1.3       phyloseq_1.36.0   ggplot2_3.3.5     readr_2.0.0      

loaded via a namespace (and not attached):
  [1] readxl_1.3.1           backports_1.2.1        igraph_1.2.6           sp_1.4-5               splines_4.1.0         
  [6] usethis_2.0.1          GenomeInfoDb_1.28.1    digest_0.6.27          foreach_1.5.1          htmltools_0.5.1.1     
 [11] fansi_0.5.0            memoise_2.0.0          cluster_2.1.2          remotes_2.4.0          openxlsx_4.2.4        
 [16] Biostrings_2.60.1      modelr_0.1.8           prettyunits_1.1.1      colorspace_2.0-2       rvest_1.0.1           
 [21] ggrepel_0.9.1          haven_2.4.3            rbibutils_2.2.2        xfun_0.24              callr_3.7.0           
 [26] crayon_1.4.1           RCurl_1.98-1.3         jsonlite_1.7.2         lme4_1.1-27.1          survival_3.2-11       
 [31] iterators_1.0.13       ape_5.5                glue_1.4.2             gtable_0.3.0           zlibbioc_1.38.0       
 [36] XVector_0.32.0         pkgbuild_1.2.0         Rhdf5lib_1.14.2        BiocGenerics_0.38.0    abind_1.4-5           
 [41] scales_1.1.1           DBI_1.1.1              rstatix_0.7.0          Rcpp_1.0.7             foreign_0.8-81        
 [46] stats4_4.1.0           httr_1.4.2             ellipsis_0.3.2         pkgconfig_2.0.3        farver_2.1.0          
 [51] dbplyr_2.1.1           utf8_1.2.2             tidyselect_1.1.1       labeling_0.4.2         rlang_0.4.11          
 [56] munsell_0.5.0          cellranger_1.1.0       tools_4.1.0            cachem_1.0.5           cli_3.0.1             
 [61] generics_0.1.0         ade4_1.7-17            devtools_2.4.2         broom_0.7.9            evaluate_0.14         
 [66] biomformat_1.20.0      stringr_1.4.0          fastmap_1.1.0          yaml_2.2.1             processx_3.5.2        
 [71] knitr_1.33             fs_1.5.0               zip_2.2.0              purrr_0.3.4            nlme_3.1-152          
 [76] xml2_1.3.2             compiler_4.1.0         rstudioapi_0.13        curl_4.3.2             testthat_3.0.4        
 [81] ggsignif_0.6.2         reprex_2.0.1           tibble_3.1.3           stringi_1.7.3          ps_1.6.0              
 [86] desc_1.3.0             forcats_0.5.1          Matrix_1.3-4           nloptr_1.2.2.2         multtest_2.48.0       
 [91] vctrs_0.3.8            pillar_1.6.2           lifecycle_1.0.0        rhdf5filters_1.4.0     BiocManager_1.30.16   
 [96] Rdpack_2.1.2           data.table_1.14.0      cowplot_1.1.1          bitops_1.0-7           R6_2.5.0              
[101] gridExtra_2.3          rio_0.5.27             IRanges_2.26.0         sessioninfo_1.1.1      codetools_0.2-18      
[106] boot_1.3-28            MASS_7.3-54            assertthat_0.2.1       pkgload_1.2.1          rhdf5_2.36.0          
[111] rprojroot_2.0.2        withr_2.4.2            S4Vectors_0.30.0       GenomeInfoDbData_1.2.6 mgcv_1.8-36           
[116] parallel_4.1.0         hms_1.1.0              grid_4.1.0             tidyverse_1.3.1        minqa_1.2.4           
[121] rmarkdown_2.9          Rtsne_0.15             Biobase_2.52.0         lubridate_1.7.10      
adw96 commented 3 years ago

Hi @eepackard -- welcome to breakaway and GitHub! Let's start by figuring out what's going wrong with running betta. Could you please show me the output of the following?

table(betta.df.min$Severity)
table(betta.df.min$Sample_Date)

It seems that your model is overparametrized - I think you have as many datapoints as covariates in your model. This isn't a issue with betta, but we can do our best to help troubleshoot. eg., You may have meant to have date/severity be a continuous variable, and they may be categorical with a different category for each observation. I recommend that you write down the model that you want to fit, and making sure that the model that you want to fit aligns with the one that you are actually giving to betta

eepackard commented 3 years ago

Thank you for such a quick response!

Here are the following outputs:

> table(betta.df.min$Severity)

 lp-severe med-severe     medium     severe    surface    unburnt 
         8         12         12         16          8         16 
> table(betta.df.min$Sample_Date)

june  may  oct sept 
  20   20   16   16 

I am trying to fit a model with estimated richness as the response variable and severity and sample date as fixed factors with an interaction. Maybe I am not correctly inputing my data for what I am intending to achieve?

adw96 commented 3 years ago

Interesting -- I'm surprised that you're getting an error. Does this work?

bm <- breakaway::betta(chats = betta.df.min$Estimate, ses = betta.df.min$Error, X = model.matrix(~Severity - 1, data = betta.df.min))

and similarly for X = model.matrix(~Severity*Sample_Date - 1, data = betta.df.min)?

eepackard commented 3 years ago

I tried that first line of code you suggested after reading the discussion from #85 but still got a similar results.

> bm <- breakaway::betta(chats = betta.df.min$Estimate,
+                        ses = betta.df.min$Error,
+                        X = model.matrix(~Severity-1, data = betta.df.min))
Error in solve.default(R) : 
  Lapack routine dgesv: system is exactly singular: U[24,24] = 0

Likewise when I try with X = model.matrix(~Severity*Sample_Date - 1, data = betta.df.min). It just has a different "reciprocal condition number" in the error message

> bm <- breakaway::betta(chats = betta.df.min$Estimate,
+                        ses = betta.df.min$Error,
+                        X = model.matrix(~Severity*Sample_Date-1, data = betta.df.min))
Error in solve.default(t(X_effective) %*% X_effective) : 
  system is computationally singular: reciprocal condition number = 1.23894e-19
adw96 commented 3 years ago

I had an epiphany -- this should work, I think

my_x <- model.matrix(~Severity, data = betta.df.min)[, -1]
bm <- breakaway::betta(chats = betta.df.min$Estimate, ses = betta.df.min$Error, X = my_x)

betta will add an intercept by default, so you can drop it from your covariate matrix.

Sorry for the trouble and fingers crossed this fixes it!

eepackard commented 3 years ago

Sadly, still not working

my.x<-model.matrix(~Severity*Sample_Date,data = betta.df.min)[,-1]
> bm <- breakaway::betta(chats = betta.df.min$Estimate,
+                        ses = betta.df.min$Error,
+                        X = my.x)
Error in solve.default(t(X_effective) %*% X_effective) : 
  system is computationally singular: reciprocal condition number = 1.64234e-19

my.x<-model.matrix(~Severity,data = betta.df.min)[,-1]
> bm <- breakaway::betta(chats = betta.df.min$Estimate,
+                        ses = betta.df.min$Error,
+                        X = my.x)
Error in solve.default(R) : 
  Lapack routine dgesv: system is exactly singular: U[24,24] = 0

This is what the matrix looks like for my.x<-model.matrix(~Severity,data = betta.df.min)[,-1]. So I suppose it is not the intercept causing the issue?

> head(my.x)
 Severitymed-severe Severitymedium Severitysevere Severitysurface Severityunburnt
M10E                  0              0              1               0               0
M10G                  0              0              1               0               0
M11E                  0              0              1               0               0
M11G                  0              0              1               0               0
M12E                  1              0              0               0               0
M12G                  1              0              0               0               0

and also a sample of what my data frame betta.df.min looks like:

> head(betta.df.min)
     Estimate      Error   Severity Sample_Date
M10E  64.46422  0.2725754     severe         may
M10G  60.34763  0.2255343     severe        june
M11E  63.82257  0.6754035     severe         may
M11G  24.23992  0.4901958     severe        june
M12E  81.63864  0.3182650 med-severe         may
M12G 108.06674 73.0194010 med-severe        june

> str(betta.df.min)
'data.frame':   72 obs. of  4 variables:
 $ Estimate   : num  64.5 60.3 63.8 24.2 81.6 ...
 $ Error      : num  0.273 0.226 0.675 0.49 0.318 ...
 $ Severity   : chr  "severe" "severe" "severe" "severe" ...
 $ Sample_Date: chr  "may" "june" "may" "june" ...

Thank you for patience and help!

adw96 commented 3 years ago

This is really strange - most likely your design matrix is not full rank but there's nothing suggesting this as the issue from the other diagnostics. If you could email me the file genereated by saveRDS(betta.df.min, "breakaway-issue-108.RDS") I will play around with it on my machine and get back to you.

My attempt at recreating the issue wasn't successful (below), so unfortunately I don't think I can troubleshoot further without having the data myself 😞

n <- 20*2+16*2
est <- rnorm(n, 100)
sd <- rnorm(n, 10)
Severity <- rep(c("lp-severe", "med-severe", "medium", "severe", "surface", "unburnt"), n/6)
date <- rep(c("may", "june", "oct", "sept"), each = n/4) 

my.x <- model.matrix(~Severity*date)
breakaway::betta(chats = est,
                 ses = sd,
                 X = my.x)
adw96 commented 3 years ago

Fantastic, thanks for emailing me the needed data, @eepackard. Indeed I can reproduce the error.

issue <- readRDS("[emailed-file]")
library(breakaway)
library(magrittr)
library(tidyverse)
issue %>%
  group_by(Severity, Sample_Date) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  print(n=Inf)

We have 6 different severities (issue$Severity %>% unique %>% length) and 4 different sample dates, so 6 x 4 = 24 levels in a full interactive model. However, we only have 22 of these levels represented (looks like no lp-severe's in October or September). As a result, there's no data for betta to estimate for these combination of levels, which is specified in a fully crossed model.

So, some of your options include fitting an additive model (no interaction between Severity and Sample Date), or fitting a reduced interactive model -- basically just estimating parameters for all combinations of levels that exist in the dataset (since there's at least 2 observations per level for all levels that are present).

If the purpose is to investigate the effect of severity (and sample date is thought of as a blocking variable such as a batch effect), I would probably recommend the additive model (though thorough consulting is unfortunately beyond the scope of this forum).

I'm happy to advise on how to do the latter if that's what you want -- let me know :)

eepackard commented 3 years ago

Hi @adw96,

Thank you for looking into this issue further. I understand why that wouldn't work for the crossed model.

I still had issues running betta with an additive model. I get the same error message when using an additive model or even just running one variable separately.

my.x<-model.matrix(~Severity+Sample_Date,data = betta.df.min)
head(my.x)
     (Intercept) SeverityLow SeverityMedium SeverityMedium-Severe SeveritySevere SeverityLP-Severe
M10E           1           0              0                     0              1                 0
M10G           1           0              0                     0              1                 0
M11E           1           0              0                     0              1                 0
M11G           1           0              0                     0              1                 0
M12E           1           0              0                     1              0                 0
M12G           1           0              0                     1              0                 0
     Sample_DateOctober Sample_DateMay Sample_DateJune
M10E                  0              1               0
M10G                  0              0               1
M11E                  0              1               0
M11G                  0              0               1
M12E                  0              1               0
M12G                  0              0               1

However, I think the later might be a better option for my question anyways. Could you please advise me on how to do the later? I would be very interested to know.

Thank you again.

Erica :)

adw96 commented 3 years ago

Ah, I see that you will also have to remove a sample that had zero error in it. Here's a fix for the additive model:

new_x <- issue %>% 
  select(Severity, Sample_Date) %>% 
  tibble %>%
  mutate(across(everything(), as.character)) %>%
  model.matrix(~ Severity + Sample_Date, data = .) 
zero_error <- which(issue$Error == 0)
betta(chats=issue$Estimate[-zero_error], ses = issue$Error[-zero_error], 
      X = new_x[-zero_error, ]) %$%
  table

and for the interactive model with missing levels

drop_missing_levels <- issue %>%
  select(Severity, Sample_Date) %>%
  mutate(combos = paste(Severity, Sample_Date, sep="--")) %>%
  select(combos) %>% 
  model.matrix(~combos - 1, data=.) 
colnames(drop_missing_levels) <- str_remove(colnames(drop_missing_levels), "combos")
betta(chats = issue$Estimate[-zero_error], 
      ses = issue$Error[-zero_error], 
      X = drop_missing_levels[-zero_error, ]) %$% 
  table

Note to self: We should have a warning for a zero error or non-full-rank design matrix!