bioFAM / MOFA2

Multi-Omics Factor Analysis
https://biofam.github.io/MOFA2/
GNU Lesser General Public License v3.0

create_mofa_from_df issue #134

Open alfonsosaera opened 9 months ago

alfonsosaera commented 9 months ago

Hi,

I found an error while working with create_mofa_from_df for a grouped experiment.

My data is composed of two different groups with expression data (normalized RNAseq) and mutational data (binary matrix).

In long data frame format, the data look like this:

  sample group feature      view     value
1  s6-01    G1   LCE4A      mRNA -315.3428
2  s7-01    G1   LCE4A      mRNA -315.3428
3  s8-01    G1   LCE4A      mRNA -315.3428
...
4  s6-01    G1  SORCS1 Mutations    1.0000
5  s6-01    G1    NRAP Mutations    1.0000
6  s6-01    G1   MKI67 Mutations    1.0000
...
7  s4-01    G2   OR6K2      mRNA -121.9952
8  s9-01    G2   OR6K2      mRNA -121.9952
9  s0-01    G2   OR6K2      mRNA -121.9952
...
10 s4-01    G2   MKI67 Mutations    1.0000
11 s4-01    G2    NEBL Mutations    1.0000
12 s4-01    G2   ERCC6 Mutations    1.0000
...
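
A few sanity checks on a long data frame like the one above can be sketched in base R before it is handed to MOFA2. This is only an illustrative sketch: the toy values are made up, and the checks (required columns, duplicated entries, strictly binary values in the Bernoulli view, one group per sample) are assumptions about what is worth inspecting, not a diagnosis of this issue. Note that the excerpt above shows only 1s in the Mutations view; whether 0s are also present cannot be seen from it, which is why the last step tabulates the values.

```r
# Toy long-format data frame with the same five columns (values are made up).
df <- data.frame(
  sample  = c("s1", "s2", "s1", "s2", "s3"),
  group   = c("G1", "G1", "G1", "G1", "G2"),
  feature = c("LCE4A", "LCE4A", "MKI67", "MKI67", "MKI67"),
  view    = c("mRNA", "mRNA", "Mutations", "Mutations", "Mutations"),
  value   = c(-1.2, 0.4, 1, 0, 1),
  stringsAsFactors = FALSE
)

# 1. All expected columns are present.
required <- c("sample", "group", "feature", "view", "value")
missing_cols <- setdiff(required, colnames(df))

# 2. No (sample, feature, view) combination appears more than once.
n_duplicated <- sum(duplicated(df[, c("sample", "feature", "view")]))

# 3. The view modelled with a Bernoulli likelihood holds only 0, 1 or NA.
mut_values <- df$value[df$view == "Mutations"]
non_binary <- mut_values[!is.na(mut_values) & !(mut_values %in% c(0, 1))]

# 4. Each sample belongs to exactly one group.
groups_per_sample <- tapply(df$group, df$sample, function(g) length(unique(g)))
multi_group_samples <- names(groups_per_sample)[groups_per_sample > 1]

# 5. Tabulate the binary view to see which values actually occur.
print(table(mut_values, useNA = "ifany"))
```

Empty results for `missing_cols`, `non_binary` and `multi_group_samples`, and a zero `n_duplicated`, mean the data frame passes these particular checks; they do not guarantee that training will converge.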

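I create the object and set the options as follows, changing only the likelihoods, the number of factors and the convergence mode:

MOFAobject <- create_mofa(groups_data)
data_opts <- get_default_data_options(MOFAobject)
model_opts <- get_default_model_options(MOFAobject)
model_opts$likelihoods <- c(mRNA = "gaussian", Mutations = "bernoulli")
model_opts$num_factors <- 10
# train_opts must exist before it is modified
train_opts <- get_default_training_options(MOFAobject)
train_opts$convergence_mode <- "slow"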

The samples are loaded without error and I can plot them. However, when I train the model

MOFAobject <- prepare_mofa(MOFAobject,
                           data_options = data_opts,
                           model_options = model_opts,
                           training_options = train_opts)

I get this output, in which the ELBO decreases and then becomes NaN:

...
######################################
## Training the model with seed 42 ##
######################################

ELBO before training: -16523529.00 

Iteration 1: time=0.80, ELBO=-3204712.29, deltaELBO=13318816.703 (80.60515829%), Factors=10
Iteration 2: time=0.64, Factors=10
Iteration 3: time=0.70, Factors=10
Iteration 4: time=0.65, Factors=10
Iteration 5: time=0.63, Factors=10
Iteration 6: time=0.72, ELBO=-4911453.66, deltaELBO=-1706741.365 (10.32915768%), Factors=10
Warning, lower bound is decreasing...
Iteration 7: time=0.59, Factors=10
Iteration 8: time=0.63, Factors=10
Iteration 9: time=0.59, Factors=10
...
Iteration 994: time=0.58, Factors=10
Iteration 995: time=0.56, Factors=10
Iteration 996: time=0.67, ELBO=nan, deltaELBO=nan (nan%), Factors=10
Iteration 997: time=0.74, Factors=10
Iteration 998: time=0.58, Factors=10
Iteration 999: time=0.56, Factors=10

#######################
## Training finished ##
#######################

Saving model in MOFA2_model.hdf5...

If I change the input format from a long data frame to a list of matrices (adding columns filled with NAs to account for samples missing from a view), the model trains properly:

######################################
## Training the model with seed 42 ##
######################################

ELBO before training: -25545845.89 

Iteration 1: time=0.85, ELBO=-6460958.54, deltaELBO=19084887.345 (74.70837892%), Factors=10
Iteration 2: time=0.50, Factors=10
Iteration 3: time=0.57, Factors=10
Iteration 4: time=0.56, Factors=10
Iteration 5: time=0.68, Factors=10
Iteration 6: time=0.79, ELBO=-5488468.27, deltaELBO=972490.277 (3.80684312%), Factors=10
Iteration 7: time=0.70, Factors=10
Iteration 8: time=0.57, Factors=10
Iteration 9: time=0.50, Factors=10
Iteration 10: time=0.49, Factors=10
...
Iteration 513: time=1.04, Factors=10
Iteration 514: time=0.78, Factors=10
Iteration 515: time=0.74, Factors=10
Iteration 516: time=0.83, ELBO=-5425800.46, deltaELBO=1.117 (0.00000437%), Factors=10

Converged!

#######################
## Training finished ##
#######################

Saving model in MOFA2_model2.hdf5
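
The conversion described above (one matrix per view, with NA columns for samples that are missing from a view) can be sketched in base R as follows. This is a minimal sketch, not the exact code used in this report: the toy data are made up, and `long_to_matrices()` is a hypothetical helper, not a MOFA2 function.

```r
# Toy long-format data; the real data frame has the same five columns.
df <- data.frame(
  sample  = c("s1", "s2", "s1", "s3"),
  group   = c("G1", "G1", "G1", "G2"),
  feature = c("LCE4A", "LCE4A", "MKI67", "MKI67"),
  view    = c("mRNA", "mRNA", "Mutations", "Mutations"),
  value   = c(-1.2, 0.4, 1, 1),
  stringsAsFactors = FALSE
)

# Use the same samples, in the same order, as columns of every view.
all_samples <- unique(df$sample)

# Hypothetical helper (not part of MOFA2): builds one features x samples
# matrix per view, leaving NA wherever a sample has no value.
long_to_matrices <- function(df, samples) {
  lapply(split(df, df$view), function(d) {
    features <- unique(d$feature)
    m <- matrix(NA_real_, nrow = length(features), ncol = length(samples),
                dimnames = list(features, samples))
    # Index by (feature name, sample name) pairs to fill observed entries.
    m[cbind(d$feature, d$sample)] <- d$value
    m
  })
}

mats <- long_to_matrices(df, all_samples)

# Group label for each sample, in the same order as the matrix columns.
sample_groups <- df$group[match(all_samples, df$sample)]
```

Assuming MOFA2 is installed, `create_mofa(mats, groups = sample_groups)` should then build the object from the matrices. Fixing the column order through `all_samples` keeps the samples aligned across views, which the matrix input relies on.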

Session info:

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MOFA2_1.4.0       viridis_0.6.4     viridisLite_0.4.2 ggplot2_3.4.3    
## [5] pheatmap_1.0.12   tidylog_1.0.2     tidyr_1.3.0       dplyr_1.1.3      
## 
## loaded via a namespace (and not attached):
##  [1] MatrixGenerics_1.6.0 sass_0.4.7           jsonlite_1.8.7      
##  [4] bslib_0.5.1          highr_0.10           stats4_4.1.3        
##  [7] yaml_2.3.7           ggrepel_0.9.3        corrplot_0.92       
## [10] pillar_1.9.0         lattice_0.21-8       glue_1.6.2          
## [13] reticulate_1.31      digest_0.6.33        RColorBrewer_1.1-3  
## [16] colorspace_2.1-0     cowplot_1.1.1        htmltools_0.5.6     
## [19] Matrix_1.6-1         plyr_1.8.8           clisymbols_1.2.0    
## [22] pkgconfig_2.0.3      dir.expiry_1.2.0     purrr_1.0.2         
## [25] scales_1.2.1         HDF5Array_1.22.1     Rtsne_0.16          
## [28] tibble_3.2.1         generics_0.1.3       farver_2.1.1        
## [31] IRanges_2.28.0       cachem_1.0.8         withr_2.5.0         
## [34] BiocGenerics_0.40.0  cli_3.6.1            magrittr_2.0.3      
## [37] evaluate_0.21        fansi_1.0.4          forcats_1.0.0       
## [40] tools_4.1.3          lifecycle_1.0.3      matrixStats_1.0.0   
## [43] basilisk.utils_1.6.0 stringr_1.5.0        Rhdf5lib_1.16.0     
## [46] S4Vectors_0.32.4     munsell_0.5.0        DelayedArray_0.20.0 
## [49] compiler_4.1.3       jquerylib_0.1.4      rlang_1.1.1         
## [52] rhdf5_2.38.1         grid_4.1.3           rhdf5filters_1.6.0  
## [55] labeling_0.4.3       rmarkdown_2.24       basilisk_1.6.0      
## [58] gtable_0.3.4         reshape2_1.4.4       R6_2.5.1            
## [61] gridExtra_2.3        knitr_1.43           uwot_0.1.16         
## [64] fastmap_1.1.1        utf8_1.2.3           filelock_1.0.2      
## [67] stringi_1.7.12       parallel_4.1.3       Rcpp_1.0.11         
## [70] vctrs_0.6.3          png_0.1-8            tidyselect_1.2.0    
## [73] xfun_0.40

Thanks

rargelaguet commented 9 months ago

Hi @alfonsosaera, thanks for the bug report. I suspect it has to do with the binary data. Would it be possible to send me the data frame by email (ricard.argelaguet@gmail.com) so that I can debug this?

alfonsosaera commented 9 months ago

Hi @rargelaguet, thanks for the quick answer. This has been a very busy week for me, but I will send the data frame this week so that you can debug it. Thanks for the support. Alfonso