bioFAM / MOFA2

Multi-Omics Factor Analysis
https://biofam.github.io/MOFA2/
GNU Lesser General Public License v3.0
305 stars 52 forks source link

Inconsistent scaling of feature weights in get_weights() #97

Open oliviaAB opened 2 years ago

oliviaAB commented 2 years ago

Hi,

I have been playing with MOFA and MEFISTO, and I found some inconsistency in the way the feature weights are scaled when using the function get_weights(). Here is a reproducible example:

library(MOFA2)
library(tidyverse)

set.seed(4675)
data <- make_example_data(
  n_views = 3,
  n_samples = 100,
  n_features = 30,
  n_factors = 4
)[[1]]

mofa_input <- create_mofa(data)

mofa_input <- prepare_mofa(
  object = mofa_input,
  data_options = get_default_data_options(mofa_input),
  model_options = get_default_model_options(mofa_input),
  training_options = get_default_training_options(mofa_input)
)

mofa_output <- run_mofa(mofa_input, outfile = "example_mofa.hdf5")

If I extract the weights as a list with scale = TRUE, I get:

get_weights(mofa_output, scale = TRUE)[["view_2"]] %>%
  head()
##                       Factor1      Factor2       Factor3       Factor4
## feature_1_view2 -0.1612196265 -0.018829557 -0.0517932983  0.0081333674
## feature_2_view2 -0.7949943041 -0.009532788 -0.0069889322  0.0007034645
## feature_3_view2 -0.0003177275  0.002657812  0.0029039494  0.0050075375
## feature_4_view2 -0.0009049137  0.005219771  0.0098385321 -0.0039028108
## feature_5_view2  0.0051936120 -0.015646936 -0.0009356091  0.0061702440
## feature_6_view2 -0.7401169584  0.016059141 -0.0090599593 -0.0002146552

While if I extract the scaled weights as a data-frame with as.data.frame = TRUE, I get:

get_weights(mofa_output, scale = TRUE, as.data.frame = TRUE) %>%
  filter(view == "view_2") %>%
  pivot_wider(names_from = factor,
              values_from = value) %>%
  head()
## # A tibble: 6 × 6
##   feature         view     Factor1  Factor2   Factor3   Factor4
##   <fct>           <fct>      <dbl>    <dbl>     <dbl>     <dbl>
## 1 feature_1_view2 view_2 -0.0861   -0.0101  -0.0277    0.00434 
## 2 feature_2_view2 view_2 -0.425    -0.00509 -0.00373   0.000376
## 3 feature_3_view2 view_2 -0.000170  0.00142  0.00155   0.00267 
## 4 feature_4_view2 view_2 -0.000483  0.00279  0.00525  -0.00208 
## 5 feature_5_view2 view_2  0.00277  -0.00836 -0.000500  0.00329 
## 6 feature_6_view2 view_2 -0.395     0.00858 -0.00484  -0.000115

From what I can tell, this is because in the get_weights() function, if as.data.frame is FALSE, the scaling is performed as follows:

weights <- lapply(weights, function(x) x/max(abs(x)))

Which means that the weights are scaled independently for each view. If on the other hand as.data.frame is TRUE, then the scaling is performed at once on the whole data-frame, so across all views and factors:

weights$value <- weights$value/max(abs(weights$value))

I noted that the scaled weights obtained with as.data.frame = TRUE and as.data.frame = FALSE are identical for features from view 1 since it's the view that has the highest absolute weight, which is used for scaling the entire dataset when as.data.frame = TRUE:

map_dbl(get_weights(mofa_output), ~max(abs(.x)))
##   view_1   view_2   view_3 
## 1.972464 1.053267 1.813308

Possible solution?

I personally find it more useful if the weights are scaled by factor rather than by view, as it allows me to see which features are contributing the most to a given factor across all views. But there are times where scaling by view is very useful as well.

Maybe that could become an option? e.g. the scaling parameter could be 'none', 'by_view', 'by_factor', or even 'overall'. I made a custom version of the get_weights() function with this in mind:

get_weights_custom <- function (object, views = "all", factors = "all", abs = FALSE, scale = 'none', as.data.frame = FALSE){
  if (!is(object, "MOFA"))
    stop("'object' has to be an instance of MOFA")

  if(!(scale %in% c('none', 'by_view', 'by_factor', 'overall'))) stop("scale should be one of: 'none', 'by_view', 'by_factor', 'overall'.")

  views <- MOFA2:::.check_and_get_views(object, views)
  factors <- MOFA2:::.check_and_get_factors(object, factors)

  ## Get the raw weights as a long data-frame by default, 
  ## and filter the relevant views and factors
  weights <- get_expectations(object, "W", as.data.frame = TRUE) %>%
    dplyr::filter(view %in% views,
                  factor %in% factors)

  ## Transform to absolute value if needed
  if(abs) weights$value <- abs(weights$value)

  ## If some scaling must be performed
  if(scale != 'none'){

    if(scale == "by_view") weights <- dplyr::group_by(weights, view)
    if(scale == "by_factor") weights <- dplyr::group_by(weights, factor)

    ## When scaling by view or factor, the grouping ensures that the maximum
    ## value is selected within the relevant group (i.e. either view or factor)
    ## otherwise if no grouping is performed (when scale = 'overall'), takes
    ## the max absolute value across the entire dataset.
    weights <- weights %>%
      dplyr::mutate(value = value / max(abs(value))) %>%
      dplyr::ungroup()
  }

  ## If we want to return a list, transform the long data-frame as a list
  ## of tibbles (can add as.data.frame() at the end if we don't want a tibble)
  if(!as.data.frame){
    weights <- purrr::map(views,
                          ~ weights %>%
                            dplyr::filter(view == .x) %>%
                            tidyr::pivot_wider(names_from = factor,
                                               values_from = value) %>%
                            dplyr::select(-view)
                          )

    names(weights) <- views
  }

  return(weights)
}

And now the scaling is consistent:

get_weights_custom(mofa_output, scale = "by_view")[["view_2"]] %>%
  head()
## # A tibble: 6 × 5
##   feature           Factor1  Factor2   Factor3   Factor4
##   <fct>               <dbl>    <dbl>     <dbl>     <dbl>
## 1 feature_1_view2 -0.161    -0.0188  -0.0518    0.00813 
## 2 feature_2_view2 -0.795    -0.00953 -0.00699   0.000703
## 3 feature_3_view2 -0.000318  0.00266  0.00290   0.00501 
## 4 feature_4_view2 -0.000905  0.00522  0.00984  -0.00390 
## 5 feature_5_view2  0.00519  -0.0156  -0.000936  0.00617 
## 6 feature_6_view2 -0.740     0.0161  -0.00906  -0.000215
get_weights_custom(mofa_output, scale = "by_view", as.data.frame = TRUE) %>%
  filter(view == "view_2") %>%
  pivot_wider(names_from = factor,
              values_from = value) %>%
  head()
## # A tibble: 6 × 6
##   feature         view     Factor1  Factor2   Factor3   Factor4
##   <fct>           <fct>      <dbl>    <dbl>     <dbl>     <dbl>
## 1 feature_1_view2 view_2 -0.161    -0.0188  -0.0518    0.00813 
## 2 feature_2_view2 view_2 -0.795    -0.00953 -0.00699   0.000703
## 3 feature_3_view2 view_2 -0.000318  0.00266  0.00290   0.00501 
## 4 feature_4_view2 view_2 -0.000905  0.00522  0.00984  -0.00390 
## 5 feature_5_view2 view_2  0.00519  -0.0156  -0.000936  0.00617 
## 6 feature_6_view2 view_2 -0.740     0.0161  -0.00906  -0.000215

I would love to use something like this for plot_weights() and plot_top_weights as well.

Hope that makes sense, and thanks for your help!