bioFAM / MOFA2

Multi-Omics Factor Analysis
https://biofam.github.io/MOFA2/
GNU Lesser General Public License v3.0

MEFISTO error with plot_interpolation_vs_covariate() when some factors have been removed #96

Closed oliviaAB closed 2 years ago

oliviaAB commented 2 years ago

Hi,

Thanks for this great package! I've started playing with MEFISTO and have trained a model. I used the new_values parameter from get_default_mefisto_options(mefisto_input) to perform the factor interpolation when training the model in Python. However, I'm getting an error when trying to use the plot_interpolation_vs_covariate() function:

> mefisto_output <- load_model("output/trained_mefisto_model.hdf5",
+                              load_interpol_Z = TRUE)
>
> plot_interpolation_vs_covariate(mefisto_output, only_mean = FALSE)

Error in `spread()`:
! Each row of output must be identified by a unique combination of keys.
Keys are shared for 672 rows:
* 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
* 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30
* 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45
* 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
* (etc)

I've tracked the issue down to the get_interpolated_factors() function, specifically to the line three lines before the return, where the factor column is converted from integer factor IDs to factor names:

df_interpol <- mutate(df_interpol, factor = factors_names(object)[factor])

The problem is that the model has learned only 3 factors, so factors_names(object) contains the names of just these three factors:

> factors_names(mefisto_output)
[1] "Factor1" "Factor2" "Factor3"

but the interpolated_Z element of the trained MEFISTO model holds interpolated values for all 15 initial factors:

> str(mefisto_output@interpolated_Z)
List of 2
 $ Early:List of 3
  ..$ mean      : num [1:15, 1:14] -1.023 0.886 -0.461 0 0 ...
  ..$ variance  : num [1:15, 1:14] 0.445 0.384 0.761 1 1 ...
  ..$ new_values: int [1, 1:14] 1 2 3 4 5 6 7 8 9 10 ...
 $ Late :List of 3
  ..$ mean      : num [1:15, 1:14] -1.264 0.841 -0.417 0 0 ...
  ..$ variance  : num [1:15, 1:14] 0.288 0.356 0.366 1 1 ...
  ..$ new_values: int [1, 1:14] 1 2 3 4 5 6 7 8 9 10 ...

So when running:

df_interpol <- mutate(df_interpol, factor = factors_names(object)[factor])

values 4 to 15 in the factor column are turned into NAs. This triggers an error on the next line:

df_interpol <- spread(df_interpol, key = type, value = value)

because spread() expects each row to be uniquely identified by the type column together with all columns other than value, and the rows whose factor is now NA no longer are.
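To make the failure mode concrete, here is a minimal base-R sketch on toy data. The column names (covariate, type, value) and the sizes (5 factors, 2 interpolation points) are assumptions for illustration, not the actual layout of df_interpol. It shows that indexing the name vector with an out-of-range ID yields NA, and that the NA rows then share keys. By the same counting, the 672 rows in the error message are consistent with 12 discarded factors x 14 new values x 2 groups x 2 types.

```r
# Toy stand-in for the situation in this issue: 3 retained factor names,
# but interpolated values for 5 factors (15 in the real model).
factor_names <- c("Factor1", "Factor2", "Factor3")

df <- expand.grid(
  factor    = 1:5,                    # integer factor IDs
  covariate = 1:2,                    # two interpolation points (assumed name)
  type      = c("mean", "variance"),
  stringsAsFactors = FALSE
)
df$value <- seq_len(nrow(df))         # placeholder values

# Indexing a length-3 character vector with 4 or 5 returns NA
df$factor <- factor_names[df$factor]
n_na <- sum(is.na(df$factor))

# The key is every column except `value`; rows from factors 4 and 5
# now have identical keys
keys <- df[, c("factor", "covariate", "type")]
n_shared <- sum(duplicated(keys) | duplicated(keys, fromLast = TRUE))
```

In this toy case all rows with an NA factor share their key with another row, which is the condition spread() rejects.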

A solution would be to filter out the rows in df_interpol that correspond to factors that have been discarded, e.g. with:

df_interpol <- mutate(df_interpol, factor = factors_names(object)[factor]) %>%
  filter(!is.na(factor))
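As a quick check of this workaround outside MOFA2, the sketch below runs the same mutate/filter/spread chain on toy data, assuming dplyr and tidyr are installed. The factor_names vector stands in for factors_names(object), and the covariate column and data sizes are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Stand-in for factors_names(object): only 3 factors were retained
factor_names <- c("Factor1", "Factor2", "Factor3")

# Toy long-format table with values for 5 factors (IDs 4 and 5 were discarded)
df_interpol <- expand.grid(
  factor    = 1:5,
  covariate = 1:2,
  type      = c("mean", "variance"),
  stringsAsFactors = FALSE
)
df_interpol$value <- seq_len(nrow(df_interpol))

df_wide <- df_interpol %>%
  mutate(factor = factor_names[factor]) %>%  # IDs 4 and 5 become NA
  filter(!is.na(factor)) %>%                 # drop rows of discarded factors
  spread(key = type, value = value)          # keys are unique again
```

With the NA rows removed, each (factor, covariate) pair has exactly one mean and one variance row, so spread() can build the wide table.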

Hope this helps!

bv2 commented 2 years ago

Thanks a lot for reporting this bug and getting to the source of it!

We fixed this by also removing the inactive factors from the interpolated values (as is already done elsewhere in the MOFA object). The fix is part of version 1.7.1 and should resolve your issue. Let us know if you run into any problems with it.
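The idea of trimming the interpolated values can be sketched on a plain list that mimics the interpolated_Z layout from the str() output in the report. This is not the MOFA2 implementation: drop_inactive() is a hypothetical helper, and the sketch assumes the retained factors occupy the first rows of each matrix, as the report's output suggests.

```r
# Mock of the interpolated_Z layout (5 factors x 4 new values here,
# instead of 15 x 14 in the report)
make_group <- function() {
  list(
    mean       = rbind(matrix(rnorm(12), nrow = 3), matrix(0, nrow = 2, ncol = 4)),
    variance   = rbind(matrix(runif(12), nrow = 3), matrix(1, nrow = 2, ncol = 4)),
    new_values = matrix(1:4, nrow = 1)
  )
}
interpolated_Z <- list(Early = make_group(), Late = make_group())

# Hypothetical helper: keep only the rows that belong to retained factors.
# `active` holds the row indices of those factors (an assumption of this sketch).
drop_inactive <- function(interp, active) {
  lapply(interp, function(g) {
    g$mean     <- g$mean[active, , drop = FALSE]
    g$variance <- g$variance[active, , drop = FALSE]
    g                                  # new_values is left unchanged
  })
}

trimmed <- drop_inactive(interpolated_Z, active = 1:3)
```

After trimming, the number of rows in mean and variance matches the number of factor names, so indexing the names by factor ID no longer produces NA.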

oliviaAB commented 2 years ago

Thank you for the quick reply and fix! I just tried it with the newest version and everything is working fine :)

bv2 commented 2 years ago

Perfect, thanks for the feedback!