bioFAM / MOFA

Multi-Omics Factor Analysis
GNU Lesser General Public License v3.0

Imputation of Factor Loadings #12

Closed whitleyo closed 6 years ago

whitleyo commented 6 years ago

Hi,

I'm interested in clustering based on the factor loadings found by MOFA, but noticed that in the final output object after training the MOFA model there will be missing values for factors (as noted in the clusterSamples function documentation). I'm thinking of using kNN to impute factor loadings that are missing for samples, and then applying a clustering algorithm (e.g. k-means, spectral).

Do you guys have any thoughts on using kNN to impute factor loadings? I'm familiar with the basics of the model definition presented in the paper's supplementary info, but not the lower-level details.

Thanks!

ttriche commented 6 years ago

you might want to try matrix completion/soft imputation and compare against straight kNN

good idea though, I have holes in my data, I imputed them with kNN prior to decomposition for a grant we put in last fall

--t


rargelaguet commented 6 years ago

Hi. First question: the MOFA model can have missing values in Z (the factors). This is a technicality that happens when you have a factor unique to a particular assay, let's say mRNA, and some samples have no mRNA information. The model cannot pool information from other assays for those samples, so it does not make sense for them to have any value in this latent space. This is why we set it to NA. If you don't want NAs in order to do clustering, the safest thing is to replace them with zeros.
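The zero replacement described above can be sketched outside of MOFA itself. This is a minimal illustration, assuming the factors have been exported to a samples x factors NumPy array with NaN standing in for NA; the array `Z` and its values are made up for the example and are not output from a real model.

```python
import numpy as np

# Hypothetical factor matrix Z (rows: samples, columns: factors).
# NaN marks the entries that the model reported as NA.
Z = np.array([
    [0.8, -1.2],
    [-0.5, np.nan],
    [0.1, 0.9],
    [np.nan, 0.3],
])

# Replace every missing factor value with zero; observed values are untouched.
Z_filled = np.where(np.isnan(Z), 0.0, Z)
```

`Z_filled` can then be passed to any clustering routine that does not accept missing values.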

Second question: imputing factor loadings? The loadings (W) don't have any missing values, only the factors (Z). If you want to "impute" them, see response above and just set them to 0. I guess a kNN approach would also do the job.
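For comparison with the zero fill, here is a rough sketch of what a kNN fill of Z could look like. It is not MOFA code: `knn_impute` is a hypothetical helper, the distance (root mean squared difference over the factors both samples have) and the default `k=2` are assumptions made for illustration, and the toy matrix is invented. Note the caution further down the thread about how uncertain such imputed values are.

```python
import numpy as np

def knn_impute(Z, k=2):
    """Fill NaNs in Z (samples x factors) with the mean of the k nearest donors.

    Hypothetical helper for illustration only. Distances use only the
    factors that are observed in both the target sample and the donor.
    """
    Z = np.asarray(Z, dtype=float)
    out = Z.copy()
    missing = np.isnan(Z)
    for i, j in zip(*np.nonzero(missing)):
        # Donors: samples that have an observed value for factor j.
        donors = np.flatnonzero(~missing[:, j])
        # Factors observed in both sample i and each donor.
        shared = ~missing[i] & ~missing[donors]
        n_shared = shared.sum(axis=1)
        usable = n_shared > 0
        if not usable.any():
            continue  # nothing to compare on; leave the value as NaN
        diff = np.where(shared, Z[donors] - Z[i], 0.0)
        dist = np.sqrt((diff[usable] ** 2).sum(axis=1) / n_shared[usable])
        nearest = donors[usable][np.argsort(dist)[:k]]
        out[i, j] = Z[nearest, j].mean()
    return out

# Toy example: the last sample lacks a value for the second factor.
Z = np.array([
    [1.00, 1.0],
    [1.10, 0.9],
    [5.00, 5.0],
    [1.05, np.nan],
])
Z_knn = knn_impute(Z, k=2)
```

In this toy case the two nearest donors are the first two samples, so the missing value is filled with the mean of their second-factor values.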

P.S. Tim: you mean that you imputed your observations (Y)? The model can safely ignore missing values.

whitleyo commented 6 years ago

Sorry for getting the terminology mixed up. Yes, I'm referring to missing values in the factors (latent variables). Thanks!

rargelaguet commented 6 years ago

Let me emphasise that imputing those values is highly uncertain. If the model set a value to NA, it is because it was not able to pool information from other molecular layers to infer the value of this sample in this factor. Therefore, as factors are sort-of independent, any imputation with kNN using the other factors is highly uncertain and I advise against this.

My recommendation: (1) Consider clustering approaches that can deal with NAs. (2) If the factor has NAs for only a small set of samples, just set them to 0 or do kNN imputation; it shouldn't affect the clustering performance massively. (3) If the factor contains a lot of NAs and it does not explain much variance, just remove it using subsetFactors().
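Options (2) and (3) can be sketched as a small triage step. This is only an illustration under stated assumptions: `triage_factors` is a hypothetical helper (not part of MOFA), the cut-offs `max_na_frac` and `min_r2` are arbitrary, and `r2` stands for a per-factor variance-explained summary that you would have to supply yourself. Factors with many NAs but high variance explained are not covered by the recommendation; here they are simply kept and zero-filled.

```python
import numpy as np

def triage_factors(Z, r2, max_na_frac=0.2, min_r2=0.05):
    """Drop NA-heavy, low-variance factors and zero-fill the rest.

    Z  : samples x factors array, NaN where the model reported NA.
    r2 : per-factor variance explained (assumed to be supplied by the user).
    Returns the cleaned matrix and the indices of the factors that were kept.
    """
    Z = np.asarray(Z, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    na_frac = np.isnan(Z).mean(axis=0)  # fraction of NA samples per factor
    drop = (na_frac > max_na_frac) & (r2 < min_r2)
    kept = np.flatnonzero(~drop)
    Z_kept = Z[:, kept]
    Z_clean = np.where(np.isnan(Z_kept), 0.0, Z_kept)
    return Z_clean, kept

# Toy example: the third factor is mostly NA and explains little variance.
Z = np.array([
    [0.5, 1.0, np.nan],
    [-0.3, np.nan, np.nan],
    [0.2, -0.7, 0.4],
    [-0.4, 0.1, np.nan],
    [0.0, -0.4, -0.4],
])
r2 = [0.30, 0.10, 0.01]
Z_clean, kept = triage_factors(Z, r2)
```

With these made-up numbers the third factor is dropped, and the single NA in the second factor is set to zero. In MOFA itself the removal step would be done with subsetFactors(), as suggested above.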