bioFAM / MOFA

Multi-Omics Factor Analysis
GNU Lesser General Public License v3.0

Truncated feature names after runMOFA #33

Closed: fanli-gcb closed this issue 5 years ago

fanli-gcb commented 5 years ago

I noticed that somewhere in the runMOFA function the feature names are getting truncated, which creates non-unique names that break downstream code.

As an example, here are two of the original features:

> rownames(MOFAobject@TrainData[["plasma"]])[779:780]
[1] "[plasma] sulfate of piperine metabolite C16H19NO3 (2)*"
[2] "[plasma] sulfate of piperine metabolite C16H19NO3 (3)*"

After running runMOFA:

> MOFAobject2 <- loadModel(modelFile, MOFAobject) # loading results from runMOFA
> rownames(MOFAobject2@TrainData[["plasma"]])[779:780]
[1] "[plasma] sulfate of piperine metabolite C16H19NO3 "
[2] "[plasma] sulfate of piperine metabolite C16H19NO3 "

Any ideas on where this truncation is happening? I have narrowed it down to the runMOFA call, but not sure where within that function.

Thanks in advance for any help!

rargelaguet commented 5 years ago

That is indeed the case. The problem arises when saving the sample names to the HDF5 file. In the current HDF5 version, the strings are restricted to 50 characters. I couldn't find a way around it.

There should be a warning in prepareMOFA:

if (any(nchar(sampleNames(object))>50)) warning("Due to string size limitations in the HDF5 format, sample names will be trimmed to less than 50 characters")

However, there is a simple solution: edit the sampleNames manually after loading the model, making sure that the order is consistent:

sampleNames(object) <- sample_names
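
To see in advance which names would be affected, a small base R check along these lines might help. It is only a sketch: find_truncation_collisions is a hypothetical helper, not part of MOFA, and the 50-character limit is taken from the comment above rather than verified against the HDF5 writer.

```r
# Hypothetical helper (not part of MOFA): report names that become
# identical once they are cut to `limit` characters.
# Assumption: truncation keeps the first `limit` characters, as the
# example output in this thread suggests.
find_truncation_collisions <- function(nms, limit = 50L) {
  truncated <- substr(nms, 1L, limit)
  # TRUE for every name whose truncated form occurs more than once
  collides <- truncated %in% truncated[duplicated(truncated)]
  # Group the original names by the truncated string they collapse to
  split(nms[collides], truncated[collides])
}

nms <- c(
  "[plasma] sulfate of piperine metabolite C16H19NO3 (2)*",
  "[plasma] sulfate of piperine metabolite C16H19NO3 (3)*",
  "short name"
)
collisions <- find_truncation_collisions(nms)
print(collisions)
```

An empty result means no two names collapse to the same string, although names longer than the limit would still be shortened.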

fanli-gcb commented 5 years ago

Thanks for the help! Here's the code I used for the workaround in case it's useful for anyone else (note that it operates on featureNames rather than sampleNames):

featurenames <- MOFA::featureNames(MOFAobject) # prior to runMOFA
MOFA::featureNames(MOFAobject) <- featurenames[names(MOFA::featureNames(MOFAobject))] # after runMOFA or loadModel
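
Because the fix depends on the order being consistent, it may be worth checking that order before overwriting anything. The sketch below uses only base R; restore_names is a hypothetical helper, not a MOFA function, and it assumes each loaded name is a leading substring of the original name at the same position.

```r
# Hypothetical helper (not part of MOFA): return the original names only
# if every loaded (possibly truncated) name is a prefix of the original
# name at the same position; otherwise stop with the offending positions.
restore_names <- function(original, loaded) {
  if (length(original) != length(loaded)) {
    stop("original and loaded names differ in length")
  }
  ok <- startsWith(original, loaded)
  if (!all(ok)) {
    stop("name order mismatch at positions: ",
         paste(which(!ok), collapse = ", "))
  }
  original
}

original <- c(
  "[plasma] sulfate of piperine metabolite C16H19NO3 (2)*",
  "[plasma] sulfate of piperine metabolite C16H19NO3 (3)*"
)
loaded <- substr(original, 1L, 50L)  # mimics the truncated names
restored <- restore_names(original, loaded)
```

The same check could be applied to each view in turn before assigning the names back to the model.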