bioFAM / mofax

Work with trained factor models in Python
MIT License
30 stars 13 forks source link

Issue reading model after updating mdata obs #3

Closed SamuelAMiller1 closed 3 years ago

SamuelAMiller1 commented 3 years ago

I am trying to specify the batches while running MOFA:

mu.tl.mofa(mdata, groups_label = 'Batch', outfile = '/models/batch_seq.hdf5', n_factors=30)

Copying the Batch column from the rna modality (same in both modalities):

mdata.obs['Batch'] = mdata.mod['rna'].obs['Batch'].copy()

Doing so results in this error while trying to read in the model:

import mofax as mfx
model = mfx.mofa_model("models/bmall__batch_abseq.hdf5")
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-76-1fcbe55e318f> in <module>
----> 1 model = mfx.mofa_model("/tmp/mofa_20210706-190655.hdf5")

/srv/conda/envs/saturn/lib/python3.8/site-packages/mofax/core.py in __init__(self, filepath, mode)
     84                 samples_metadata.columns = list(self.model['samples_metadata'][self.groups[0]].keys())
     85 
---> 86                 self.samples_metadata = pd.concat([self._samples_metadata, samples_metadata], axis=1)
     87 
     88                 # Decode objects as UTF-8 strings

/srv/conda/envs/saturn/lib/python3.8/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    296     )
    297 
--> 298     return op.get_result()
    299 
    300 

/srv/conda/envs/saturn/lib/python3.8/site-packages/pandas/core/reshape/concat.py in get_result(self)
    514                     obj_labels = obj.axes[1 - ax]
    515                     if not new_labels.equals(obj_labels):
--> 516                         indexers[ax] = obj_labels.get_indexer(new_labels)
    517 
    518                 mgrs_indexers.append((obj._mgr, indexers))

/srv/conda/envs/saturn/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   3169 
   3170         if not self.is_unique:
-> 3171             raise InvalidIndexError(
   3172                 "Reindexing only valid with uniquely valued Index objects"
   3173             )

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

If I do not copy obs from a modality into the mdata obs, I am able to read in the model, but in this case I am then unable to specify a groups_label.

gtca commented 3 years ago

Hey @SamuelAMiller1, I am currently unable to reproduce it with the example that I paste below. Could there be an issue with the .obs_names not being unique?

This is the code I've written to reproduce the issue:

import muon as mu
import mofax as mfx

### Create a MuData object to work with ###

import jax
import numpy as np
import pandas as pd

n = 1000
aj, bj = 100, 200

a = mu.AnnData(np.array(jax.random.normal(key=jax.random.PRNGKey(1), shape=(n, aj))))
b = mu.AnnData(np.array(jax.random.normal(key=jax.random.PRNGKey(2), shape=(n, bj))))

a.var_names = [f"var_{j+1}" for j in range(aj)]
b.var_names = [f"var_{j+1}" for j in range(bj)]

a.obs['Batch'] = np.array(jax.random.bernoulli(key=jax.random.PRNGKey(3), shape=(n,)).astype(int))

mdata = mu.MuData({"a": a, "b": b})

### Copy the metadata column as in the issue description ###

mdata.obs['Batch'] = a.obs['Batch'].copy()  # also see a note below

### Run MOFA ###

mu.tl.mofa(mdata, outfile="issue3.hdf5")

### Inspect the model ###

model = mfx.mofa_model("issue3.hdf5")
model.metadata.head()
#   group   Batch   a:Batch
# sample            
# 0 group1  1   1
# 1 group1  0   0
# 2 group1  0   0
# 3 group1  1   1
# 4 group1  1   1

model.close()

Is there anything I could add to this code to match your case closer?


While it is probably mildly relevant to the issue, you should be able to use the column for the batches from the modality without manually copying it to the mdata.obs. If the mdata object was created before mdata.mod['rna'] was modified, you might need to run mdata.update() to sync the content of the mdata with the modalities enclosed. This would actually copy this column for you while taking care of possible missing or extra samples in each modality. Then you should be able to just write mu.tl.mofa(mdata, groups_label = 'rna:Batch', ...).

gtca commented 3 years ago

Also, I am not sure this is the latest master of mofax, could you try that version as well? Not sure it is going to help with this particular issue but we've improved handling of quite a few cases with the most recent version.

Should be installable with the pip in your conda environment:

pip install git+git://github.com/bioFAM/mofax@master
SamuelAMiller1 commented 3 years ago

After further digging, the issue was related to reading in a model where the groups_label had been specified. Per your suggestion, using the latest master seems to have resolved the issue. Thanks!