AnnDataSchema validate_anndata is over-zealous for obs categoricals

cellarium-ai / cellarium-ml

Distributed single-cell data analysis.

BSD 3-Clause "New" or "Revised" License


AnnDataSchema validate_anndata is over-zealous for obs categoricals #179

Closed sjfleming closed 4 months ago

sjfleming commented 4 months ago

Here is where the problem occurs: https://github.com/cellarium-ai/cellarium-ml/blob/220ba90b47378c99d4c08b9d91c5c31b796cb3ca/cellarium/ml/data/schema.py#L62

Here's a simplified version of my problematic example (which occurred during scvi testing with multiple datasets and a "batch" variable as a categorical column in adata.obs):

import pandas as pd

df = pd.DataFrame(data={'col1': ['a', 'b', 'c'], 'col2': ['b', 'a', 'c']})
df['col1'] = df['col1'].astype('category')
df['col2'] = df['col2'].astype('category')

df[['col1']].dtypes.equals(df[['col2']].dtypes)

This, shockingly, returns False.

I propose to change line 62

ordabayevy commented 4 months ago

I think this is happening because .dtypes returns a Series whose index is the original DataFrame's column names. So in one case the index is col1 and in the other it is col2. Series.equals checks that the column names are the same and that the dtype matches for each column. If you want to validate only a subset of columns, you can specify the obs_columns_to_validate argument in DistributedAnnDataCollection.
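A minimal sketch of this explanation, reusing the two-column frame from the original example. The variable names are illustrative and not taken from cellarium-ml. It separates the label comparison from the dtype comparison:

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "c"], "col2": ["b", "a", "c"]})
df = df.astype("category")

dtypes1 = df[["col1"]].dtypes  # Series indexed by the label "col1"
dtypes2 = df[["col2"]].dtypes  # Series indexed by the label "col2"

# The index labels of the two dtype Series differ ...
labels_differ = list(dtypes1.index) != list(dtypes2.index)

# ... even though the categorical dtypes themselves compare equal
# (both are unordered with categories a, b, c).
same_dtype = dtypes1.iloc[0] == dtypes2.iloc[0]

# Series.equals takes the index labels into account, so it reports a mismatch.
series_equal = dtypes1.equals(dtypes2)

# Once the column names agree, the same comparison succeeds.
renamed = df[["col2"]].rename(columns={"col2": "col1"})
equal_after_rename = dtypes1.equals(renamed.dtypes)

print(labels_differ, same_dtype, series_equal, equal_after_rename)
```

Under these assumptions, the False in the original example comes from the differing column names rather than from the categorical dtype.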

sjfleming commented 4 months ago

I should change my example. I think this also happens when the names match, but let me check.

ordabayevy commented 4 months ago

Have you checked that the categories for the batch column match between the anndata files? It's possible that they were not synchronized when the extracts were created.
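To illustrate what unsynchronized categories would look like, here is a hedged sketch with two made-up obs frames (obs1 and obs2 are hypothetical stand-ins for adata.obs from two extracts, not real data). Casting both to a shared pd.CategoricalDtype is one possible way to align them, not necessarily how the extracts are meant to be built:

```python
import pandas as pd

# Two hypothetical extracts whose "batch" column saw different values,
# so each one infers its own set of categories.
obs1 = pd.DataFrame({"batch": ["a", "b"]}).astype("category")
obs2 = pd.DataFrame({"batch": ["b", "c"]}).astype("category")

# Same column name, but the category sets differ, so the dtypes differ.
equal_before = obs1.dtypes.equals(obs2.dtypes)

# Cast both columns to one shared categorical dtype (an assumed fix).
shared = pd.CategoricalDtype(categories=["a", "b", "c"])
obs1["batch"] = obs1["batch"].astype(shared)
obs2["batch"] = obs2["batch"].astype(shared)

equal_after = obs1.dtypes.equals(obs2.dtypes)

print(equal_before, equal_after)
```

A dtype comparison of this kind fails whenever any one file has seen a batch label that another file has not.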

sjfleming commented 4 months ago

Hmmm it looks like you're right. Maybe it is not the fault of

ref_value.dtypes.equals(value.dtypes)

sjfleming commented 4 months ago

Yeah, I guess I was wrong about this:

import pandas as pd

df1 = pd.DataFrame(data={'col1': ['a', 'b', 'c']})
df2 = pd.DataFrame(data={'col1': ['a', 'b', 'c', 'b']})
# Convert the column itself rather than assigning a whole DataFrame to it
df1['col1'] = df1['col1'].astype('category')
df2['col1'] = df2['col1'].astype('category')
df1.dtypes.equals(df2.dtypes)

True

sjfleming commented 4 months ago

I'll close the linked PR