cellarium-ai / cellarium-ml

Distributed single-cell data analysis.
BSD 3-Clause "New" or "Revised" License

Test data: categorical obs fields #161

Closed by sjfleming 6 months ago

sjfleming commented 6 months ago

@ordabayevy I may be wrong about this, but while trying to write a test for the scvi CLI using the test files

https://storage.googleapis.com/dsp-cellarium-cas-public/test-data/test_{0..1}.h5ad

I found that the categorical obs column adata.obs["dataset_id"] seems to have categories which are specific to the given shard. By this I mean that if I download test_0.h5ad and test_1.h5ad and run

>>> import scanpy as sc

>>> adata0 = sc.read_h5ad('test_0.h5ad')
>>> adata1 = sc.read_h5ad('test_1.h5ad')

>>> len(adata0.obs['dataset_id'].cat.categories)
66
>>> len(adata1.obs['dataset_id'].cat.categories)
75

I see that the two shards do not contain the same categories (66 in one, 75 in the other). This makes it impossible to run scvi, since we need the categories to match in every shard. (We use cellarium.ml.utilities.data.categories_to_codes, and it sometimes returns -1, which is what prompted me to investigate.)
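
To illustrate the mismatch, here is a minimal sketch using plain pandas only. It does not call cellarium.ml.utilities.data.categories_to_codes, and the dataset IDs (ds_a, ds_b, ds_c) are made up for illustration. It shows that each shard assigns its own integer codes, and that a value missing from a given category list is encoded as -1:

```python
import pandas as pd

# Hypothetical dataset IDs standing in for adata.obs["dataset_id"] in two shards.
shard0 = pd.Series(["ds_a", "ds_b", "ds_a"], dtype="category")
shard1 = pd.Series(["ds_b", "ds_c"], dtype="category")

# Each shard infers its own categories, so "ds_b" is code 1 in shard0 but code 0 in shard1.
codes0 = shard0.cat.codes.tolist()
codes1 = shard1.cat.codes.tolist()

# Encoding shard1's values against shard0's categories: "ds_c" is unknown there,
# and pandas marks values outside the category list with the code -1.
recoded = pd.Categorical(
    shard1.astype(object), categories=shard0.cat.categories
).codes.tolist()

print(codes0, codes1, recoded)
```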

Can these test data files be regenerated so that every categorical obs column lists every possible category (in .cat.categories) in each shard?
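
One possible way to do this, sketched with plain pandas: take the union of the categories across shards and set it on every shard. The helper name harmonize_categories and the dataset IDs are hypothetical, and this is not necessarily how the test files are actually generated.

```python
import pandas as pd


def harmonize_categories(columns):
    """Return copies of categorical Series that all share the sorted union of categories."""
    all_categories = sorted(set().union(*(col.cat.categories for col in columns)))
    return [col.cat.set_categories(all_categories) for col in columns]


# Hypothetical stand-ins for adata0.obs["dataset_id"] and adata1.obs["dataset_id"].
shard0 = pd.Series(["ds_a", "ds_b", "ds_a"], dtype="category")
shard1 = pd.Series(["ds_b", "ds_c"], dtype="category")

fixed0, fixed1 = harmonize_categories([shard0, shard1])

# Both shards now list the same categories, so a given label maps to the same code everywhere.
print(list(fixed0.cat.categories), fixed0.cat.codes.tolist(), fixed1.cat.codes.tolist())
```

With real shards, the harmonized columns would be assigned back to each adata.obs["dataset_id"] before writing the files out again, for example with write_h5ad.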

ordabayevy commented 6 months ago

Great catch! I've updated the test anndata files. Can you check if it works now?

sjfleming commented 6 months ago

Woohoo, works now! Thank you!