@ordabayevy I may be wrong about this, but in trying to write a test for the `scvi` CLI using the test files, I found that the categorical `obs` column `adata.obs["dataset_id"]` seems to have categories which are specific to the given shard. By this I mean that if I download `test_0.h5ad` and `test_1.h5ad` and compare them, I see that the two shards do not contain the same categories. This makes it impossible to run `scvi`, since we need the categories to match in every shard. (We use `cellarium.ml.utilities.data.categories_to_codes` and it returns values of `-1` sometimes, which is what caused me to investigate.)

Can these test data files be regenerated so that all categorical `obs` values contain every possible category (in `.cat.categories`) in each shard?
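
To make the request concrete, here is a minimal sketch of the mismatch and of the kind of alignment I have in mind. It uses plain pandas Series as hypothetical stand-ins for `adata.obs["dataset_id"]` in two shards; the labels `ds_a`, `ds_b`, `ds_c` are made up, and this is not how the test files are actually generated. The `-1` shown here is what pandas itself assigns to a value outside the category list; I am only assuming that this is related to what `categories_to_codes` reports.

```python
import pandas as pd

# Hypothetical stand-ins for adata.obs["dataset_id"] in two shards.
shard_0 = pd.Series(["ds_a", "ds_b", "ds_a"], dtype="category")
shard_1 = pd.Series(["ds_c", "ds_b"], dtype="category")

# Each shard only knows the categories it happens to contain,
# so the same label can map to a different integer code per shard.
cats_0 = list(shard_0.cat.categories)
cats_1 = list(shard_1.cat.categories)

# Encoding shard_1 against shard_0's categories: the unseen label
# becomes missing, and pandas gives a missing value the code -1.
mismatched = shard_1.cat.set_categories(shard_0.cat.categories)

# Possible fix: give every shard the union of all categories.
all_categories = sorted(set(cats_0) | set(cats_1))
aligned_0 = shard_0.cat.set_categories(all_categories)
aligned_1 = shard_1.cat.set_categories(all_categories)

print(cats_0, cats_1)
print(list(mismatched.cat.codes))
print(list(aligned_0.cat.codes), list(aligned_1.cat.codes))
```

With the categories aligned this way, a given label gets the same code in every shard and no value falls outside the category list.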