chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

latest release cell metadata "obs" containing "dictionary" datatype in some columns #1081

Closed sunhuaiyu closed 3 months ago

sunhuaiyu commented 3 months ago

Describe the bug

The data type of some columns in "obs" used to be "large_string" but is now "dictionary".

To Reproduce

import cellxgene_census

CENSUS_VERSION = "latest"
VALUE_FILTER = "is_primary_data == True and assay_ontology_term_id in ['EFO:0010550', 'EFO:0009901', 'EFO:0011025', 'EFO:0009899', 'EFO:0009900', 'EFO:0009922', 'EFO:0030003', 'EFO:0030004', 'EFO:0008995', 'EFO:0008919', 'EFO:0008722', 'EFO:0010010'] and tissue_general == 'liver'"

# Read the filtered human cell metadata ("obs") into a single pyarrow.Table
with cellxgene_census.open_soma(census_version=CENSUS_VERSION) as census:
    human_cell_metadata = (
        census["census_data"]["homo_sapiens"]
        .obs
        .read(value_filter=VALUE_FILTER)
        .concat()
    )

# Print the column types of the returned table
print(human_cell_metadata.schema)

Expected behavior

These columns should be "large_string", as before. Instead, the returned table has this schema:

pyarrow.Table
soma_joinid: int64
dataset_id: dictionary
assay: dictionary
assay_ontology_term_id: dictionary
cell_type: dictionary
cell_type_ontology_term_id: dictionary
development_stage: dictionary
development_stage_ontology_term_id: dictionary
disease: dictionary
disease_ontology_term_id: dictionary
donor_id: dictionary
is_primary_data: bool
observation_joinid: large_string
self_reported_ethnicity: dictionary
self_reported_ethnicity_ontology_term_id: dictionary
sex: dictionary
sex_ontology_term_id: dictionary
suspension_type: dictionary
tissue: dictionary
tissue_ontology_term_id: dictionary
tissue_type: dictionary
tissue_general: dictionary
tissue_general_ontology_term_id: dictionary
raw_sum: double
nnz: int64
raw_mean_nnz: double
raw_variance_nnz: double
n_measured_vars: int64

Environment

ubuntu20.04
python=3.11
cellxgene-census==1.12.0
tiledbsoma==1.9.3
pyarrow==15.0.2

Additional context

This must have changed within the past month. The previous "latest" version, from 2024-02-21, did not have this change.

pablo-gar commented 3 months ago

Thanks for filing this ticket @sunhuaiyu. This week we rolled out a change to encode certain Census cell metadata as categorical; you are observing the effects of that change.

You can find more details here:

https://chanzuckerberg.github.io/cellxgene-census/articles/2024/20240404-categoricals.html