including dataset_id into cell_metadata

NikicaJEa commented 9 months ago

Hi! Thanks for creating this great API! I have a small feature request. Usually, when I want to extract data from your database, I filter it by cell type. I do it the following way:

library(cellxgene.census)

# Open obs SOMADataFrame
census <- open_soma()

cell_metadata <-  census$get("census_data")$get("homo_sapiens")$get("obs")

# Read as Arrow Table
cell_metadata <-  cell_metadata$read(
  value_filter = "cell_type %in% c('celltype_1', 'celltype_2')",
  column_names = NULL
)

# Concatenates results to an Arrow Table
cell_metadata <-  cell_metadata$concat()

Then I continue with some additional filtering criteria. One important criterion for me is knowing which publication/dataset each cell comes from. I couldn't find this info in the cell_metadata, so I guess it's not included (or I missed something). If this is the case, it would be helpful if the cell_metadata had an additional column like dataset_id. Thanks!

bkmartinjr commented 9 months ago

The obs dataframe contains the dataset_id (in a column named dataset_id -- see the schema doc for details).

An example where I read dataset_id (sorry, used Python):

In [1]: import cellxgene_census

In [2]: census = cellxgene_census.open_soma(census_version="latest")

In [3]: human_obs = census['census_data']['homo_sapiens'].obs.read().concat().to_pandas()

In [4]: human_obs.keys()
Out[4]: 
Index(['soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id',
       'cell_type', 'cell_type_ontology_term_id', 'development_stage',
       'development_stage_ontology_term_id', 'disease',
       'disease_ontology_term_id', 'donor_id', 'is_primary_data',
       'observation_joinid', 'self_reported_ethnicity',
       'self_reported_ethnicity_ontology_term_id', 'sex',
       'sex_ontology_term_id', 'suspension_type', 'tissue',
       'tissue_ontology_term_id', 'tissue_type', 'tissue_general',
       'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz',
       'raw_variance_nnz', 'n_measured_vars'],
      dtype='object')

In [5]: human_obs.dataset_id
Out[5]: 
0           cda48edf-331d-4ada-96ea-104ccb3147dd
1           cda48edf-331d-4ada-96ea-104ccb3147dd
2           cda48edf-331d-4ada-96ea-104ccb3147dd
3           cda48edf-331d-4ada-96ea-104ccb3147dd
4           cda48edf-331d-4ada-96ea-104ccb3147dd
                            ...                 
70620414    4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
70620415    4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
70620416    4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
70620417    4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
70620418    4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
Name: dataset_id, Length: 70620419, dtype: object
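
Reading the whole obs dataframe as above pulls in roughly 70 million rows. For the cell-type filtering described in the original question, a narrower read may be enough. The following is a minimal sketch, not an official recipe: the helper names `build_cell_type_filter`, `read_obs_for_cell_types`, and `main` are hypothetical, the cell type labels are placeholders, and it assumes the `value_filter` and `column_names` arguments of `obs.read()` behave as in the R call shown earlier.

```python
def build_cell_type_filter(cell_types):
    """Build a value_filter string that keeps only the given cell types.

    Hypothetical helper, not part of cellxgene_census.
    """
    quoted = ", ".join(repr(str(ct)) for ct in cell_types)
    return f"cell_type in [{quoted}]"


def read_obs_for_cell_types(census, cell_types, organism="homo_sapiens"):
    """Read a filtered slice of obs, keeping dataset_id for each cell.

    `census` is assumed to be the object returned by
    cellxgene_census.open_soma().
    """
    obs = census["census_data"][organism].obs
    return (
        obs.read(
            value_filter=build_cell_type_filter(cell_types),
            # Explicitly request dataset_id alongside the filtered column.
            column_names=["soma_joinid", "dataset_id", "cell_type"],
        )
        .concat()
        .to_pandas()
    )


def main():
    # Imported here so the helpers above can be used without the package.
    import cellxgene_census

    # Requires network access to the Census.
    with cellxgene_census.open_soma(census_version="latest") as census:
        # "B cell" and "T cell" are placeholder labels for illustration.
        df = read_obs_for_cell_types(census, ["B cell", "T cell"])
        # Count how many of the selected cells come from each dataset.
        print(df["dataset_id"].value_counts().head())
```

Calling `main()` runs the query against the live Census. Listing `dataset_id` in `column_names` makes it explicit that the column is part of the result, which helps when a subset seems to be missing it.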

NikicaJEa commented 9 months ago

Whoops, you are right. It somehow didn't appear when I extracted a small subset. Now I see it. Thanks for the help and sorry for bothering you! You can close this thread.

chanzuckerberg / cellxgene-census

including dataset_id into cell_metadata #952