NikicaJEa closed this 9 months ago
The obs dataframe contains the dataset ID in a column named dataset_id (see the schema doc for details).
An example where I read dataset_id (sorry, used Python):
In [1]: import cellxgene_census
In [2]: census = cellxgene_census.open_soma(census_version="latest")
In [3]: human_obs = census['census_data']['homo_sapiens'].obs.read().concat().to_pandas()
In [4]: human_obs.keys()
Out[4]:
Index(['soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id',
'cell_type', 'cell_type_ontology_term_id', 'development_stage',
'development_stage_ontology_term_id', 'disease',
'disease_ontology_term_id', 'donor_id', 'is_primary_data',
'observation_joinid', 'self_reported_ethnicity',
'self_reported_ethnicity_ontology_term_id', 'sex',
'sex_ontology_term_id', 'suspension_type', 'tissue',
'tissue_ontology_term_id', 'tissue_type', 'tissue_general',
'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz',
'raw_variance_nnz', 'n_measured_vars'],
dtype='object')
In [5]: human_obs.dataset_id
Out[5]:
0 cda48edf-331d-4ada-96ea-104ccb3147dd
1 cda48edf-331d-4ada-96ea-104ccb3147dd
2 cda48edf-331d-4ada-96ea-104ccb3147dd
3 cda48edf-331d-4ada-96ea-104ccb3147dd
4 cda48edf-331d-4ada-96ea-104ccb3147dd
...
70620414 4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
70620415 4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
70620416 4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
70620417 4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
70620418 4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
Name: dataset_id, Length: 70620419, dtype: object
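Reading the whole obs table takes a lot of memory, so here is a rough sketch of pulling only the cells and columns you need and attaching dataset-level information to each cell. The helper names `cell_type_filter` and `obs_with_dataset_info` are made up for illustration. I'm assuming the `census_info` / `datasets` table carries columns such as `collection_name` and `collection_doi` alongside `dataset_id`; please check the schema doc for the Census version you use.

```python
def cell_type_filter(cell_types):
    """Build a value_filter string that selects the given cell types.

    Hypothetical helper; assumes the cell type labels contain no quote
    characters that would need escaping.
    """
    quoted = ", ".join(repr(ct) for ct in cell_types)
    return f"cell_type in [{quoted}]"


def obs_with_dataset_info(cell_types, census_version="latest"):
    """Return human obs rows for the given cell types, joined to the datasets table.

    Hypothetical helper; needs network access and the cellxgene_census package.
    """
    # Imported here so cell_type_filter can be used without the package installed.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        # Read only the matching cells and only the columns we care about.
        obs = (
            census["census_data"]["homo_sapiens"]
            .obs.read(
                value_filter=cell_type_filter(cell_types),
                column_names=["soma_joinid", "cell_type", "dataset_id"],
            )
            .concat()
            .to_pandas()
        )
        # One row per dataset (assumed to include collection/publication fields).
        datasets = census["census_info"]["datasets"].read().concat().to_pandas()

    # Left join keeps every cell; the suffix avoids a clash on soma_joinid.
    return obs.merge(datasets, on="dataset_id", how="left", suffixes=("", "_dataset"))
```

Something like `obs_with_dataset_info(["B cell"])` should then give one row per matching cell with the dataset columns attached, which you can filter further by publication.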
Whoops, you are right. Somehow it didn't appear when I extracted a small subset. Now I see it. Thanks for the help and sorry for bothering you! You can close this thread.
Hi! Thanks for creating this great API! I have a small feature request. Usually, when I want to extract some data from your database, I filter it by cell type. I do it the following way:
Then I continue with some additional filtering criteria. One important criterion for me is knowing which publication/dataset each cell comes from. I couldn't find this info in the cell_metadata, so I guess it's not included (or I missed something). If that is the case, it would be helpful if the cell_metadata had an additional column like dataset_id. Thanks!