chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Can we use soma_joinid as cell_id #1195

Closed: Alex2975 closed this issue 2 weeks ago

Alex2975 commented 2 weeks ago

Dear Authors,

The following code gives me the obs for all the cells. Each cell has a soma_joinid. Can I use this id as a cell id? Will the soma_joinid be the same for each cell every time I query the database? Can I query the Census by soma_joinid to get the same specific cell each time? Thank you so much for the help.

human = census['census_data']['homo_sapiens']
obs_df = human.obs.read().concat().to_pandas()

pablo-gar commented 2 weeks ago

It will, as long as you use the same Census version when you open the handle to it. You can learn more about Census data releases and versioning here: Census data releases.

To open the last LTS version of Census as of today you can do the following:

import cellxgene_census
census = cellxgene_census.open_soma(census_version = "2023-12-15")

Unfortunately, traceability across Census versions is not possible at the moment. A partial solution may come in the future; see this ticket for more information.
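To illustrate the point above, here is a minimal sketch of fetching specific cells by soma_joinid from one pinned Census version. The helper names cell_key and fetch_cells are hypothetical, and the sketch assumes a cellxgene_census version whose get_anndata() accepts the obs_coords argument; it is not an official recipe.

```python
def cell_key(census_version: str, soma_joinid: int) -> str:
    """Pair a soma_joinid with the Census version it came from.

    Hypothetical helper: a soma_joinid only identifies the same cell
    within one Census version, so the version is stored alongside it.
    """
    return f"{census_version}:{int(soma_joinid)}"


def fetch_cells(census_version: str, soma_joinids: list[int]):
    """Fetch specific cells by soma_joinid from one pinned Census version."""
    # Imported lazily: requires the cellxgene_census package and network access.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        # Assumption: this version of get_anndata() supports obs_coords,
        # which selects rows of obs by their soma_joinid values.
        return cellxgene_census.get_anndata(
            census=census,
            organism="Homo sapiens",
            obs_coords=sorted(set(soma_joinids)),
        )
```

Calling fetch_cells("2023-12-15", [0, 1, 2]) twice should return the same cells, whereas the same ids under a different census_version may point at different cells.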

Alex2975 commented 2 weeks ago

Thank you so much, @pablo-gar. I also noticed that the gene expression values from cellxgene_census.get_anndata() differ from those in the h5ad downloaded from the cellxgene data portal. Which one would you recommend for building models? Also, one h5ad file can contain multiple assay types (I think the same is true of a cellxgene_census.get_anndata() query). Should I normalize across those assays if I want to get the highly variable genes? Thank you.

adata_from_api = cellxgene_census.get_anndata(
    census = census,
    organism = "Homo sapiens",
    obs_value_filter = "dataset_id == '983d5ec9-40e8-4512-9e65-a572a9c486cb' ",
)

adata_from_web_download: https://datasets.cellxgene.cziscience.com/c18b60ea-7dbc-4705-a3dc-e29da4e43c68.h5ad

pablo-gar commented 2 weeks ago

Happy to assist you.

The data from web downloads is in the form delivered by the original contributors. In some cases the contributors provide a normalized layer (optional) in addition to the raw counts (required). When that's the case, adata.X has the normalized values and adata.raw.X has the raw counts.

In comparison, when you use the Census defaults of get_anndata() you will always get the raw counts in adata.X. That's why you are observing a difference. When comparing adata_from_api.X with adata_from_web_download.raw.X, there should not be any differences.
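As a rough sketch of the rule above, the hypothetical helper raw_counts below picks the raw-count matrix from either kind of object. It only assumes an AnnData-like object with .X and .raw attributes; the SimpleNamespace objects are toy stand-ins, not real data.

```python
from types import SimpleNamespace


def raw_counts(adata):
    """Return the raw-count matrix of an AnnData-like object.

    Portal downloads keep raw counts in adata.raw.X when adata.X holds a
    normalized layer; the Census get_anndata() defaults put raw counts
    directly in adata.X and leave adata.raw unset.
    """
    raw = getattr(adata, "raw", None)
    return raw.X if raw is not None else adata.X


# Toy stand-ins for real AnnData objects (illustrative values only).
from_web = SimpleNamespace(X=[[0.0, 1.2]], raw=SimpleNamespace(X=[[0, 3]]))
from_api = SimpleNamespace(X=[[0, 3]], raw=None)

# Both sources resolve to the same raw counts.
assert raw_counts(from_web) == raw_counts(from_api)
```

With real objects, the two matrices may not share the same cell or gene order, so they would likely need to be aligned on cell and gene identifiers before an element-wise comparison.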

Which one would you recommend to use for building models?

If you are planning to do cross-dataset analysis, I would recommend the Census API, as it guarantees consistency across datasets. Otherwise the web downloads suffice.

should I normalize those two assays if I want to get the highly variable genes?

When it comes to best practices for single-cell analysis, we are not in the best position to provide concrete guidance; I'd recommend browsing some review papers on this. For example, some recent models do not strictly normalize or integrate the data, whereas other analyses do require normalization or integration.

For highly variable genes, there are a few methods that take batches into consideration; see, for example, our own Census method based on Scanpy's implementation of the Seurat V3 algorithm.
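A minimal sketch of such a batch-aware call, assuming the Census method referred to is cellxgene_census.experimental.pp.get_highly_variable_genes and that it accepts the keyword arguments shown. The helper names, the choice of "assay" as the batch key, and the n_top_genes value are illustrative assumptions, not recommendations.

```python
def hvg_query_kwargs(dataset_id: str, n_top_genes: int = 2000,
                     batch_key: str = "assay") -> dict:
    """Build keyword arguments for a batch-aware HVG query (hypothetical helper)."""
    return {
        "organism": "Homo sapiens",
        "obs_value_filter": f"dataset_id == '{dataset_id}'",
        "n_top_genes": n_top_genes,
        # Assumption: treating each assay as a batch is one possible choice.
        "batch_key": batch_key,
    }


def highly_variable_genes_for_dataset(census, dataset_id: str):
    """Compute highly variable genes for one dataset from an open Census handle."""
    # Imported lazily: requires the cellxgene_census package.
    from cellxgene_census.experimental.pp import get_highly_variable_genes

    return get_highly_variable_genes(census, **hvg_query_kwargs(dataset_id))
```

Here census is a handle returned by cellxgene_census.open_soma(), as in the example earlier in the thread.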

@Alex2975 it seems like you were able to use the API. If that's the case, could we consider the following resolved: #1174?

Alex2975 commented 2 weeks ago

@pablo-gar, thank you so much for the help. It is great to understand the differences between those datasets. I have updated #1174.