How to download and use local soma directory from cellxgene_census.open_soma()?

chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census

https://chanzuckerberg.github.io/cellxgene-census/

MIT License


How to download and use local soma directory from cellxgene_census.open_soma()? #1201

Closed Alex2975 closed 1 week ago

Alex2975 commented 1 week ago

Dear Authors,

If I want to speed up retrieving the cells, can I download only the soma folder, using aws s3 sync? Will this work: aws s3 sync --no-sign-request s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/ /tmp/census_soma

If the above works, how should I call open_soma()? Will this work, where the /tmp/census_soma folder contains the objects from s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/: with cellxgene_census.open_soma(uri="/tmp/census_soma") as census:

If the above works, should I change the tiledb_config to speed up the IO of getting cells, for example by making the following buffer bigger? with cellxgene_census.open_soma(tiledb_config={"py.init_buffer_bytes": 128 * 1024**2}) as census:

Thank you so much.

pablo-gar commented 1 week ago

Yes that should work. You can see our documentation related to opening up a local copy of Census here:

https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_aws_open_data.html#how-to-access-aws-census-data
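For readers who want the two steps in one place, here is a minimal sketch, assuming the AWS CLI is installed and using the bucket path and the /tmp/census_soma directory from the question. The helper names (build_sync_command, download_census, open_local_census) are hypothetical and not part of cellxgene_census. Note that aws s3 sync needs a destination argument after the S3 URI.

```python
import subprocess

# Bucket path and local directory are taken from the question in this thread.
CENSUS_S3_URI = "s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/"
LOCAL_SOMA_DIR = "/tmp/census_soma"


def build_sync_command(s3_uri=CENSUS_S3_URI, dest=LOCAL_SOMA_DIR):
    """Return the argv list for `aws s3 sync`; the destination is required."""
    return ["aws", "s3", "sync", "--no-sign-request", s3_uri, dest]


def download_census(s3_uri=CENSUS_S3_URI, dest=LOCAL_SOMA_DIR):
    """Run the sync (hypothetical helper; this downloads a large amount of data)."""
    subprocess.run(build_sync_command(s3_uri, dest), check=True)


def open_local_census(path=LOCAL_SOMA_DIR):
    """Open the local copy; use the returned object as a context manager."""
    import cellxgene_census  # imported lazily so the helpers above work without it

    return cellxgene_census.open_soma(uri=path)


# Example usage (not executed here):
# download_census()
# with open_local_census() as census:
#     ...
```

Keeping the command as a list avoids shell quoting issues and makes it easy to print the command before running it.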

in order to speed up the IO of getting cells, should I change the tiledb_config, such as making the following buffer bigger? with cellxgene_census.open_soma(tiledb_config={"py.init_buffer_bytes": 128 * 1024**2})

I recommend you use the defaults, which keep memory utilization to no more than 1 GB. With a local copy of Census that will work out just great. Your config sets the buffer to 128 MiB (about 0.1 GB), which is actually pretty low.

I do think increasing the buffer size may offer some advantages for a local copy. If you want to do so, I recommend 8 GB:

{
    "py.init_buffer_bytes": 8 * 1024**3,
    "soma.init_buffer_bytes": 8 * 1024**3,
}
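As a rough sketch of how that config could be passed when opening a local copy: the two keys come from the reply above, and uri and tiledb_config are the open_soma() arguments already used in this thread. The helper names and the /tmp/census_soma default are illustrative assumptions.

```python
MIB = 1024**2
GIB = 1024**3


def buffer_config(n_bytes):
    """Build the TileDB config dict with both buffer keys set to n_bytes."""
    return {
        "py.init_buffer_bytes": n_bytes,
        "soma.init_buffer_bytes": n_bytes,
    }


def open_local_census_with_buffers(path="/tmp/census_soma", n_bytes=8 * GIB):
    """Open a local Census copy with larger buffers (hypothetical helper)."""
    import cellxgene_census  # imported lazily so buffer_config works without it

    return cellxgene_census.open_soma(uri=path, tiledb_config=buffer_config(n_bytes))


# For comparison, the value from the question is 128 MiB:
# 128 * MIB / GIB == 0.125
```

Make sure the machine has enough free RAM for the chosen buffer size before raising it.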

Alex2975 commented 1 week ago

Thank you very much for the instructions, @pablo-gar .

Alex2975 commented 1 week ago

@pablo-gar , regarding the normalization for Smart-Seq (feat: the normalized layer should contain gene-length normalized counts from SmartSeq data #813), is it done and available for the latest release (2023-12-15)? Thank you very much.

Alex2975 commented 1 week ago

@pablo-gar , would you please also comment on where the duplicated cells come from? One possibility I can think of is that the authors submitted the same cells in multiple h5ad files. Could that be the case? Are there other scenarios that could result in duplicated cells? Thank you very much.

pablo-gar commented 1 week ago

@pablo-gar regarding the normalization for Smart-Seq (feat: the normalized layer should contain gene-length normalized counts from SmartSeq data https://github.com/chanzuckerberg/cellxgene-census/issues/813), is it done and available for the latest release (2023-12-15)? Thank you very much.

No. The normalized layer with that fix is available in the "latest non-LTS version" of Census data (census_version = "latest"). We will publish the new LTS next week, so you can also wait for that one.

Then in get_anndata() you can use X_name or X_layers to get the layer.
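A minimal sketch of such a query, assuming the layer is named "normalized" as described above and that the installed cellxgene_census version supports X_layers. The helper names and the example filter are hypothetical; the assay label in particular should be checked against the obs metadata.

```python
def normalized_query_kwargs(obs_value_filter, organism="Homo sapiens"):
    """Keyword arguments for get_anndata(): raw counts in X plus the normalized layer."""
    return {
        "organism": organism,
        "X_name": "raw",               # layer placed in adata.X
        "X_layers": ["normalized"],    # extra layers placed in adata.layers
        "obs_value_filter": obs_value_filter,
    }


def fetch_with_normalized_layer(obs_value_filter):
    """Query the latest non-LTS Census build (hypothetical helper)."""
    import cellxgene_census  # imported lazily so the kwargs helper works without it

    with cellxgene_census.open_soma(census_version="latest") as census:
        return cellxgene_census.get_anndata(
            census, **normalized_query_kwargs(obs_value_filter)
        )


# Example usage (not executed here; the assay label is an assumption):
# adata = fetch_with_normalized_layer("assay == 'Smart-seq2'")
# adata.layers["normalized"]
```

Alternatively, passing X_name="normalized" would put the normalized values directly in adata.X.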

pablo-gar commented 1 week ago

@pablo-gar , would you please also comment on where the duplicated cells come from? One possibility I can think of is that the authors submitted the same cells in multiple h5ad files. Could that be the case? Are there other scenarios that could result in duplicated cells? Thank you very much.

There are two scenarios where that happens:

Multiple datasets of the same collection contain some level of duplication. For example, Tabula Sapiens has an "All cells" dataset and then one dataset per compartment.
Meta-analysis of existing data elsewhere in CELLxGENE. For example, this Azimuth dataset.
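One common way to avoid pulling such duplicates is to filter on the is_primary_data column of the Census obs metadata. This is not stated in the thread, so treat the following as a sketch; the helper name primary_only is hypothetical.

```python
def primary_only(obs_value_filter=None):
    """Combine an optional obs filter with the is_primary_data condition."""
    clause = "is_primary_data == True"
    if not obs_value_filter:
        return clause
    # Parenthesize the caller's filter so any `or` inside it is not rebound.
    return f"({obs_value_filter}) and {clause}"


# Example usage (not executed here; the tissue filter is illustrative):
# cellxgene_census.get_anndata(
#     census,
#     organism="Homo sapiens",
#     obs_value_filter=primary_only("tissue_general == 'lung'"),
# )
```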

Alex2975 commented 1 week ago

Great, thank you so much for the insights, @pablo-gar .