chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/

Build MEMENTO cube w/new dimensions #603

Closed. atolopko-czi closed this issue 1 year ago.

atolopko-czi commented 1 year ago

In https://github.com/mincheoly/memento-cxg, use these revised dims to build the cube:

CUBE_DIMS_OBS = [
    "cell_type",
    "dataset_id",
    "assay",
    "suspension_type",
    "donor_id",
    "disease",
    "sex"
]

Determine which of these should be TileDB Dims (indexed) vs. Attrs, and the ordering of the Dims, based upon selectivity.
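A minimal sketch of that split using the tiledb Python API; the dim/attr assignment and ordering shown here are hypothetical placeholders pending the selectivity analysis:

import tiledb

# Hypothetical split: the more selective fields become indexed dimensions,
# ordered most- to least-selective; the remainder become unindexed attributes.
dims = [
    tiledb.Dim(name="cell_type", dtype="ascii"),
    tiledb.Dim(name="dataset_id", dtype="ascii"),
]
attrs = [
    tiledb.Attr(name=name, dtype="ascii")
    for name in ("assay", "suspension_type", "donor_id", "disease", "sex")
]
schema = tiledb.ArraySchema(
    domain=tiledb.Domain(*dims),
    attrs=attrs,
    sparse=True,  # string dimensions require a sparse array
)
tiledb.Array.create("estimators_cube_example", schema)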

atolopko-czi commented 1 year ago

Ran for 12 hours and was only 50% complete, so it appears to be 2x slower with extra dimensions. Will attempt some profiling & optimizations.

atolopko-czi commented 1 year ago

Using 500 samples for the multinomial (instead of 5000), the estimators have been generated in 28 hours on an r6id.24xlarge. However, since the earlier partial run w/5000 samples was projected to run in a similar time, it's not clear the lower sample count arg is actually helping.

I also tested a numba-related optimization of the multinomial, but it doesn't provide any improvement.

Need to re-profile to confirm the multinomial generation is in fact the performance hotspot.
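A standard-library way to do that re-profiling (run_pass_2 is a hypothetical stand-in for the actual pass 2 driver):

import cProfile
import pstats

# Profile the pass 2 computation and rank callees by cumulative time to see
# whether gen_multinomial still dominates.
cProfile.run("run_pass_2()", "pass2.prof")
pstats.Stats("pass2.prof").sort_stats("cumulative").print_stats(20)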

atolopko-czi commented 1 year ago

As of commit 2cc5cfce570270ccfbff2e4cd3194a72e493daa2, profiling on small test fixture data on a Mac M1 shows that the performance hotspots are gen_multinomial and compute_all_estimators_for_obs_group, which together consume ~87% of the pass 2 computation time when the multinomial sample count is 5000. Reducing the multinomial sample count from 5000 to 500 reduces gen_multinomial's cumulative time from 36% to 17%.

compute_all_estimators_for_obs_group accounts for ~50% in both cases, and its poor performance is due to Pandas groupby. We may be able to optimize that by using numpy+numba to perform the grouping instead of Pandas: https://gist.github.com/flcong/cabff3be5f7d96820d62b7f5e264f779.
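A pure-numpy sketch of the sort-based grouping idea from that gist (names are illustrative; a numba kernel could replace the reduction for a further speedup):

import numpy as np

def group_sums(codes: np.ndarray, values: np.ndarray):
    # Sum `values` per integer group label in `codes` via sort + reduceat,
    # avoiding the Pandas groupby overhead.
    order = np.argsort(codes, kind="stable")
    sorted_codes = codes[order]
    sorted_values = values[order]
    # Start index of each contiguous run of equal codes
    group_ids, starts = np.unique(sorted_codes, return_index=True)
    return group_ids, np.add.reduceat(sorted_values, starts)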

multinomial w/5000 samples (w/numba): [image: flame_multinomial5000]

multinomial w/500 samples (w/numba): [image: flame_multinomial500]

multinomial w/5000 samples (without numba): [image: flame_multinomial5000_no_numba]

mincheoly commented 1 year ago

@atolopko-czi is there a smaller version of the cube (but with all the covariate/metadata info) that I could use as test input for the hypothesis testing part?

atolopko-czi commented 1 year ago

@mincheoly We can create a smaller cube by specifying a more constrained OBS_VALUE_FILTER. Perhaps we can use a single tissue? What size cube would you like? I'm happy to perform the run once decided.

In [2]: cell_counts_df = c['census_info']['summary_cell_counts'].read().concat().to_pandas()
   ...: cell_counts_df[(cell_counts_df.category == 'tissue_general') & (cell_counts_df.organism == 'Homo sapiens')] \
   ...:     .sort_values('unique_cell_count', ascending=False).head(5)
Out[2]:
     soma_joinid      organism        category ontology_term_id  unique_cell_count  total_cell_count   label
965          965  Homo sapiens  tissue_general   UBERON:0000955            9309576          16053285   brain
956          956  Homo sapiens  tissue_general   UBERON:0000178            8847169           9732410   blood
986          986  Homo sapiens  tissue_general   UBERON:0002048            2907927           5975910    lung
964          964  Homo sapiens  tissue_general   UBERON:0000948            1559973           3125211   heart
957          957  Homo sapiens  tissue_general   UBERON:0000310            1555995           2466739  breast
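For sizing, the public cellxgene_census API can count the cells a candidate filter matches before we commit to a full build (the heart-only filter here is illustrative):

import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Count cells matching a single-tissue filter; the cube build would use
    # the same value-filter syntax.
    obs = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="is_primary_data == True and tissue_general == 'heart'",
        column_names=["cell_type", "dataset_id"],
    )
print(f"{len(obs):,} cells match the filter")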
mincheoly commented 1 year ago

The one I used to build the hypothesis testing example was:

OBS_VALUE_FILTER = "is_primary_data == True and (cell_type == 'CD14-positive monocyte' or cell_type == 'dendritic cell') and (dataset_id == '1a2e3350-28a8-4f49-b33c-5b67ceb001f6' or dataset_id == '3faad104-2ab8-4434-816d-474d8d2641db')"
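For what it's worth, TileDB-SOMA value filters also accept in-lists, so an equivalent, more compact form of the same filter would be:

OBS_VALUE_FILTER = (
    "is_primary_data == True"
    " and cell_type in ['CD14-positive monocyte', 'dendritic cell']"
    " and dataset_id in ['1a2e3350-28a8-4f49-b33c-5b67ceb001f6',"
    " '3faad104-2ab8-4434-816d-474d8d2641db']"
)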

mincheoly commented 1 year ago

is there a reason why some feature ids are missing from some groups when the cube is finished?

atolopko-czi commented 1 year ago

This would be expected if there are zero expression values for the group/feature_id pairing. But if you think features are missing where data is known to exist, I can investigate. If so, do you have example values?
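A minimal sketch of handling that zero-filling on the consumer side (cube_df and all_feature_ids are hypothetical names for the estimators cube, indexed by (group, feature_id), and the full feature list):

import pandas as pd

# Reindex against the full (group, feature_id) cross-product so that absent
# pairs become explicit zeros before hypothesis testing.
full_index = pd.MultiIndex.from_product(
    [cube_df.index.get_level_values("group").unique(), all_feature_ids],
    names=["group", "feature_id"],
)
dense_df = cube_df.reindex(full_index, fill_value=0)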


mincheoly commented 1 year ago

ah i see. so any testing algorithm should treat the missing data as a 0. got it!

I don't have a specific example where I know it should exist.

atolopko-czi commented 1 year ago

@pablo-gar Now that @mincheoly has provided a proof-of-concept notebook for use of the pre-computed estimators, are we good to close out this issue? Future improvements (e.g., optimizations) can be tracked under new issues.

pablo-gar commented 1 year ago

Yes, we can certainly consider this complete.