Ran for 12 hours and was only 50% complete, so it appears to be ~2x slower with the extra dimensions. Will attempt some profiling & optimizations.
Using 500 samples for the multinomial (instead of 5000), the estimators have been generated in 28 hours on an r6id.24xlarge. However, since the earlier partial run with 5000 samples was projected to finish in a similar time, it's not clear the lower sample count is actually helping.
I also tested a numba-related optimization of the multinomial, but it didn't provide any improvement.
Need to re-profile to confirm the multinomial generation is in fact the performance hotspot.
As of commit 2cc5cfce570270ccfbff2e4cd3194a72e493daa2, profiling on small test fixture data on a Mac M1 shows the performance hotspots are gen_multinomial and compute_all_estimators_for_obs_group, which together consume ~87% of the pass 2 computation time when the multinomial sample count is 5000. Reducing the multinomial sample count from 5000 to 500 reduces gen_multinomial cumulative time from 36% to 17%.
compute_all_estimators_for_obs_group is ~50% in both cases, and its poor performance is due to the Pandas groupby. We may be able to optimize that by using numpy+numba to perform the grouping instead of Pandas, as in the sketch below: https://gist.github.com/flcong/cabff3be5f7d96820d62b7f5e264f779.
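A minimal sketch of that idea, assuming integer group codes (e.g. from pd.factorize) and using a simple per-group mean as a stand-in for the real estimator computation:

```python
import numpy as np
from numba import njit


@njit
def group_means(sorted_values, group_starts):
    # group_starts[g]:group_starts[g + 1] are the rows of group g in sorted order.
    n_groups = len(group_starts) - 1
    out = np.empty(n_groups, dtype=np.float64)
    for g in range(n_groups):
        out[g] = sorted_values[group_starts[g]:group_starts[g + 1]].mean()
    return out


def groupby_mean(group_codes, values):
    # Sort rows by integer group code, then locate the start of each group,
    # avoiding a Pandas groupby entirely.
    order = np.argsort(group_codes, kind="stable")
    sorted_codes = group_codes[order]
    sorted_values = values[order]
    unique_codes, starts = np.unique(sorted_codes, return_index=True)
    group_starts = np.append(starts, len(sorted_codes))
    return unique_codes, group_means(sorted_values, group_starts)


# Example: group codes could come from pd.factorize(obs_df["obs_group"])[0].
codes = np.array([2, 0, 1, 0, 2, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(groupby_mean(codes, vals))  # (array([0, 1, 2]), array([3., 3., 4.]))
```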
Profiling runs compared (profiler snapshots omitted here):
- multinomial w/5000 samples (w/numba)
- multinomial w/500 samples (w/numba)
- multinomial w/5000 samples (without numba)
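Since the gen_multinomial implementation isn't shown in this thread, here is an illustrative, numba-compatible way to draw multinomial samples via the conditional-binomial decomposition; this is an assumption about one possible approach, not the actual memento-cxg code:

```python
import numpy as np
from numba import njit


@njit
def multinomial_sample(n, p):
    # Draw one Multinomial(n, p) sample; p is assumed to sum to 1.
    # Each count is Binomial on the trials and probability mass remaining
    # after the previous categories, so only np.random.binomial is needed.
    k = p.shape[0]
    counts = np.zeros(k, dtype=np.int64)
    remaining_n = n
    remaining_p = 1.0
    for i in range(k - 1):
        if remaining_n == 0:
            break
        counts[i] = np.random.binomial(remaining_n, min(p[i] / remaining_p, 1.0))
        remaining_n -= counts[i]
        remaining_p -= p[i]
    counts[k - 1] = remaining_n
    return counts


# e.g. 500 trials per draw (the reduced sample count discussed above)
p = np.array([0.2, 0.3, 0.5])
draws = np.stack([multinomial_sample(500, p) for _ in range(1000)])
```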
@atolopko-czi Is there a smaller version of the cube (but with all the covariate/metadata info) that I could use as test input for the hypothesis testing part?
@mincheoly We can create a smaller cube by specifying a more constrained OBS_VALUE_FILTER. Perhaps we can use a single tissue? What size cube would you like? I'm happy to perform the run once decided.
```
In [2]: cell_counts_df = c['census_info']['summary_cell_counts'].read().concat().to_pandas()
   ...: cell_counts_df[
   ...:     (cell_counts_df.category == 'tissue_general') & (cell_counts_df.organism == 'Homo sapiens')
   ...: ].sort_values('unique_cell_count', ascending=False).head(5)
Out[2]:
     soma_joinid      organism        category ontology_term_id  unique_cell_count  total_cell_count   label
965          965  Homo sapiens  tissue_general   UBERON:0000955            9309576          16053285   brain
956          956  Homo sapiens  tissue_general   UBERON:0000178            8847169           9732410   blood
986          986  Homo sapiens  tissue_general   UBERON:0002048            2907927           5975910    lung
964          964  Homo sapiens  tissue_general   UBERON:0000948            1559973           3125211   heart
957          957  Homo sapiens  tissue_general   UBERON:0000310            1555995           2466739  breast
```
The one I used to build the hypothesis testing example was:
OBS_VALUE_FILTER = "is_primary_data == True and (cell_type == 'CD14-positive monocyte' or cell_type == 'dendritic cell') and (dataset_id == '1a2e3350-28a8-4f49-b33c-5b67ceb001f6' or dataset_id == '3faad104-2ab8-4434-816d-474d8d2641db')"
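As a sketch of how that filter can be applied (assuming the cellxgene-census Python package; this is not the memento-cxg build code), the matching obs rows can be read directly from the Census:

```python
import cellxgene_census

OBS_VALUE_FILTER = (
    "is_primary_data == True "
    "and (cell_type == 'CD14-positive monocyte' or cell_type == 'dendritic cell') "
    "and (dataset_id == '1a2e3350-28a8-4f49-b33c-5b67ceb001f6' "
    "or dataset_id == '3faad104-2ab8-4434-816d-474d8d2641db')"
)

with cellxgene_census.open_soma() as census:
    obs_df = (
        census["census_data"]["homo_sapiens"]
        .obs.read(value_filter=OBS_VALUE_FILTER, column_names=["cell_type", "dataset_id"])
        .concat()
        .to_pandas()
    )

print(len(obs_df), "cells match the filter")
```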
Is there a reason why some feature ids are missing from some groups when the cube is finished?
This would be expected if there are zero expression values for the group/feature_id pairing. But if you think features are missing where data is known to exist, I can investigate. If so, do you have example values?
Ah, I see. So any testing algorithm should treat the missing data as a 0. Got it!
I don't have a specific example where I know it should exist.
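One way a downstream test can treat the missing (group, feature_id) pairs as zeros is to reindex the estimators against the full cross product; this is a sketch, and the column names are assumptions rather than the actual cube schema:

```python
import pandas as pd


def densify_estimators(cube_df: pd.DataFrame) -> pd.DataFrame:
    # Build every (obs_group, feature_id) combination, then fill absent pairs with 0.
    full_index = pd.MultiIndex.from_product(
        [cube_df["obs_group"].unique(), cube_df["feature_id"].unique()],
        names=["obs_group", "feature_id"],
    )
    return (
        cube_df.set_index(["obs_group", "feature_id"])
        .reindex(full_index, fill_value=0)
        .reset_index()
    )
```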
@pablo-gar Now that @mincheoly has provided a proof-of-concept notebook for use of the pre-computed estimators, are we good to close out this issue? Future improvements (e.g., optimizations) can be tracked under new issues.
Yes, we can certainly consider this complete.
In https://github.com/mincheoly/memento-cxg, use these revised dims to build the cube:
Determine which of these should be TileDB Dims (indexed) vs. Attrs, and the ordering of Dims, based upon selectivity (see the illustrative sketch below).
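As an illustration only (the real schema is defined in memento-cxg, and the dim/attr names below are assumptions), a sparse TileDB array that indexes the most selective columns as Dims and stores the rest as Attrs might look like:

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    # Most selective / most commonly filtered columns first, as indexed Dims.
    tiledb.Dim(name="cell_type_ontology_term_id", domain=(None, None), tile=None, dtype="ascii"),
    tiledb.Dim(name="dataset_id", domain=(None, None), tile=None, dtype="ascii"),
    tiledb.Dim(name="feature_id", domain=(None, None), tile=None, dtype="ascii"),
)

schema = tiledb.ArraySchema(
    domain=dom,
    sparse=True,
    attrs=[
        # Less selective metadata and the estimator values are stored as Attrs.
        tiledb.Attr(name="n_obs", dtype=np.int64),
        tiledb.Attr(name="mean", dtype=np.float64),
        tiledb.Attr(name="sem", dtype=np.float64),
    ],
)

tiledb.Array.create("estimators_cube", schema)
```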