chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License
72 stars 19 forks source link

[python] Shuffle multiple SOMA chunks #1103

Closed atolopko-czi closed 1 month ago

atolopko-czi commented 3 months ago

Adds a shuffle_chunk_count parameter.

Improves randomness of shuffling, while allowing for explicit tuning of memory usage vs I/O performance.

codecov[bot] commented 3 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 91.15%. Comparing base (c18c1a9) to head (e3dd3b1). Report is 8 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1103 +/- ## ========================================== + Coverage 91.12% 91.15% +0.02% ========================================== Files 77 77 Lines 5902 5922 +20 ========================================== + Hits 5378 5398 +20 Misses 524 524 ``` | [Flag](https://app.codecov.io/gh/chanzuckerberg/cellxgene-census/pull/1103/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/chanzuckerberg/cellxgene-census/pull/1103/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg) | `91.15% <100.00%> (+0.02%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

prathapsridharan commented 2 months ago

For informational purposes, a description of a general algorithm is captured in this github issue:

https://github.com/chanzuckerberg/cellxgene-census/issues/1146

prathapsridharan commented 1 month ago

Per a synchronous conversation with @ebezzi we decided to scrap the test as it needs more thought. We will make a ticket to write a better test for it but for the sake of expediency, we want to get this merged and get some of our first users to use it. Anecdotal evidence suggests that the scatter-gather-shuffle algorithm is performant and gives good randomness.