chanzuckerberg / single-cell-data-portal

The data portal supporting the submission, exploration, and management of projects and datasets for cellxgene.
MIT License

[Discovery] Optimize memory for schema migrations #6872

Closed: joyceyan closed this issue 5 months ago

joyceyan commented 6 months ago

Currently, we have a few datasets that are extremely large and require large amounts of memory to migrate. These datasets were introduced to the corpus sometime between schema 4.0.0 and schema 5.0.0. For the schema 5 release, the mitigation was to run the migration job with the same amount of memory we previously used, then re-run it with more memory to catch the straggling datasets.

This isn't ideal because it introduces a bit more manual labor for the engineer running the migration. A few solutions:

  1. Permanently bump the memory allocated to the migration job. This isn't ideal since it would permanently increase our AWS costs.
  2. Investigate how to optimize the migration code so it doesn't need to pull the entire dataset into memory (see the sketch after this list).
  3. Investigate whether AWS can be configured to dynamically allocate the amount of memory needed for each individual dataset migration, rather than requesting a fixed amount (a rough sketch follows the acceptance criteria below).
  4. Accept the current state of things where engineers are expected to re-run the migration job with progressively more memory.
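
For option 2, here is a minimal sketch of what out-of-core processing could look like, assuming the datasets are `.h5ad` files and that many migration steps only touch `obs`/`var`/`uns` metadata. The function, the field update, and the chunking heuristic are illustrative, not the portal's actual migration code:

```python
import anndata as ad


def migrate_without_loading_x(in_path: str, out_path: str, chunk_size: int = 100_000) -> None:
    # backed="r" keeps the expression matrix on disk; only obs/var/uns load into memory.
    adata = ad.read_h5ad(in_path, backed="r")

    # Metadata-only migration steps can run without materializing X at all.
    adata.obs["schema_version"] = "5.0.0"  # placeholder for real migration logic

    # If the matrix must be inspected, iterate over row chunks instead of
    # pulling the whole thing into memory with adata.to_memory().
    for start in range(0, adata.n_obs, chunk_size):
        chunk = adata.X[start : start + chunk_size]
        # ... validate or transform the chunk here ...

    # Write the migrated object to a new file rather than overwriting the file
    # that is still open in backed mode. Depending on the anndata version,
    # copying X to the new file may itself need chunking.
    adata.write_h5ad(out_path)
```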

Acceptance criteria: Complete any necessary investigation, decide on an approach to solve this, and create a ticket for actually executing that work.
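
For option 3, one possibility (assuming the migration jobs run on AWS Batch) is to compute each job's memory request from the dataset's size in S3 at submission time. The bucket/key layout, queue and job definition names, and the sizing multiplier below are all illustrative assumptions, not our real configuration:

```python
import boto3

s3 = boto3.client("s3")
batch = boto3.client("batch")


def submit_migration_job(bucket: str, key: str, dataset_id: str) -> str:
    # Look up the dataset's on-disk size and derive a memory request from it.
    size_mib = s3.head_object(Bucket=bucket, Key=key)["ContentLength"] // (1024 * 1024)
    # Rough heuristic: assume peak memory is a small multiple of the file size,
    # with a floor so small datasets still get a sane allocation.
    memory_mib = max(8_192, size_mib * 3)

    response = batch.submit_job(
        jobName=f"schema-migration-{dataset_id}",
        jobQueue="schema-migration-queue",          # assumed queue name
        jobDefinition="schema-migration-job-def",   # assumed job definition
        containerOverrides={
            "resourceRequirements": [
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    )
    return response["jobId"]
```

The same idea would also work with a small number of fixed tiers (e.g. 32 GiB / 64 GiB / 128 GiB job definitions) if fully dynamic sizing turns out to be awkward to operate.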

danieljhegeman commented 6 months ago

First, we need a clear picture of how to deal more appropriately with processing large datasets. As @ebezzi suggested, continuing to increase the memory usage is not sustainable as we take in larger and larger datasets. The word is that Census implements a more efficient way to manage large files. We should learn what we can from them to bring our memory requirements down from the stratosphere.
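
If the mechanism being referred to is Census's TileDB-SOMA backing, the relevant property is that the expression matrix can be streamed in bounded-size batches instead of being loaded whole. A rough sketch of that access pattern, with a placeholder URI and layer names borrowed from Census conventions:

```python
import tiledbsoma


def process(table) -> None:
    # Placeholder: per-chunk migration/validation work would go here.
    print(table.num_rows)


EXPERIMENT_URI = "s3://example-bucket/soma-experiment"  # hypothetical location

with tiledbsoma.Experiment.open(EXPERIMENT_URI) as exp:
    x = exp.ms["RNA"].X["raw"]  # sparse expression matrix kept on disk/S3
    # read() streams the matrix as a sequence of Arrow tables, so peak memory
    # is bounded by the batch size rather than by the dataset size.
    for batch in x.read().tables():
        process(batch)
```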

brianraymor commented 6 months ago

> continuing to increase the memory usage is not sustainable as we take in larger and larger datasets

Per my response on the call yesterday, datasets larger than 50GB are not supported until there is a Data Generation scalability requirement in the Cell Science roadmap, which would drive end-to-end investments across CELLxGENE to address it (for example, support for more than ~4M cells in Explorer).