Closed · joyceyan closed 5 months ago
First, we need a clear picture of how to deal more appropriately with processing large datasets. As suggested by @ebezzi, continuing to increase memory usage is not sustainable as we take in larger and larger datasets. Word is that Census implements a more efficient way to manage large files; we should learn what we can from it to bring our memory requirements down from the stratosphere.
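Census's exact mechanism isn't spelled out here, but the general technique for keeping memory bounded is out-of-core, chunk-at-a-time iteration rather than loading a whole dataset at once. A minimal sketch of that idea (the `chunked`/`process_in_chunks` helpers are hypothetical, not anything from the Census codebase):

```python
from typing import Iterable, Iterator, List

def chunked(rows: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield fixed-size chunks so only one chunk is resident at a time."""
    chunk: List[int] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def process_in_chunks(rows: Iterable[int], chunk_size: int = 1000) -> int:
    """Process a large dataset chunk by chunk; peak memory is O(chunk_size),
    independent of the total dataset size."""
    total = 0
    for chunk in chunked(rows, chunk_size):
        total += sum(chunk)  # stand-in for the real per-chunk transformation
    return total
```

With a reader like this, a 500GB dataset costs the same peak memory as a 5GB one; only wall-clock time scales with size.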
> continuing to increase the memory usage is not sustainable as we take in larger and larger datasets
Per my response on the call yesterday, datasets larger than 50GB are not supported until there is a Data Generation scalability requirement in the Cell Science roadmap, which will result in end-to-end investments across CELLxGENE to address. For example, support for more than ~4M cells in Explorer.
Currently, we have a few extremely large datasets that require large amounts of memory to migrate. These datasets entered the corpus sometime between schema 4.0.0 and schema 5.0.0. To mitigate this for the schema 5 release, we would run the migration job with the same amount of memory as before, then re-run it with more memory to catch the straggling datasets.
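The two-pass approach described above can be sketched as follows. Everything here is illustrative: the memory tiers, the `run_migration` callable, and the orchestration loop are assumptions, not the actual migration tooling (which would map tiers to batch job definitions rather than an in-process limit):

```python
from typing import Callable, Dict, List

# Hypothetical memory tiers (GB) for the two passes.
BASELINE_MEM_GB = 32
HIGH_MEM_GB = 128

def migrate_two_pass(
    dataset_ids: List[str],
    run_migration: Callable[[str, int], bool],
) -> Dict[str, int]:
    """First pass at baseline memory; re-run only the failures with more.

    run_migration(dataset_id, mem_gb) returns True on success, False if the
    job ran out of memory. Returns the memory tier that succeeded per dataset.
    """
    mem_used: Dict[str, int] = {}
    stragglers: List[str] = []
    for ds in dataset_ids:
        if run_migration(ds, BASELINE_MEM_GB):
            mem_used[ds] = BASELINE_MEM_GB
        else:
            stragglers.append(ds)
    # Second pass: only the datasets that failed at baseline memory.
    for ds in stragglers:
        if run_migration(ds, HIGH_MEM_GB):
            mem_used[ds] = HIGH_MEM_GB
    return mem_used
```

The drawback the issue calls out is visible here: someone has to notice the stragglers and trigger the second pass, which is the manual step we want to eliminate.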
This isn't ideal because it adds manual work for the engineer running the migration. A few possible solutions:
Acceptance criteria: Complete any necessary investigation, decide on an approach, and create a ticket to track the implementation work.