HumanCellAtlas / dcp2

Shared artifacts concerning the Human Cell Atlas (HCA) Data Coordination Platform (DCP)

Analysis subgraphs might be too big for Azul #50

Open hannes-ucsc opened 2 years ago

hannes-ucsc commented 2 years ago

Azul has 15 minutes and 10 GiB of memory to index an analysis subgraph and all of its stitched-on input subgraphs, a.k.a. the subgraph tree; in other words, 15 min and 10 GiB for one subgraph tree. Those limits are imposed by AWS Lambda.
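For context, here is a minimal sketch of where that 15 min / 10 GiB "box" comes from, assuming the indexer runs as a single Lambda function configured at the platform maximums. The function name is hypothetical, not Azul's actual deployment.

```python
import boto3

# Hypothetical sketch only: the 15 min / 10 GiB "box" corresponds to the
# largest timeout and memory size AWS Lambda allows for a single invocation.
lambda_client = boto3.client('lambda')
lambda_client.update_function_configuration(
    FunctionName='azul-indexer',  # hypothetical name, not the real deployment
    Timeout=900,                  # 900 s = 15 min, the Lambda maximum
    MemorySize=10240,             # 10,240 MiB = 10 GiB, the Lambda maximum
)
```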

We hit the 10 GiB memory limit with an existing SS2 analysis subgraph with 17,000 files but were able to work around that by indexing it in "partitions". With partitioning we still load the entire subgraph tree from TDR but then only incorporate one slice of it (the partition) into the index. The partitions are later combined in a separate process, outside of the original 15 min / 10 GiB "box", in another "box" of the same size. We can make the partitions arbitrarily small.

The problem is that we still have to load the entire subgraph tree from TDR before we can determine a partition to process. With partitioning enabled, the loading alone already eats up 10 min of those 15 min, leaving only 5 min to process the partition. And this is where we'll hit the next limit: extrapolating linearly from 10 min for 17,000 files, at roughly 25k files the entire 15 min will be spent loading, leaving no time for the processing. The other problem is that we load many JSON documents redundantly, once per partition, only to discard most of them.
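To make the flow concrete, here is a minimal sketch of partitioned indexing as described above, assuming entities are assigned to partitions by a stable hash of their ID. All names (`load_subgraph_tree`, `contribute`) are hypothetical stand-ins, not Azul's actual code.

```python
import hashlib
from typing import Callable, Iterable


def index_partition(subgraph_id: str,
                    partition: int,
                    num_partitions: int,
                    load_subgraph_tree: Callable[[str], Iterable[str]],
                    contribute: Callable[[str], None]) -> None:
    # The *entire* subgraph tree has to be loaded from TDR before the partition
    # can be determined, so the loading cost is paid once per partition even
    # though most of the loaded documents are then discarded.
    entity_ids = load_subgraph_tree(subgraph_id)
    for entity_id in entity_ids:
        if _assigned_partition(entity_id, num_partitions) == partition:
            contribute(entity_id)
    # The per-partition contributions are combined later, in a separate Lambda
    # invocation outside the original 15 min / 10 GiB box.


def _assigned_partition(entity_id: str, num_partitions: int) -> int:
    # A stable hash, so partition assignment agrees across separate invocations
    # (Python's built-in hash() is randomized per process).
    digest = hashlib.sha256(entity_id.encode()).digest()
    return int.from_bytes(digest[:4], 'big') % num_partitions
```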

The important observation is that no amount of partitioning the analysis into smaller subgraphs, say one per donor, is going to alleviate that problem, because all of those subgraphs would still need to be stitched in by Azul when it indexes the subgraph that combines them into the final loom. The only thing that would help is if we didn't offer a top-level loom, only the donor-level looms, and that would obviously not be useful to our customers.
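A minimal sketch of why stitching defeats per-donor splitting, under the assumption that stitching simply follows input edges transitively; the data structures and names are illustrative, not Azul's actual model.

```python
from typing import Mapping, Set


def stitched_tree(subgraph_id: str,
                  inputs: Mapping[str, Set[str]]) -> Set[str]:
    """Return the IDs of all subgraphs in the tree rooted at `subgraph_id`,
    following input edges transitively (the "stitching")."""
    tree, stack = set(), [subgraph_id]
    while stack:
        current = stack.pop()
        if current not in tree:
            tree.add(current)
            stack.extend(inputs.get(current, ()))
    return tree


# Even if the analysis is split into one subgraph per donor, the subgraph that
# produces the final merged loom lists all of them as inputs, so indexing it
# still pulls the whole project into one subgraph tree.
inputs = {'project-loom': {f'donor-{i}' for i in range(1000)}}
assert len(stitched_tree('project-loom', inputs)) == 1001
```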

hannes-ucsc commented 2 years ago

Quoting @kbergin from an email exchange with Kylee, Trevor, Akum, Jessica, Nikelle and Wes:

Thanks for meeting today to brainstorm. Here is a summary; please respond if I've missed anything or am incorrect.

Summary of situation: Azul is having a difficult time indexing subgraphs that contain a large number of files. Because SS2 projects often have thousands of input samples that all converge into one final output file, the subgraphs get very, very large. So far it can scale to 10,000 files, but not to 20,000. We aren't currently sure where the scale limitations lie, and that leaves us with open questions on the best path forward. A few ideas were proposed:

  1. Stratify further by donor. We considered stratifying by donor as an intermediate step and then producing a final merged loom. We realized this final merged loom would still have the same indexing issues. We could consider delivering looms stratified by donor, but this needs product user research to determine how it impacts users. It would require work on the pipelines team to implement the stratification change, and it would result in a different stratification method than 10X.
  2. Run the largest stratification we have as a scale test to identify where the scaling issues are. It would be ideal to run this in dev, but on 'TDR new prod', so we should wait until that is complete before running this scale test. Note to pipelines team: How much would this test cost in dev? Are cached references used in this pipeline, and would they be used in dev?
  3. Remove the BAM index from delivery to the DCP; this would halve the number of files at that level. It doesn't solve the problem, but it doesn't hurt. A ticket has been created, and Lantern will do that regardless of other testing and scaling work.
  4. After the test in item 2, do work in Azul to figure out how to scale to larger datasets, in parallel with exploring item 1.

Generally, we think this is not urgent enough to drop other work, because none of the current SS2 datasets in the DCP are as large as Tabula Muris, so they should still be indexable.

We can come back to this in the future.

hannes-ucsc commented 2 years ago

@kbergin later on that same thread:

I don't know whether donor-level looms would be completely unusable or problematic for users; we would need to research that. But I do think it's not ideal if we have other solutions.

It sounds like removing the BAM index will help as a first step toward at least getting through the rest of these processed datasets sooner. We will plan on that.