NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Enhance DiscoverAndQueueGranules workflow to allow unlimited scalability #274

Closed · chuckwondo closed this issue 11 months ago

chuckwondo commented 11 months ago

Currently, the DiscoverAndQueueGranules workflow is far more scalable than the out-of-the-box workflow provided by the core Cumulus examples. Out of the box, S3 discovery collapses at around 500K files (regardless of the number of files per granule), depending on the Lambda configuration or on whether an ECS task is used in place of a Lambda function.

With the current "auto chunking" looping logic in the workflow, the number of files that can be discovered would be unlimited, were it not for the AWS limit of 25,000 events in the history of a single Step Functions execution. By very rough calculations, this allows us to ingest a span of about 2.5 years of granules per execution. However, since constructing Cumulus rules to span 2.5 years is a bit cumbersome and unintuitive, we currently construct 1 rule per year for each collection.
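
For illustration, here is a minimal Amazon States Language sketch of the looping pattern described above; the state names and the `batchesRemaining` field are hypothetical stand-ins, not the actual names used in this repo. The point is that every pass through the loop appends several events (state entered, task scheduled, task succeeded, and so on) to the single execution's history, which is what eventually hits the 25,000-event cap:

```json
{
  "Comment": "Simplified sketch of the looping pattern; state and field names are illustrative",
  "StartAt": "DiscoverGranules",
  "States": {
    "DiscoverGranules": {
      "Type": "Task",
      "Resource": "${discover_granules_task_arn}",
      "Next": "QueueGranules"
    },
    "QueueGranules": {
      "Type": "Task",
      "Resource": "${queue_granules_task_arn}",
      "Next": "MoreGranulesToDiscover"
    },
    "MoreGranulesToDiscover": {
      "Comment": "Loop back while discovery reports remaining batches (hypothetical field)",
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.payload.batchesRemaining",
          "NumericGreaterThan": 0,
          "Next": "DiscoverGranules"
        }
      ],
      "Default": "Done"
    },
    "Done": { "Type": "Succeed" }
  }
}
```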

The ideal situation (while still leveraging the existing S3 discovery capabilities) would be to create 1 rule per collection, spanning the entirety of the collection's temporal range, regardless of how many files that includes. This was the original goal of the "auto chunking" looping workflow, until we hit the 25K event limit on Step Functions executions.
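
As a rough sketch of that ideal, a single Cumulus rule per collection might look like the following. The collection, provider, and the `meta` field names for the temporal range are assumptions for illustration only, not this repo's actual schema:

```json
{
  "name": "discover_MyCollection_full_range",
  "workflow": "DiscoverAndQueueGranules",
  "provider": "my-provider",
  "collection": { "name": "MyCollection", "version": "1" },
  "rule": { "type": "onetime" },
  "meta": {
    "discoverStartDate": "2015-01-01T00:00:00Z",
    "discoverEndDate": "2023-12-31T23:59:59Z"
  }
}
```

One such rule would replace the current stack of per-year rules for the collection, with no need to reason about how many files the range contains.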

More recently, I discovered that "Map" states within Step Functions support a "distributed" mode, in which each "iteration" of the Map state runs as a separate child execution and thus does not contribute to the event count of the main workflow. This means that we can replace the looping logic with a distributed Map state, and thus avoid getting anywhere close to the 25K event limit in any individual workflow or Map iteration.
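
For reference, a Map state is switched into distributed mode via its ItemProcessor's ProcessorConfig. The ProcessorConfig syntax below is standard Amazon States Language, but the state names, item source, and concurrency value are illustrative, not taken from this repo:

```json
{
  "BatchAndQueueGranules": {
    "Type": "Map",
    "Comment": "Each item in $.payload.batches (hypothetical path) is processed by its own child execution",
    "ItemsPath": "$.payload.batches",
    "MaxConcurrency": 10,
    "Label": "BatchAndQueueGranules",
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "STANDARD"
      },
      "StartAt": "QueueGranulesBatch",
      "States": {
        "QueueGranulesBatch": {
          "Type": "Task",
          "Resource": "${queue_granules_task_arn}",
          "End": true
        }
      }
    },
    "End": true
  }
}
```

Each iteration becomes its own child execution with its own 25K event history, so the parent workflow's history records only the start and completion of the map run rather than every per-batch event.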

chuckwondo commented 11 months ago

Related PR: #278

chuckwondo commented 11 months ago

Fixed by #278