During our attempt to migrate WV03_MSI_L1B (#311) to CBA Prod, ingestion failed after migrating roughly 75% of the entire collection (~1.45M of ~1.9M granules). Generally speaking, the new workflow scalability implementation to allow ingestion of collections in their entirety (rather than year by year), worked well. However, during the course of discovery and queuing of granules, the execution time of the QueueGranules lambda function grew over time, until it eventually reached the 15-minute timeout limit. This occurred after queuing roughly 75% of the granules.
We must do some tuning to prevent this from occurring so that entire collections can be ingested during a single rule execution.
It is difficult to precisely pinpoint the cause of this slow growth in execution time of QueueGranules, but I suspect it is due to the growth of the DB coupled with concurrent access to it.
Here are some factors that may contribute to this:
by default (but configurable), QueueGranules has a concurrency of 3, which likely means that 3 concurrent DB writes may occur when updating the status (to "queued") for 3 concurrently queued granules
compounding the previous point, our workflow executes up to 10 concurrent instances of QueueGranules
each instance of QueueGranules receives up to 1,000 granules at a time
I suggest we do the following to reduce (ideally, eliminate) the possibility of QueueGranules timing out:
set concurrency of QueueGranules to 1, since we are creating our own concurrency by concurrently running up to 10 instances of QueueGranules
During our attempt to migrate WV03_MSI_L1B (#311) to CBA Prod, ingestion failed after migrating roughly 75% of the entire collection (~1.45M of ~1.9M granules). Generally speaking, the new workflow scalability implementation to allow ingestion of collections in their entirety (rather than year by year), worked well. However, during the course of discovery and queuing of granules, the execution time of the QueueGranules lambda function grew over time, until it eventually reached the 15-minute timeout limit. This occurred after queuing roughly 75% of the granules.
We must do some tuning to prevent this from occurring so that entire collections can be ingested during a single rule execution.
It is difficult to precisely pinpoint the cause of this slow growth in execution time of QueueGranules, but I suspect it is due to the growth of the DB coupled with concurrent access to it.
Here are some factors that may contribute to this:
I suggest we do the following to reduce (ideally, eliminate) the possibility of QueueGranules timing out: