NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Migrate collection WV03_Pan_L1B to CBA Prod #321

Closed by krisstanton 5 months ago

krisstanton commented 6 months ago

Migrate granules in collection WV03_Pan_L1B to CBA Prod by discovering/ingesting from existing prod account.

Acceptance criteria

To determine how many granules have been processed, first enter the Docker container:

DOTENV=.env.cba-prod make bash

In the container, run the following:

DEBUG=1 cumulus granules list -? collectionId=WV03_Pan_L1B___1 --limit=0 -? status=completed

(Note: due to a Cumulus bug, the status sometimes does not get updated properly. Run the following commands and compare the counts; a scripted version of these checks follows the list.)

DEBUG=1 cumulus granules list -? collectionId=WV03_Pan_L1B___1 --limit=0
DEBUG=1 cumulus granules list -? collectionId=WV03_Pan_L1B___1 --limit=0 -? status=queued
DEBUG=1 cumulus granules list -? collectionId=WV03_Pan_L1B___1 --limit=0 -? status=running
DEBUG=1 cumulus granules list -? collectionId=WV03_Pan_L1B___1 --limit=0 -? status=completed
DEBUG=1 cumulus granules list -? collectionId=WV03_Pan_L1B___1 --limit=0 -? status=failed
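
As a convenience, the per-status checks can be scripted. This is only a sketch: it assumes the DEBUG output contains the "count" field shown in the sample response below, and it searches both stdout and stderr since the dump may go to either stream.

# Sketch: loop over statuses and pull the "count" field out of the DEBUG dump
for status in queued running completed failed; do
  printf '%s: ' "$status"
  DEBUG=1 cumulus granules list -? collectionId=WV03_Pan_L1B___1 --limit=0 -? "status=$status" 2>&1 \
    | grep -o '"count":[0-9]*' | head -n 1
done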

You should see output similar to the following:

...
RESPONSE: {
  statusCode: 200,
  body: '{"meta":{"name":"cumulus-api","stack":"cumulus-prod","table":"granule","limit":0,"page":1,"count":8592},"results":[]}',
  headers: {
    'x-powered-by': 'Express',
    'access-control-allow-origin': '*',
    'strict-transport-security': 'max-age=31536000; includeSubDomains',
    'content-type': 'application/json; charset=utf-8',
    'content-length': '114',
    etag: 'W/"72-O2wUXhu+Q9J1hqdDrb0fcsZeFHo"',
    date: 'Fri, 01 Dec 2023 21:29:19 GMT',
    connection: 'close'
  },
  isBase64Encoded: false
}
[]

In particular, look at the value of body and, within it, locate the value of "count" (8592 in the output above). This count should match the Earthdata Search granule count obtained in the very first step.
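
If you prefer not to eyeball the count, one option is to extract it with jq. This is a sketch that assumes you save just the JSON string value of body (the text between the single quotes above) to a hypothetical file named body.json:

# body.json contains only the JSON string from the "body" field above
jq -r '.meta.count' body.json
# => 8592 for the sample output above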

krisstanton commented 6 months ago

The Earthdata Search granule count before the migration ingest is 2,474,403 granules, covering the inclusive date range 2014-08-13 through 2022-08-01 00:00:00.

chuckwondo commented 5 months ago

Ingestion failed. Of the 3,287 Map Run executions, only 279 succeeded. Unfortunately, it appears that the tuning intended to avoid timeouts in the QueueGranules Lambda function actually made the problem worse.

There were 42,171 granules processed, with the following statuses:

Next steps:

krisstanton commented 5 months ago

Ran the Dead Letter archive recovery operation

cumulus dead-letter-archive recover-cumulus-messages

The number of running granules went from 1,949 to 43, and the number of queued granules went from 540 to 144.

We made some more adjustments to the concurrency in order to hopefully address the timeout issues.

Also kicked off the ingest once again from the start.

chuckwondo commented 5 months ago

The new DiscoverAndQueueGranules execution 1a82609c-7ea2-4320-9875-a5dc61f10f87 failed as well, for the same reason: the QueueGranules step eventually failed due to timeouts too many times. It took 49 hours to get through only 1294 (including 99 failures) of 3287 days, which is only 39%.

Being nearly certain that the DB was the bottleneck, I dug in and found that Cumulus sets the default RDS connection pool's maximum capacity to only 4, which is appropriate only for a sandbox deployment. (This makes sense as a default, since defaulting to a large capacity would easily lead to oversized sandbox deployments.)

Therefore, I configured the Prod deployment to use a max of 384 (the largest possible setting), dialed our concurrency back up, and changed the collection's "duplicateHandling" value from "replace" to "skip" so I could rerun the rule without reingesting the many granules already successfully ingested.
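
For reference, if this maximum capacity corresponds to the Aurora Serverless scaling configuration of the Cumulus RDS cluster (my assumption; the cluster identifier below is a placeholder), the new setting can be verified with the AWS CLI:

# Placeholder cluster identifier; confirm MaxCapacity is now 384
aws rds describe-db-clusters \
  --db-cluster-identifier <prod-rds-cluster-id> \
  --query 'DBClusters[0].ScalingConfigurationInfo'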

I manually redeployed to Prod and reran the rule with great success! Instead of getting through only 40% of discover/queue in 49 hours (with 99 failed QueueGranules), it discovered/queued 100% (3287 days) in only 9.5 hours with no QueueGranules errors! See DiscoverAndQueueGranules execution 21dd0fa8-0905-49b6-8b26-0a7d441636c9.

As of this writing, ~1.5M granules have been ingested and ~1M are in the queue. I estimate that ingestion should complete sometime in the afternoon US Eastern on Wed, Jan. 17.

Further, there are fewer than 500 ingest errors, according to our Athena query, and over 400 of them are due to throttling errors, as there was a point early on where we hit our concurrent Lambda execution quota, but that hasn't happened since. I suspect that once ingestion is complete, we should be able to recover messages from the dead letter archive to address most of these errors, perhaps leaving only a handful requiring manual reingestion.
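
As a quick sanity check on those throttling errors, the account-level limit we bumped into can be read with the AWS CLI. This is a sketch only; it reports the account-wide quota, not any per-function reserved concurrency:

# Account-wide concurrent Lambda execution limit for this region
aws lambda get-account-settings --query 'AccountLimit.ConcurrentExecutions'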

There are a few more adjustments I will make after ingestion completes to address the throttling errors, making it either unlikely for them to recur, or at least to be able to retry after backoff, while also perhaps slightly increasing the ingestion rate from ~35K/hr to perhaps ~50K/hr.

This also means that we can set duplicate handling for WV03_MSI_L1B to "skip" and rerun that ingestion as well, since it failed around the 75% mark, leaving roughly 25% uningested (see #311). Given the fixes here, we should be able to successfully ingest the rest of WV03_MSI_L1B without having to wait for the MCP file "cleanup" stage.

chuckwondo commented 5 months ago

Ingestion completed, with the following granule status counts:

Kicked off dead letter archive message recovery to see how many queued and running granules can be moved to completed (or failed):

cumulus dead-letter-archive recover-cumulus-messages

Output:

{
  "id": "5fa9e69d-b25e-43c2-ad52-b91132871d77",
  "description": "Dead-Letter Processor ECS Run",
  "operationType": "Dead-Letter Processing",
  "status": "RUNNING",
  "taskArn": "arn:aws:ecs:us-west-2:410469285047:task/cumulus-prod-CumulusECSCluster/fdc0074f6c424635989fd2e97a983475",
  "createdAt": 1705525783991,
  "updatedAt": 1705525783991
}
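
The recovery runs asynchronously, so its progress can be checked by polling the async operation by ID (the same subcommand used further below):

# Re-run until "status" is no longer RUNNING
cumulus async-operations get --id 5fa9e69d-b25e-43c2-ad52-b91132871d77
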
chuckwondo commented 5 months ago

Dead letter archive message recovery completed, updating granule status counts as follows:

Total added equals total removed (2368).

There were 82 messages that failed recovery:

$ cumulus async-operations get --id 5fa9e69d-b25e-43c2-ad52-b91132871d77 > 5fa9e69d-b25e-43c2-ad52-b91132871d77.json
$ jq -r '.output | fromjson | .processingFailedKeys[]' <5fa9e69d-b25e-43c2-ad52-b91132871d77.json | wc -l
82

For some reason, jq produces a parsing error when I attempt to pipe the output from the async operation directly to it, so I had to first save the output to a file and then run jq against the file.
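
For follow-up, the same saved file can be used to list the individual keys that failed recovery rather than just counting them (a minor variation on the command above):

# Print each failed key on its own line
jq -r '.output | fromjson | .processingFailedKeys[]' \
  < 5fa9e69d-b25e-43c2-ad52-b91132871d77.json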

Running the Athena query to count errors from the latest execution produces the following results:

| # | error_type | count |
|---|------------|------:|
| 1 | Lambda.TooManyRequestsException | 423 |
| 2 | MissingCmrFile | 19 |
| 3 | TypeError | 5 |
| 4 | NoSuchKey | 2 |

This totals 423 + 19 + 5 + 2 = 449 errors. It's not immediately clear why there are an additional 659 - 449 = 210 granules with a failed status.

However, these are phenomenal results, especially because 423 errors are from briefly exceeding the quota on concurrent Lambda executions. Many (if not all) of them may be resolved by simply rerunning the ingestion. Since duplicates are now set to "skip" for this collection, the time to rerun should be on the order of 12 hours, rather than 2 1/2 days, since most of the 2 1/2 days is taken by the ingestion half, not the discovery half.

I have submitted a PR with a few more adjustments to concurrency and throttling that should avoid hitting the Lambda quota, and should also handle retries a bit more robustly. Kris has approved the PR, so I will merge and run a smoke test in UAT before deploying to Prod. Once deployed, I will kick off another "skip" ingest to see how many of the Lambda errors above can be corrected.

Regarding the TypeErrors, all 5 of them occurred at the same location within the Cumulus "MoveGranules" step:

TypeError: Function has non-object prototype 'undefined' in instanceof check,
     at Function.[Symbol.hasInstance] (<anonymous>),
     at Object.waitForObjectToExist (/var/task/webpack:/src/S3.ts:228:14),
     at runNextTicks (node:internal/process/task_queues:61:5),
     at listOnTimeout (node:internal/timers:528:9),
     at processTimers (node:internal/timers:502:7),
     at moveFileRequest (/var/task/webpack:/index.js:163:3),
     at async Promise.all (index 0),
     at /var/task/webpack:/index.js:222:24,
     at async Promise.all (index 0),
     at moveFilesForAllGranules (/var/task/webpack:/index.js:234:3)]

It turns out that these are due to timeouts while waiting for S3 objects to exist (likely after they've been moved); however, due to a Cumulus bug (CUMULUS-3559), which I just reported, they show up as TypeErrors instead of TimeoutErrors.

Regarding the NoSuchKey errors, they are all due to files missing from the file-staging area.

Upon further investigation, there are actually 6 NoSuchKey errors and 6 TypeErrors. Given that the TypeErrors are due to timeouts waiting for objects to exist and a NoSuchKey error means that an object doesn't exist, it seems reasonable to conclude that each timeout error (TypeError) corresponds to a NoSuchKey error. However, it's unclear why the Athena query doesn't show 6 of each type of error.

Of course, due to the bug I reported, we cannot definitively confirm the relationship between the two groups of errors, because the TypeError messages do not include the names of the objects in question. Once the bug is fixed and TimeoutErrors are correctly reported, their error messages will include the names of the objects being waited on.

Regardless, rerunning the ingestion should ideally clear all of the errors above, excluding the MissingCmrFile errors.

Regarding the MissingCmrFile errors, here are the relevant granule IDs:

WV03_20191021024156_1040010052825000_19OCT21024156-P1BS-504111264070_01_P008
WV03_20190410180929_104001004B824300_19APR10180929-P1BS-503171167070_01_P009
WV03_20210212073136_104001006646DA00_21FEB12073136-P1BS-505385700020_01_P005
WV03_20171022104513_10400100346DAC00_17OCT22104513-P1BS-501682839070_01_P002
WV03_20181124125834_1040010045CE4100_18NOV24125834-P1BS-502828286100_01_P007
WV03_20210413112234_1040010068C2D700_21APR13112234-P1BS-505335010030_01_P008
WV03_20171007180514_1040010033283B00_17OCT07180514-P1BS-505549406040_01_P002
WV03_20210509160906_10400100689A5800_21MAY09160906-P1BS-505334993010_01_P003
WV03_20190101113507_1040010043390D00_19JAN01113507-P1BS-504321731040_01_P002
WV03_20180721030913_104001003F2B6900_18JUL21030913-P1BS-503288130050_01_P009
WV03_20191020075611_1040010054194E00_19OCT20075611-P1BS-505817655080_01_P010
WV03_20190430230936_1040010049901400_19APR30230936-P1BS-505721104050_01_P010
WV03_20170726183108_104001003001E200_17JUL26183108-P1BS-504281958060_01_P002
WV03_20210823160302_104001006D4C9000_21AUG23160302-P1BS-505738595030_01_P005
WV03_20171221183800_1040010037684D00_17DEC21183800-P1BS-501968395090_01_P006
WV03_20210310170306_1040010068A10C00_21MAR10170306-P1BS-505291781050_01_P003
WV03_20220711105137_104001007913C900_22JUL11105137-P1BS-506699768080_01_P004
WV03_20190410075904_104001004AAA0900_19APR10075904-P1BS-503057490050_01_P003
WV03_20181113182125_10400100443AEC00_18NOV13182125-P1BS-502892317040_01_P003

I will rerun the ingestion to attempt to correct all errors other than the MissingCmrFile errors.

chuckwondo commented 5 months ago

I kicked off reingest again: DiscoverAndQueueGranules execution 66d2644f-8d35-4aa9-a1e0-950cbc827289

chuckwondo commented 5 months ago

The reingest took only 3h10m but didn't make any difference in the status counts. Unfortunately, it appears that when Cumulus skips duplicates, it skips them regardless of their status; I had assumed it would skip only existing granules with a status of completed. It would be nice if there were a different duplicate handling option, perhaps named replace-incomplete or skip-completed.

Given that the reingest made no difference in the counts, I used the Cumulus CLI to reingest granules individually, first by obtaining the list of granuleIds for granules with statuses of queued, running, or failed:

cumulus granules list -? collectionId=WV03_Pan_L1B___1 -? status__in=queued,running,failed --fields granuleId --all > WV03_Pan_L1B___1-reingest.txt

Then I used the output file as input to reingest each one:

jq -r .[].granuleId < WV03_Pan_L1B___1-reingest.txt | xargs -L1 cumulus granules reingest --collection-id WV03_Pan_L1B___1 --granule-id

This managed to fix some of the statuses; the counts are now as follows:

- completed: 2527514 (added 253)
- failed: 14 (removed 645)
- queued: 568 (added 438)
- running: 0 (removed 46)

The failed count is entirely due to the missing CMR files, and the increased queued count is at least partially (if not fully) due to the bug CUMULUS-3557 that I recently reported.

This means that we will need to take a lower-level approach to clean those up. Specifically, we'll need to use the AWS CLI to pull the input payloads for the failed executions, populate the missing provider details (due to the bug linked above), and run new executions with those payloads.
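
As a rough sketch of that lower-level approach (the ARNs below are placeholders, and the exact payload edits depend on the CUMULUS-3557 fix), the AWS CLI can pull a failed execution's input, which can then be patched and replayed:

# 1. Pull the original input payload of a failed execution (placeholder ARN)
aws stepfunctions describe-execution \
  --execution-arn <failed-execution-arn> \
  --query 'input' --output text > input.json

# 2. Edit input.json to add the missing provider details, then start a new
#    execution against the ingest state machine (placeholder ARN)
aws stepfunctions start-execution \
  --state-machine-arn <ingest-state-machine-arn> \
  --input file://input.json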

chuckwondo commented 5 months ago

Marking complete, as there are <1000 failures (<0.04%), which will be cleaned up later.

krisstanton commented 5 months ago

This is great. Thanks for putting the process of reingesting a single granule (or a group of individual granules) in one place like this. Bookmarking!