NASA-IMPACT / csdap-cumulus

SmallSat Cumulus Deployment

Migrate collection WV02_MSI_L1B to CBA Prod #323

Closed krisstanton closed 8 months ago

krisstanton commented 9 months ago

Migrate the granules in collection WV02_MSI_L1B to CBA Prod by discovering and ingesting them from the existing prod account.

Acceptance criteria

To determine how many granules have been processed, first enter the Docker container:

DOTENV=.env.cba-prod make bash

In the container, run the following:

DEBUG=1 cumulus granules list -? collectionId=WV02_MSI_L1B___1 --limit=0 -? status=completed

(Note: due to a Cumulus bug, the status sometimes does not get updated properly. If the numbers don't add up, run the following commands and compare the per-status counts against the unfiltered total:)

DEBUG=1 cumulus granules list -? collectionId=WV02_MSI_L1B___1 --limit=0
DEBUG=1 cumulus granules list -? collectionId=WV02_MSI_L1B___1 --limit=0 -? status=queued
DEBUG=1 cumulus granules list -? collectionId=WV02_MSI_L1B___1 --limit=0 -? status=running
DEBUG=1 cumulus granules list -? collectionId=WV02_MSI_L1B___1 --limit=0 -? status=completed
DEBUG=1 cumulus granules list -? collectionId=WV02_MSI_L1B___1 --limit=0 -? status=failed

You should see output similar to the following:

...
RESPONSE: {
  statusCode: 200,
  body: '{"meta":{"name":"cumulus-api","stack":"cumulus-prod","table":"granule","limit":0,"page":1,"count":8592},"results":[]}',
  headers: {
    'x-powered-by': 'Express',
    'access-control-allow-origin': '*',
    'strict-transport-security': 'max-age=31536000; includeSubDomains',
    'content-type': 'application/json; charset=utf-8',
    'content-length': '114',
    etag: 'W/"72-O2wUXhu+Q9J1hqdDrb0fcsZeFHo"',
    date: 'Fri, 01 Dec 2023 21:29:19 GMT',
    connection: 'close'
  },
  isBase64Encoded: false
}
[]

In particular, look at the value of body and, within it, locate the value of "count". In the output above, the count is 8592; it should match the Earthdata Search granule count obtained in the very first step.
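Rather than eyeballing the JSON, the count can be pulled out of the body string with POSIX sed and shell arithmetic. This is a hypothetical helper: the body string below is the sample from above, and the per-status values are illustrative; substitute the strings from your own CLI output.

```shell
# Sample body string copied from the RESPONSE shown above; in practice,
# paste the body from your own CLI output
body='{"meta":{"name":"cumulus-api","stack":"cumulus-prod","table":"granule","limit":0,"page":1,"count":8592},"results":[]}'

# Extract the numeric "count" field
total=$(printf '%s' "$body" | sed -n 's/.*"count":\([0-9]*\).*/\1/p')
echo "$total"   # 8592

# Per-status counts (hypothetical values, extracted the same way from each
# per-status command's output) should sum to the unfiltered total
queued=0; running=0; completed=8592; failed=0
[ $((queued + running + completed + failed)) -eq "$total" ] && echo "counts match"
```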

krisstanton commented 8 months ago

The pre-migration granule count from Earthdata Search is 3,086,582 granules, over the inclusive date range 2009-01-01T00:00:00Z through 2023-01-01T00:00:00Z.

krisstanton commented 8 months ago

Ingest is running (started around 11:00 am Central on Jan 18, 2024).
State Machine: cumulus-prod-DiscoverAndQueueGranules
Execution: 8eb32630-c25a-4d55-8aab-bb1f24846205
So far, 339 of 5113 executions have succeeded, with 0 errors.

Currently waiting for some of the successful granules to finish ingesting so they can be verified.

chuckwondo commented 8 months ago

Things are chugging along very quickly and smoothly in the discovery/queue phase, which looks like it should finish in under 6 hours total!

The ingest/publish phase is also running smoothly, but it appears to be over-throttled, capped at about 18K granules/hour, roughly half of what we achieved with WV03_MSI_L1B. In my effort to avoid hitting the AWS quota on concurrent Lambda executions, I made an adjustment that is causing this reduced rate, because I missed changing a corresponding config setting at the same time.

Unfortunately, this means that ingesting/publishing ~3M granules will take ~7 days, when we'd like to see it take only ~3 days (i.e., we'd like to achieve a rate of >1M/day or >41.7K/hr, ideally more like ~60K/hr or ~1K/min).
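The back-of-the-envelope math behind these estimates (shell integer arithmetic, rounded down):

```shell
granules=3000000   # ~3M granules to ingest/publish
rate=18000         # observed throttled rate, granules/hour

echo "hours at current rate: $((granules / rate))"      # ~166 h, i.e. ~7 days
echo "rate needed for 3 days: $((granules / (3 * 24)))" # ~41.7K granules/hour
```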

I'm going to open a new issue to address this, which we can try with our next ingestion (in about a week, ugh!).

chuckwondo commented 8 months ago

Ingestion completed, with the following results:

It is fantastic to see no granules left in either queued or running status.

Further, the failures are broken down as follows:

   2905 Error
    114 MissingCmrFile
      2 NoSuchKey
      6 TypeError

The generic "Error" failures all seem to be CMR errors:

{
  "errorType": "Error",
  "errorMessage": "Failed to ingest, statusCode: 401, statusMessage: Unauthorized, CMR error message: [\"You do not have permission to perform that action.\"]",
  "trace": [
    "Error: Failed to ingest, statusCode: 401, statusMessage: Unauthorized, CMR error message: [\"You do not have permission to perform that action.\"]",
    "    at CMR.ingestUMMGranule (/var/task/webpack:/src/CMR.ts:259:13)",
    "    at runMicrotasks (<anonymous>)",
    "    at processTicksAndRejections (node:internal/process/task_queues:96:5)",
    "    at publishUMMGJSON2CMR (/var/task/webpack:/src/cmr-utils.js:184:15)",
    "    at publish2CMR (/var/task/webpack:/src/cmr-utils.js:230:12)",
    "    at async Promise.all (index 0)",
    "    at postToCMR (/var/task/webpack:/index.js:131:19)",
    "    at Object.runCumulusTask (/var/task/webpack:/node_modules/@cumulus/cumulus-message-adapter-js/dist/cma.js:221:1)",
    "    at Runtime.handler (/var/task/webpack:/index.js:159:10)"
  ]
}

I ran the following command in an attempt to rectify as many failures as possible:

cumulus dead-letter-archive recover-cumulus-messages

The command output the following:

{
  "id": "461e38a5-4a87-4bff-ad50-8305f4233397",
  "description": "Dead-Letter Processor ECS Run",
  "operationType": "Dead-Letter Processing",
  "status": "RUNNING",
  "taskArn": "arn:aws:ecs:us-west-2:410469285047:task/cumulus-prod-CumulusECSCluster/dda8064e1d144c64b32de883d2b7a2da",
  "createdAt": 1706230922687,
  "updatedAt": 1706230922687
}

The async operation succeeded, but it did not affect the failure count.

Therefore, I reingested the failures using the same approach described in https://github.com/NASA-IMPACT/csdap-cumulus/issues/321#issuecomment-1898760714

I will report final results once all reingestions complete.

chuckwondo commented 8 months ago

The granule status counts are now as follows:

Therefore, of the original 3027 failures, 821 were completed, 2145 became stuck in queued, and 61 remain failed.
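A quick arithmetic check confirms these figures account for all of the original failures:

```shell
# Original failure total: 2905 Error + 114 MissingCmrFile + 2 NoSuchKey + 6 TypeError
echo $((2905 + 114 + 2 + 6))   # 3027
# After reingest: completed + stuck-in-queued + still-failed
echo $((821 + 2145 + 61))      # 3027
```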

Summary of failed statuses:

$ 2>/dev/null cumulus granules list --all -? collectionId=WV02_MSI_L1B___1 -? status=failed > WV02_MSI_L1B-failed.json
$ <WV02_MSI_L1B-failed.json jq -r '.[].error.errors | fromjson | .[0].error' | sort | uniq -c
     61 MissingCmrFile

Summary of queued statuses:

$ 2>/dev/null cumulus granules list --all -? collectionId=WV02_MSI_L1B___1 -? status=queued > WV02_MSI_L1B-queued.json
$ <WV02_MSI_L1B-queued.json jq -r '.[].error.errors | fromjson | .[0].error' | sort | uniq -c
   2091 Error
     53 MissingCmrFile
      1 NoSuchKey

This gives us a total of 114 granules with missing CMR files (61 failed + 53 queued). The 2091 generic "Error" errors all appear to be CMR permission errors, for some odd reason:

Error: Failed to ingest, statusCode: 401, statusMessage: Unauthorized, CMR error message: ["You do not have permission to perform that action."]
    at CMR.ingestUMMGranule (/var/task/webpack:/src/CMR.ts:259:13)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at publishUMMGJSON2CMR (/var/task/webpack:/src/cmr-utils.js:184:15)
    at publish2CMR (/var/task/webpack:/src/cmr-utils.js:230:12)
    at async Promise.all (index 0)
    at postToCMR (/var/task/webpack:/index.js:131:19)
    at Object.runCumulusTask (/var/task/webpack:/node_modules/@cumulus/cumulus-message-adapter-js/dist/cma.js:221:1)
    at Runtime.handler (/var/task/webpack:/index.js:159:10)
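For the record, these tallies are consistent with the status counts reported above:

```shell
# Queued breakdown: 2091 Error + 53 MissingCmrFile + 1 NoSuchKey
echo $((2091 + 53 + 1))   # 2145, matching the granules stuck in queued
# MissingCmrFile across failed (61) and queued (53)
echo $((61 + 53))         # 114, the total cited above
```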