QueueGranules times out with sufficiently large list of granules to queue

chuckwondo commented 1 year ago

When the DiscoverGranules function discovers a sufficiently large number of granules, the QueueGranules function times out while attempting to queue them all. The cause is a highly inefficient implementation of QueueGranules. This is a known problem, and at the time of this writing, there is a PR open in the Cumulus repo to address it.

However, we don't have to wait for that fix to be published, which would also require a Cumulus upgrade. Instead, we can simply convert the QueueGranules lambda function to an ECS task.

For reference, see Example: Replacing AWS Lambda with a Docker container run on ECS. The docs at that link are the general instructions from which the project-specific steps below were derived.

Since we previously had QueueGranules configured as an ECS task, we can restore the following configuration, recovered from the git log, in app/stack/cumulus/main.tf:

resource "aws_sfn_activity" "queue_granules" {
  name = "${var.prefix}-QueueGranules"
}

module "queue_granules_service" {
  source = "https://github.com/nasa/cumulus/releases/download/<%= cumulus_version %>/terraform-aws-cumulus-ecs-service.zip"

  prefix = var.prefix
  name   = "QueueGranules"

  cluster_arn   = module.cumulus.ecs_cluster_arn
  desired_count = 1
  image         = local.ecs_task_image

  cpu                = local.ecs_task_cpu
  memory_reservation = local.ecs_task_memory_reservation

  environment = {
    AWS_DEFAULT_REGION = data.aws_region.current.name
  }
  command = [
    "cumulus-ecs-task",
    "--activityArn",
    aws_sfn_activity.queue_granules.id,
    "--lambdaArn",
    module.cumulus.queue_granules_task.task_arn
  ]
  alarms = {
    MemoryUtilizationHigh = {
      comparison_operator = "GreaterThanThreshold"
      evaluation_periods  = 1
      metric_name         = "MemoryUtilization"
      statistic           = "SampleCount"
      threshold           = 75
    }
  }
}

Further, within the same file, change this line:

    queue_granules_task_arn : module.cumulus.queue_granules_task.task_arn,

to this:

    queue_granules_task_arn : aws_sfn_activity.queue_granules.id,
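After deploying the change above, one way to confirm that the new Step Functions activity exists is to look it up by name. The following is a minimal sketch, not part of the project: the helper names `find_activity_arn` and `list_all_activities` are hypothetical, and it assumes boto3 is installed and AWS credentials are configured for the target account.

```python
from typing import Optional


def find_activity_arn(activities: list, prefix: str) -> Optional[str]:
    """Return the ARN of the '<prefix>-QueueGranules' activity, or None.

    `activities` is a list of dicts with 'name' and 'activityArn' keys,
    the shape returned by the Step Functions ListActivities API.
    """
    wanted = f"{prefix}-QueueGranules"
    for activity in activities:
        if activity["name"] == wanted:
            return activity["activityArn"]
    return None


def list_all_activities() -> list:
    """Collect every Step Functions activity in the current account/region."""
    # Imported lazily so find_activity_arn can be used without AWS access.
    import boto3

    sfn = boto3.client("stepfunctions")
    paginator = sfn.get_paginator("list_activities")
    return [a for page in paginator.paginate() for a in page["activities"]]
```

Calling `find_activity_arn(list_all_activities(), prefix)` with the deployment's prefix should return an ARN once the activity has been created, and `None` otherwise.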

Acceptance Criteria

[x] Make the changes indicated above, deploy to the sandbox, and run a smoke test to confirm successful ingestion. Since the log group for QueueGranules will be different, you'll need to tail the log group named ${CUMULUS_PREFIX}-QueueGranulesEcsLogs. Note that there will be a fair bit of "noise" in the ECS logs due to Cumulus's automatic "heartbeat" messages, which can be ignored.
[x] Submit PR
[x] Once the PR is approved, merge it, await automatic deployment to UAT, then run a smoke test in UAT.
[x] If the smoke test in UAT succeeds, manually approve the prod deployment.
[x] Await prod deployment, and once complete, rerun the PSScene3Band___1_2019_H1_finish_link_updates rule in prod and confirm that it succeeds (its earlier failure was due to this timeout issue, so its success will confirm that the changes above fix the bug). Reference Ticket: https://github.com/NASA-IMPACT/csdap-cumulus/issues/174
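For the log check in the first item above, the following is a rough sketch of fetching recent QueueGranules ECS log messages while dropping the heartbeat noise. It is not project code: the function names are hypothetical, it assumes boto3 and configured AWS credentials, and the default marker text `"heartbeat"` is an assumption about what the noisy lines contain, so adjust it to match the real messages.

```python
import time


def is_noise(message: str, noise_marker: str = "heartbeat") -> bool:
    """Return True when a log message contains the (assumed) heartbeat marker."""
    return noise_marker.lower() in message.lower()


def recent_queue_granules_logs(prefix: str, minutes: int = 15) -> list:
    """Return messages from the last `minutes` minutes, minus heartbeat noise."""
    # Imported lazily so is_noise can be used without AWS access.
    import boto3

    logs = boto3.client("logs")
    # CloudWatch Logs expects timestamps in milliseconds since the epoch.
    start_ms = int((time.time() - minutes * 60) * 1000)
    paginator = logs.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName=f"{prefix}-QueueGranulesEcsLogs",
        startTime=start_ms,
    )
    return [
        event["message"]
        for page in pages
        for event in page["events"]
        if not is_noise(event["message"])
    ]
```

Passing the value of CUMULUS_PREFIX as `prefix` targets the log group named in the checklist. Filtering on the client side keeps the marker easy to change if the heartbeat lines turn out to look different.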

krisstanton commented 1 year ago

Verified that the log group cumulus-uat-QueueGranulesEcsLogs exists. Also verified that the smoke test worked.

Deposited newly overwritten files in S3
Log entry fragment: {"executions":"fdc9e5a1-764f-403f-87b8-2023838d1be8","granules":"[\"PSScene3Band-20171201_031958_0f31\",\"PSScene3Band-20171201_031959_0f31\",\"PSScene3Band-20171201_032000_0f31\"]"

krisstanton commented 1 year ago

This task and https://github.com/NASA-IMPACT/csdap-cumulus/issues/174 are successfully completed.

NASA-IMPACT / csdap-cumulus

QueueGranules times out with sufficiently large list of granules to queue #185
