aws / copilot-cli

The AWS Copilot CLI is a tool for developers to build, release, and operate production-ready containerized applications on AWS App Runner or Amazon ECS on AWS Fargate.
https://aws.github.io/copilot-cli/

Health Check Stopped Working #4781

Closed SoundsSerious closed 1 year ago

SoundsSerious commented 1 year ago

Having worked through my deployment issues months ago, I was quite satisfied with the copilot pipelines until I tried to deploy a minor update yesterday.

I was shocked to see that my health checks were failing. When I looked at the logs, I didn't see any requests coming from the ELB health check system, even though I could reach the instance myself and verify it was listening and working.

The only change I made recently was to add DNSSEC to the domain Copilot was set up on. Could this have caused the issue, or is it something more concerning?

Things I checked (see the CLI sketch after this list):

  1. security groups are allowing traffic through on the NLB port 8080
  2. VPC Flow Logs show that health checks are being accepted on port 8080
  3. the instance logs aren't showing any health check requests, but do show requests I made
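
A minimal sketch of how check 1 can be verified from the CLI; the security group ID is a hypothetical placeholder, not the one from this deployment:

```bash
# Hypothetical group ID; use the security group Copilot attached to the service.
# Lists inbound rules that cover TCP port 8080, the NLB health check port.
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[].IpPermissions[?FromPort<=`8080` && ToPort>=`8080`]'
```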
bvtujo commented 1 year ago

Hmm, this is definitely confusing. Let me check with someone on the team who may have more context.

For the time being, is it possible to turn DNSSEC off for a while and see if the health checks resume?
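
For reference, signing status can be checked and temporarily disabled from the CLI; the hosted zone ID below is a placeholder. Note that if a DS record for the zone exists at the parent/registrar, it should be removed first (and TTLs allowed to expire), or validating resolvers may start failing.

```bash
# Hypothetical hosted zone ID; find yours with `aws route53 list-hosted-zones`.
aws route53 get-dnssec --hosted-zone-id Z0123456789ABCDEFGHIJ

# Temporarily turn DNSSEC signing off to see whether health checks resume:
aws route53 disable-hosted-zone-dnssec --hosted-zone-id Z0123456789ABCDEFGHIJ
```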

SoundsSerious commented 1 year ago

@bvtujo Thanks, that would be great!

I think I have to leave DNSSEC on, as I have been having issues with my domain getting flagged by Google Safe Browsing.

Generally, I find this very strange, since when the system reverts to the previous container, that container passes health checks on the same Copilot-created infrastructure.

A couple of other things I have tried:

  1. pulled the ECR image locally and verified it accepts connections when the network is configured properly (see the sketch after this list)
  2. I ran the Reachability Analyzer, which said the request wasn't permitted due to a security group mismatch, even though the security group allowed traffic on port 8080 to my NLB-backed instance. It did resolve the instance's local IP from the public IP. I then opened up the SG to allow all traffic; the Reachability Analyzer said the instance was reachable, but the health checks still didn't work. (Screenshot: Pasted image 20230421205512)
  3. I removed the service with the CLI and reinstalled it; the same issue persisted.
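
A sketch of the local test from item 1, assuming a hypothetical account/region/repo URI and that the container serves its health endpoint on port 8080:

```bash
# Hypothetical registry/repo/tag; substitute your own ECR image URI.
aws ecr get-login-password --region us-east-1 |
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker run --rm -p 8080:8080 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-svc:latest

# From another shell, hit the same path the target group probes:
curl -i http://localhost:8080/health
```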
SoundsSerious commented 1 year ago

My next step will be to delete the app and see if that works when I recreate it.

If DNSSEC is the problem, I think the best future-facing fix would be to enable it by default with the `--domain` option.

SoundsSerious commented 1 year ago

While running `copilot app delete` I ran into two issues related to DNSSEC. From the CloudFormation stack events:

```
Received response status [FAILED] from custom resource. Message returned: Missing credentials in config, if using AWS_CONFIG_FILE, set AWS_SDK_LOAD_CONFIG=1 (Log: /aws/lambda/smxcore-dev-DNSDelegationFunction-3sMVBFbmeERo/2023/04/22/[$LATEST]2f16d7e24a064b60bc724ae874b6d2c7) (RequestId: 4bcf6945-0c1d-4451-822c-2be47d3e6b05)

The specified hosted zone contains DNSSEC Key Signing Keys and so cannot be deleted. (Service: Route53, Status Code: 400, Request ID: 9eb38603-90e9-4aee-a9cf-42fc066d45ce)
```
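
For anyone who hits the second error: a hosted zone can't be deleted while it still has Key Signing Keys. A rough cleanup sequence, with a hypothetical zone ID and KSK name (and assuming any DS record at the parent has already been removed):

```bash
# `aws route53 get-dnssec` shows the KSK name for the zone.
aws route53 disable-hosted-zone-dnssec --hosted-zone-id Z0123456789ABCDEFGHIJ
aws route53 deactivate-key-signing-key --hosted-zone-id Z0123456789ABCDEFGHIJ --name my-ksk
aws route53 delete-key-signing-key --hosted-zone-id Z0123456789ABCDEFGHIJ --name my-ksk
# After this, `copilot app delete` should be able to remove the hosted zone.
```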
SoundsSerious commented 1 year ago

I wound up deleting the application and rebuilding it from scratch. With that done, I successfully pushed the problematic container.

I am really liking Copilot, but these kinds of issues where I have to redeploy my entire app make me like it less :) I'm not sure how I'll handle the downtime of waiting for CloudFormation to re-establish a Copilot app once I'm in production.

A couple of ideas:

  1. fix CloudFormation :)
  2. treat "apps" as deployable instances timestamped with their own resources, then switch between them with some Route 53 magic. Clean up old apps once the deployment is fixed. Maybe this is called "migrate"? Pipelines might need to distinguish between container changes and app changes for migrate vs. deploy.
SoundsSerious commented 1 year ago

A second app I have on the same domain didn't require recreation to push an update successfully. I noticed that those two had Cloud Map entries in the hosted zone list, whereas my other app didn't. On recreation, the first app added a Cloud Map entry as well.

bvtujo commented 1 year ago

This is a very confusing series of events, and I would love to get to the bottom of it. As far as I understand, you have deployed an LBWS (Load Balanced Web Service) with a network load balancer and HTTP disabled, so there's no ALB.

You have a pipeline running which has happily deployed many commits.

You then enabled DNSSEC on the Copilot hosted zone. This caused the next deployment to start failing health checks from the NLB.

You were able to fix the problem by deleting your whole app, then recreating it. From what you've said, it appears that the first app did not enable Cloud Map service discovery, which is definitely confusing, as we don't provide a way to disable that when creating an environment or service.

What Copilot version were you on when you originally created the problematic app, and what version are you on now?

And are you using the alias functionality for your DNS? You can specify alternative subdomains for your apps other than what Copilot creates for you (svc.env.app.domain.com), which might provide an avenue for switching traffic between different versions of the app. You could also maintain a CNAME record at your app's outward-facing address that points to svc.env.app.domain.com, and switch the record if you need to migrate.
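
For illustration, a manifest excerpt showing the alias idea; the service name and domain are hypothetical, and this assumes the `nlb.alias` field documented for Load Balanced Web Services:

```yaml
# manifest.yml (hypothetical names)
name: edit
type: Load Balanced Web Service

nlb:
  port: 8080/tcp
  # Serve traffic at a stable vanity name in addition to the
  # default edit.<env>.<app>.domain.com record.
  alias: app.example.com
```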

SoundsSerious commented 1 year ago

@bvtujo Thanks for getting back to me on this!

Your synopsis is generally correct. I had been deploying to this app/environment frequently for about two months.

I am not sure what happened to the Cloud Map service discovery. I had a previous app with the same config that I deleted and recreated in order to use the domain functionality. I wonder if that's where the problem originated.

What I find interesting about the health checks/reachability is that the reachability check located and translated the external IP to the local IP, but the security rules that should have allowed the request didn't.

I don't really have good insight into how the internals of AWS networking work, but it seems like DNSSEC created a new instance or instances somewhere in the system, and one or more child networking resources (a security group, for example) didn't get associated.

I am using copilot version v1.25.0 in my pipeline.

I am using the default svc.env.app.domain pattern. I like your ideas for aliases for external systems; we were going to mount the app under our primary domain with something similar.

SoundsSerious commented 1 year ago

@bvtujo Hi, once again I'm seeing this issue pop up out of nowhere on a three-week-old deployment that I had deployed to successfully at one-week intervals. Checking the Copilot logs, once again I see no health check requests being made.

Why is this happening? Shouldn't health checks be able to reach an AWS instance by default from its own load balancer?

The load balancer in question is arn:aws:elasticloadbalancing:us-east-1:753767448166:targetgroup/smxcor-NLBTa-OUGD1EEHLWBP/2db4dab7dd84d19e.
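
For the record, the target group's health check settings and per-target status can be dumped with the ARN above:

```bash
aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:753767448166:targetgroup/smxcor-NLBTa-OUGD1EEHLWBP/2db4dab7dd84d19e \
  --query 'TargetGroups[].[HealthCheckProtocol,HealthCheckPort,HealthCheckPath,HealthCheckIntervalSeconds]'

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:753767448166:targetgroup/smxcor-NLBTa-OUGD1EEHLWBP/2db4dab7dd84d19e
```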

It would be very very very helpful if someone from AWS could look at this.

Update: I deactivated the failing service, and my other ALB is failing as well without any health checks reaching it: `arn:aws:elasticloadbalancing:us-east-1:753767448166:targetgroup/smxcor-Targe-JPBYKWYVFH7K/7603934b88b7eaee`

I can see the previous version in the logs receiving health checks, but the deployed instance is not receiving them. This is a clear indication that something internal at AWS is broken, since the underlying config is the same and managed by ECS. Instance 85e80aca2ded doesn't receive them, while 1a75ac0df8f0 does:

```
copilot/edit/1a75ac0df8f0 16:54:05 [ON][EditSystem->ArangoDB                ] live objects  4407
copilot/edit/1a75ac0df8f0 16:54:05 [ON][EditSystem->ArangoDB                ] num tx 49
copilot/edit/1a75ac0df8f0 16:54:05 [ON][EditSystem->ArangoDB                ] num pend rel 0
copilot/edit/1a75ac0df8f0 16:54:05 [ON][EditSystem->ArangoDB                ] active jobs 52
copilot/edit/1a75ac0df8f0 16:54:05 [JSONWebTokenCredentialFactory           ] updating security keys...
copilot/edit/1a75ac0df8f0 16:54:06 [twisted.python.log                      ] "10.0.1.249" - - [16/May/2023:16:54:05 +0000] "GET /health HTTP/1.1" 200 17 "-" "ELB-HealthChecker/2.0"
copilot/edit/1a75ac0df8f0 16:54:20 [ON][ModelManager->sort                  ] not able to make request True|True|False|0|30.0 vs 12.963716
copilot/edit/1a75ac0df8f0 16:54:30 [UserAccessRealm                         ] 0 Active Users...
copilot/edit/85e80aca2ded 16:54:32 [UserAccessRealm                         ] 0 Active Users...
copilot/edit/1a75ac0df8f0 16:54:33 [twisted.python.log                      ] "10.0.0.14" - - [16/May/2023:16:54:32 +0000] "GET /health HTTP/1.1" 200 17 "-" "ELB-HealthChecker/2.0"
copilot/edit/1a75ac0df8f0 16:54:36 [twisted.python.log                      ] "10.0.1.249" - - [16/May/2023:16:54:35 +0000] "GET /health HTTP/1.1" 200 17 "-" "ELB-HealthChecker/2.0"
copilot/edit/1a75ac0df8f0 16:55:00 [UserAccessRealm                         ] 0 Active Users...
copilot/edit/85e80aca2ded 16:55:02 [UserAccessRealm                         ] 0 Active Users...
copilot/edit/1a75ac0df8f0 16:55:02 [twisted.python.log                      ] "10.0.0.14" - - [16/May/2023:16:55:01 +0000] "GET /health HTTP/1.1" 200 17 "-" "ELB-HealthChecker/2.0"
copilot/edit/1a75ac0df8f0 16:55:04 [ON][EditSystem.EditSystem               ] cur mem: 1.134GB | cpu 60.30%
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->S3                      ] live objects  0
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->S3                      ] num tx 0
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->S3                      ] num pend rel 0
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->S3                      ] active jobs 3
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->ArangoDB                ] live objects  2569
```
SoundsSerious commented 1 year ago

This second issue seemed to be related to a misconfigured health check grace period.
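
For completeness, the grace period lives in the service manifest's health check block. A sketch with example values, assuming the `grace_period` field under `nlb.healthcheck` as described in the Copilot docs (these are not this deployment's actual settings):

```yaml
# Excerpt from manifest.yml; values are illustrative.
nlb:
  port: 8080/tcp
  healthcheck:
    healthy_threshold: 3
    unhealthy_threshold: 2
    interval: 10s
    timeout: 10s
    grace_period: 120s  # time for the container to boot before failed checks count
```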