Closed SoundsSerious closed 1 year ago
Hmm, this is definitely confusing. Let me check with someone on the team who may have more context.
For the time being, is it possible to turn DNSSEC off for a while and see if the health checks resume?
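If it helps, DNSSEC signing on a Route 53 hosted zone can be toggled from the CLI. This is just a sketch; the hosted zone ID below is a placeholder, and if a DS record was published in the parent zone you'd want to remove that first so validating resolvers don't SERVFAIL while signing is off.

```shell
# Check the current DNSSEC signing status and key-signing keys for the zone.
aws route53 get-dnssec --hosted-zone-id Z0123456789EXAMPLE

# Temporarily turn off DNSSEC signing for the hosted zone.
aws route53 disable-hosted-zone-dnssec --hosted-zone-id Z0123456789EXAMPLE
```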
@bvtujo Thanks that would be great!
I think I have to leave DNSSEC on, as I have been having issues with my domain getting flagged by Google Safe Browsing.
Generally, I find this very strange, because when the system reverts to the previous container, that container accepts health checks on the same Copilot-created infrastructure.
A couple of other things I have tried:
My next step will be to delete the app and see if that works when I recreate it.
If DNSSEC is the problem, I think the best future-facing fix would be to enable it by default with the domain option.
While using `copilot app delete`, I ran into two issues related to DNSSEC. From the CloudFormation stack events:
```
Received response status [FAILED] from custom resource. Message returned: Missing credentials in config, if using AWS_CONFIG_FILE, set AWS_SDK_LOAD_CONFIG=1 (Log: /aws/lambda/smxcore-dev-DNSDelegationFunction-3sMVBFbmeERo/2023/04/22/[$LATEST]2f16d7e24a064b60bc724ae874b6d2c7) (RequestId: 4bcf6945-0c1d-4451-822c-2be47d3e6b05)

The specified hosted zone contains DNSSEC Key Signing Keys and so cannot be deleted. (Service: Route53, Status Code: 400, Request ID: 9eb38603-90e9-4aee-a9cf-42fc066d45ce)
```
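For anyone hitting that second error: a hosted zone with key-signing keys can't be deleted until signing is disabled and the KSK is removed. Roughly (the zone ID and the KSK name `ksk-1` are placeholders; check your actual KSK name with `aws route53 get-dnssec`):

```shell
# 1. Turn off DNSSEC signing for the zone.
aws route53 disable-hosted-zone-dnssec --hosted-zone-id Z0123456789EXAMPLE

# 2. Deactivate, then delete, the key-signing key.
aws route53 deactivate-key-signing-key --hosted-zone-id Z0123456789EXAMPLE --name ksk-1
aws route53 delete-key-signing-key --hosted-zone-id Z0123456789EXAMPLE --name ksk-1
```

After that, the hosted zone (and the stack that owns it) should delete cleanly.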
I wound up deleting my application and rebuilding the app. With this, I was able to successfully push the problem container.
I am really liking Copilot, but these kinds of issues where I have to redeploy my entire app make me like it less :) I'm not sure how I'm going to handle the downtime while I wait for CF to reestablish a Copilot app when I get to production.
A couple of ideas:
A second app I have on the same domain didn't require recreation to push an update successfully. I noticed that app had Cloud Map entries on the hosted zone list, whereas my other app didn't. On recreation, the first app added a Cloud Map entry as well.
This is a very confusing series of events, and I would love to get to the bottom of it. As far as I understand, you have deployed an LBWS (Load Balanced Web Service) with a Network Load Balancer and HTTP disabled, so there's no ALB.
You have a pipeline running which has happily deployed many commits.
You then enabled DNSSEC on the Copilot hosted zone. This caused the next deployment to start failing health checks from the NLB.
You were able to fix the problem by deleting your whole app, then recreating it. From what you've said, it appears that the first app did not enable CloudMap service discovery, which is definitely confusing as we don't provide a way to disable that when creating an environment or service.
What Copilot version were you on when you originally created the problematic app, and what version are you on now?
And are you using the alias functionality for your DNS? You can specify alternative subdomains for your apps other than what Copilot creates for you (svc.env.app.domain.com), which might provide an avenue for switching traffic between different versions of the app. You could also maintain a CNAME record at your app's outward-facing address that points to svc.env.app.domain.com, and switch the record if you need to migrate.
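That record switch could be scripted with the Route 53 CLI. A sketch only; the zone ID and record names below are placeholders for your own:

```shell
# UPSERT creates the CNAME if absent, or repoints it if it already exists,
# so the same call works for both the initial cutover and later migrations.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.domain.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "svc.env.app.domain.com"}]
      }
    }]
  }'
```

A short TTL (60s here) keeps the cutover fast at the cost of more resolver traffic.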
@bvtujo Thanks for getting back to me on this!
Your synopsis is generally correct. I had been deploying to this app/environment frequently for about two months.
I am not sure what happened to the Cloud Map service discovery. I had a previous app with the same config; I deleted it and recreated it to use the domain functionality. I wonder if that's where the problem originated.
What I find interesting about the health checks/reachability is that the reachability checks located and translated the external IP to the local IP, but the security rules, which should have allowed the request, didn't.
I don't really have good insight into how the internals of AWS networking work, but it seems like DNSSEC created a new instance or instances somewhere in the system, and one or more child networking resources (a security group, for example) didn't get associated.
I am using copilot version v1.25.0 in my pipeline.
I am using the default svc.env.app.domain pattern. I like your ideas for aliases regarding external systems. We were going to mount the app under our primary domain with something similar.
@bvtujo Hi, once again I'm seeing this issue popping up out of nowhere on a three-week-old deployment that I had deployed to successfully at one-week intervals. Checking the Copilot logs, once again I see no health check requests being made.
Why is this happening? Shouldn't health checks be able to reach an AWS instance by default from its own load balancer?
The target group in question is `arn:aws:elasticloadbalancing:us-east-1:753767448166:targetgroup/smxcor-NLBTa-OUGD1EEHLWBP/2db4dab7dd84d19e`.
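One way to see what the load balancer itself thinks (using the target group ARN above) is to query target health directly:

```shell
# Shows each registered target's state (healthy/unhealthy/initial) and,
# when unhealthy, a reason code such as Target.Timeout or
# Target.FailedHealthChecks that narrows down whether checks are being
# sent and dropped, or never attempted.
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:753767448166:targetgroup/smxcor-NLBTa-OUGD1EEHLWBP/2db4dab7dd84d19e
```

A `Target.Timeout` reason combined with no requests in the app logs would point at a network path (security group / NACL) problem rather than the app itself.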
It would be very very very helpful if someone from AWS could look at this.
Update: I deactivated the failing service and my other ALB is failing as well without any health checks reaching it.
`arn:aws:elasticloadbalancing:us-east-1:753767448166:targetgroup/smxcor-Targe-JPBYKWYVFH7K/7603934b88b7eaee`
I can see the previous version in the logs receiving health checks, but the deployed instance is not receiving them. This is a clear indication that something internal at AWS is broken, since the underlying config is the same and managed by ECS. Instance 85e80aca2ded doesn't receive them, while 1a75ac0df8f0 does:
```
copilot/edit/1a75ac0df8f0 16:54:05 [ON][EditSystem->ArangoDB ] live objects 4407
copilot/edit/1a75ac0df8f0 16:54:05 [ON][EditSystem->ArangoDB ] num tx 49
copilot/edit/1a75ac0df8f0 16:54:05 [ON][EditSystem->ArangoDB ] num pend rel 0
copilot/edit/1a75ac0df8f0 16:54:05 [ON][EditSystem->ArangoDB ] active jobs 52
copilot/edit/1a75ac0df8f0 16:54:05 [JSONWebTokenCredentialFactory ] updating security keys...
copilot/edit/1a75ac0df8f0 16:54:06 [twisted.python.log ] "10.0.1.249" - - [16/May/2023:16:54:05 +0000] "GET /health HTTP/1.1" 200 17 "-" "ELB-HealthChecker/2.0"
copilot/edit/1a75ac0df8f0 16:54:20 [ON][ModelManager->sort ] not able to make request True|True|False|0|30.0 vs 12.963716
copilot/edit/1a75ac0df8f0 16:54:30 [UserAccessRealm ] 0 Active Users...
copilot/edit/85e80aca2ded 16:54:32 [UserAccessRealm ] 0 Active Users...
copilot/edit/1a75ac0df8f0 16:54:33 [twisted.python.log ] "10.0.0.14" - - [16/May/2023:16:54:32 +0000] "GET /health HTTP/1.1" 200 17 "-" "ELB-HealthChecker/2.0"
copilot/edit/1a75ac0df8f0 16:54:36 [twisted.python.log ] "10.0.1.249" - - [16/May/2023:16:54:35 +0000] "GET /health HTTP/1.1" 200 17 "-" "ELB-HealthChecker/2.0"
copilot/edit/1a75ac0df8f0 16:55:00 [UserAccessRealm ] 0 Active Users...
copilot/edit/85e80aca2ded 16:55:02 [UserAccessRealm ] 0 Active Users...
copilot/edit/1a75ac0df8f0 16:55:02 [twisted.python.log ] "10.0.0.14" - - [16/May/2023:16:55:01 +0000] "GET /health HTTP/1.1" 200 17 "-" "ELB-HealthChecker/2.0"
copilot/edit/1a75ac0df8f0 16:55:04 [ON][EditSystem.EditSystem ] cur mem: 1.134GB | cpu 60.30%
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->S3 ] live objects 0
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->S3 ] num tx 0
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->S3 ] num pend rel 0
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->S3 ] active jobs 3
copilot/edit/85e80aca2ded 16:55:05 [ON][EditSystem->ArangoDB ] live objects 2569
```
This second issue seemed to be related to a misconfiguration of the health check grace period.
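For reference, the grace period can be set in the service's Copilot manifest. A sketch of what that might look like for an NLB-fronted service; the field names are from my reading of the Load Balanced Web Service manifest, so verify them against your Copilot version before relying on this:

```yaml
# manifest.yml (Load Balanced Web Service) — illustrative values only.
nlb:
  port: 8080/tcp
  healthcheck:
    # How long ECS ignores failing health checks after a task starts,
    # giving a slow-booting container time to come up.
    grace_period: 120s
```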
Having worked through my deployment issues months ago, I was quite satisfied with the Copilot pipelines until I tried to deploy a minor update yesterday.
I was shocked to see that my health checks were failing, but when I looked at the logs I didn't actually see any requests being made from the ELB health check system, although I could reach the instance and verify it was listening and working.
The only change I made recently was to add DNSSEC to the domain Copilot was set up on. Could this have caused the issue, or is it something more concerning?
Things I checked:
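One more check worth adding to a list like that: whether DNSSEC itself broke resolution of the record. A quick sketch with `dig` (substitute your real service hostname):

```shell
# Request DNSSEC records; a validating resolver answers SERVFAIL if the
# chain of trust is broken (e.g. a stale or missing DS record after
# enabling signing).
dig +dnssec svc.env.app.domain.com

# Compare with validation checking disabled (+cd); if this succeeds while
# the query above fails, DNSSEC validation is the culprit.
dig +cd svc.env.app.domain.com
```

Note that NLB health checks hit targets by IP and don't depend on public DNS, so this distinguishes "clients can't resolve the name" from "health checks can't reach the task".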