Miserlou / Zappa

Serverless Python
https://blog.zappa.io/
MIT License
11.89k stars 1.2k forks source link

'certify' doesn't update existing certificates #590

Closed Miserlou closed 7 years ago

Miserlou commented 7 years ago

Somewhere, a regression was introduced, I think this was from a PR that I was iffy about.

Repro: Certify on a new name

Calling certify for environment prod..
Certifying domain blah.yourname.io..
Setting DNS challenge..
Waiting for DNS to propagate..
Deleting DNS challenge..
Updating domain name!
Certificate updated!

Go to domain, old cert still in production.

Miserlou commented 7 years ago

Possibly related:

https://github.com/Miserlou/Zappa/pull/458 https://github.com/Miserlou/Zappa/issues/334

Could also be a change in APIGW behavior itself.

chromakey commented 7 years ago

To +1 this: it seems that none of my certificates have been auto-renewing. I've been having to do it by manually deleting the custom domain name in APIGW and re-deploying/re-certifying. Here's an example zappa_settings.json that I've been using:

{ "production": { "app_function": "flask_server.app", "domain": "www.domain.com", "lets_encrypt_key": "../keys/account.key", "lets_encrypt_expression": "rate(1 day)", "s3_bucket": "proj-prod-zappa", "profile_name": "lambda-deploy", "timeout_seconds": 90 } }

EDIT: Should clarify that while troubleshooting this, I attempted to re-run the certify command with no luck. Running on zappa 0.32.1

Miserlou commented 7 years ago

I think this is actually related: https://github.com/Miserlou/Zappa/issues/391

kwellman commented 7 years ago

+1

This is happening to me too. I noticed that when I run zappa certify the certificate name in api gateway does get updated (e.g. I see the updated timestamped name in the aws console), but the certificate doesn't get updated when I check in my web browser.

I wonder if it's related to this code in the update_domain_name method of zappa.py:

        update_domain_name_response = self.apigateway_client.update_domain_name(
            domainName=domain_name,
            patchOperations=[
                {
                    'op': 'replace',
                    'path': '/certificateName',
                    'value': new_cert_name,
                }
            ]
        )

Is this code just changing the certificate name? Maybe we need to pass the certificate body, etc... to this method. Something like:

      patchOperations=[
                {
                    'op': 'replace',
                    'path': '/certificateName',
                    'value': new_cert_name,
                }
            {"op": "replace",
                "path": "/certificateBody",
                "value": certificate_body},
            {"op": "replace",
                "path": "/certificatePrivateKey",
                "value": certificate_private_key},
            {"op": "replace",
                "path": "/certificateChain",
                "value": certificate_chain},
        ]

...with better formatting of course. I got this code searching Github so it's likely supported. I don't have time to test this at the moment, but within the next week I could try it unless this issue is resolved or someone else can try it.

kwellman commented 7 years ago

Unfortunately, this is not possible like I had hoped. Only the certificate name can be updated (https://docs.aws.amazon.com/apigateway/api-reference/link-relation/domainname-update/). I suppose this problem was known when originally writing the update_domain_name method for Zappa.

I'm not sure how to get around this programmatically, short of creating a new api gateway domain with the new certificate, which may involve some downtime if we have to delete the current api gateway domain first. Hopefully, there is a better way. I will test with an old version of Zappa to determine when this regression occurred, although I always had problems with re-certifying my domain.

Currently, I'm updating my certificate by using the undocumented Zappa features to output the certificate data, and pasting that data into the aws console as described at http://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-custom-domains.html#how-to-rotate-custom-domain-certificate.

P.S: The Terraform project also encountered this problem (https://github.com/hashicorp/terraform/issues/8789)

chromakey commented 7 years ago

Yeah I'm not sure how to go about this. Just by virtue of the way APIGW/CF works, any time the certificate is changed on the fly, there's going to be downtime while it's pushed out to the edge nodes in CF.

The AWS docs have something on uploading a backup certificate and then rotating it into production gradually to avoid the downtime issue. I don't know what aspect of this are exposed in the API, but maybe this will give us some clues on where to start looking.

http://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-custom-domains.html#how-to-rotate-custom-domain-certificate

EDIT: The example here has you update the certificate NAME by specifying the /certificateName path: http://docs.aws.amazon.com/apigateway/api-reference/link-relation/domainname-update/

I wonder if you could use the same method to update the certificateBody, certificatePrivateKey, and certificateChain parameters by specifying them as paths (i.e. /certificateBody) http://boto3.readthedocs.io/en/latest/reference/services/apigateway.html#APIGateway.Client.create_domain_name

Just guessing blindly. Worth a shot maybe.

kwellman commented 7 years ago

@chromakey

If only it were that simple...I tried that previously but it didn't work. This is the output I got:

Setting DNS challenge..
Waiting for DNS to propagate..
Deleting DNS challenge..
Updating domain name!
An error occurred (BadRequestException) when calling the UpdateDomainName operation: Invalid patch path  '/certifica
teBody' specified for op 'replace'. Must be one of: [/certificateName]
Failed to generate or install certificate! :(
chromakey commented 7 years ago

Ah it was worth a shot. I can't find anything in the documentation that points to the backup certificate functionality being exposed at all. The best we could hope for I think is that maybe it's undocumented and uses parameters like backupCertificateBody or something like that.

Supposedly, AWS has been working on getting ACM to work with API Gateway, but there's no public time table at the moment: https://forums.aws.amazon.com/thread.jspa?threadID=234686

Maybe we should disable the renewal feature for the time being since from what I can tell it doesn't work as intended and it might catch people off guard when their 90 days is up (speaking from experience) or at least add a warning of sorts.

I'll keep looking as time allows, but unfortunately it doesn't look good at the moment.

EDIT: Well there it is in black and white, I just missed it the first time in the APIGW docs - "You cannot rotate custom domain name certificates programmatically."

kwellman commented 7 years ago

Personally I'm okay with renewing the certificate manually through the AWS console. After all, it's something you only have to do after 90 days. Of course, this should be made very clear in the documentation along with detailed instructions.

One possible option to get the certificate updated programmatically would be to generate a new api gateway domain specifically for the new certificate and then switch the route53 domain to use the new api gateway. Eventually, after a sufficient amount of time for everything to propagate we'd delete the old api gateway. It's complicated but I believe this could all be done automatically in code. Just an idea. Of course, a simpler way would be much preferred.

aectann commented 7 years ago

@Miserlou Had to renew all our certificates manually today (4 out of 5 expired). So it looks like it was not working properly for a while?

Miserlou commented 7 years ago

One of the issues was that the documentation was wrong, sorry about that. It's lets_encrypt_expression not lets_encrypt_schedule. There is also a potential race condition that still needs to be addressed.

aectann commented 7 years ago

@Miserlou hmm.. but that would only affect the creation of a cloudwatch trigger, won't it? I did see certificates expired even with the trigger present.

aectann commented 7 years ago

Also, I did try to just run certify locally with no effect on the certificate (expired one wasn't replaced).

Miserlou commented 7 years ago

Yeah, it needs to be chased down. I think there was a PR that introduced this bug although I can't confirm that. I think this is related: https://github.com/Miserlou/Zappa/issues/391

NathanLawrence commented 7 years ago

Welp...this one just came around for me again on a different project. I'm really eager to help out with any solutions on this front. What can we do?

Miserlou commented 7 years ago

First step is diagnosis!

NathanLawrence commented 7 years ago

Roger. I'm going to crack down on this one bigtime, so I should have a PR by week's end.

Miserlou commented 7 years ago

Fantastic news.

NathanLawrence commented 7 years ago

All right, now that I'm actually getting down to business here, I can't find any way to recreate the issues people are concerned about on a fresh project. It looks to me like when certify is called, the certificate is correctly replaced with a new one, and the ANSI date in its same name seems to confirm this.

Apart from how long it takes for new certs to reach edge nodes, which is not ours to change, what are we looking to improve here?

Miserlou commented 7 years ago

How long did it take your update to reach "edge nodes"?

NathanLawrence commented 7 years ago

It hasn't yet, as far as I can tell, and that's a good point -- until it arrives and that expiration date changes, I really won't have any way to tell for certain if we're sending through fresh certs or just changing the name displayed on the AWS UI and nothing else.

Miserlou commented 7 years ago

Bump - did it propagate?

NathanLawrence commented 7 years ago

Looks like it!

Period of Validity Begins on: February 3, 2017 Expires on: May 4, 2017

pjz commented 7 years ago

So now if I have an existing site that didn't update, what do I do?

NathanLawrence commented 7 years ago

Well nothing this second since Amazon S3 is down, but...

I would update your local tools and try running certify again. If that doesn't fix the problem, it may be time for more drastic measures. You can always undeploy, go through the admin interface to make sure everything is gone, then deploy and certify like it's fresh.

Miserlou commented 7 years ago

I went around this in a giant circle today. Head -> Wall -> bang

New theory is we can remove the API Gateway entry for the domain and recreating it but leave the Route53 domain name in tact, updating the record. Expecting this will cause down time between 0 and 40 minutes. >:[

chromakey commented 7 years ago

Yeah given how AWS chose to not expose the backup certificate functionality in the API for APIGW, I don't see any other way. Supposedly they're going to integrate ACM with APIGW at some point, but no one knows if that's tomorrow or 10 years from now.

maxtheman commented 7 years ago

I've been writing a little side project (for work) in flask using zappa, so this is my first go-around running zappa certify, and three weird things happened when I ran it which broke my project.

What I did:

  1. First tried to deploy/certify a totally new environment called prod. Certify ran "successfully" and generating the correct (I think) records in the api gateway for custom domain. It did not create any records in route 53. Running certify for the environment again produced issue referenced in #334.

  2. Attempting to access my index url in prod results in a redirect from {{aws_url}}/prod/ => {{aws_url}}/ and a resulting 403 forbidden is thrown, I'm assuming from the Cloudflare distribution (which doesn't exist in my Cloudflare dashboard, but does in the APIGW, oddly enough)

  3. If I go to one of my APIGW endpoints that should be accessible (say, prod/login) I am served the html, but not any of the static files, including the linked css (although the css shows up in dev tools, it's an empty file).

Note that this is the first update to prod I attempted after upgrading zappa to the most recent version (0.37.2) and the only certify I've done.

Fixes I've attempted that didn't work:

Steps for fix I attempted that did work:

  1. Undeploying, Removing "domain" from zappa_settings.json
  2. Removing the custom domain associated with dev in APIGW
  3. Redeploying immediately fixed (2) and (3)

Let me know if you have questions or trouble reproducing any of this. I need to get it done, so I'll just setup the custom domains by hand for now.

PS -- Sorry if I'm overloading this single issue, happy to split off into a different thread if appropriate.

Miserlou commented 7 years ago

So I just tried my proposed solution up there ^, expecting a between 0 and 40 minute downtime, and it seems like the change is in fact instantaneous. So, it's not as elegant as actually rotating the certificate, but the damage isn't as bad as I imagined. <1 minute * 4 times a year - that's still quite a lot of 9's of uptime.

Miserlou commented 7 years ago

Update: This does NOT work as intended. Gosh dang it.

It works, I was able to see the page at the domain, immediately after calling certify, THEN the distribution goes down, once the new distribution has been created.

So, as it stands, we can either have it work immediately, but then it breaks 40 minutes later, OR we can have it break immediately but it works 40 minutes later. Very annoying!

Obviously the latter is preferred, but is there an opportunity for hacks here? Can we do some kind of domain name dance to ensure uptime?

pjz commented 7 years ago

What's taking the 40 minutes? API Gateway? or route53? Is it possible to leave the old one in place but make a new one and then make the changeover but not nuke the old one for another 40 minutes? Basically: keep them both in existence until the changeover time is complete, so there's never downtime - the site is always served with either the old one or the new one.

Miserlou commented 7 years ago

That'd be ideal, but would require either an extremely long running process, or we set up a scheduled function to do it from AWS, which might not work for everybody's setup.

In the meantime, I've just added a --manual mode mode so you can rotate the cert without any downtime.

pjz commented 7 years ago

This is ideally all happening via the recertify job, right? So it's already a long running process?

kwellman commented 7 years ago

Unless I'm missing something, it seems like something that could be done with no downtime. In the certify handler set up a new API gateway, leave the old one, point the route 53 domain to the new API gateway. The old API gateway could be deleted the next time the certify handler is called.

chromakey commented 7 years ago

Will this work since APIGW doesn't allow you to have two of the same custom domain names in the system at any given time?

kwellman commented 7 years ago

That's a very good point. Also, now that i think about it we would also need to ensure that the API gateway is fully propagated before pointing the route53 domain to it to guarantee no downtime.

chromakey commented 7 years ago

To give a sense of my manual process to work around this for the moment, here's what I've had to do.

  1. Delete the domain via the console from APIGW.
  2. Wait about 20 minutes (if you try to do something before the CloudFront edge nodes are updated, it will throw an error saying the domain already exists)
  3. Re-certify my project using zappa certify production (or equivalent).
  4. Wait another 20 minutes.

CloudFront warns you to expect that domain deletion/creation can take up to 40 minutes each way, but it's been my experience that it's usually in the 20-25 minute range. I have about 15 sites using Zappa at the moment (thankfully none that are SLA sensitive) and this is the method I've used successfully since we discovered the issue.

I'm not sure how we can clear this hurdle without:

  1. AWS exposing certificate rotation for APIGW to the API.
  2. AWS allowing ACM to run with APIGW.
  3. AWS allowing multiple instances of the same custom domain name (to use as a placeholder).
  4. Something else?
Miserlou commented 7 years ago

@chromakey - if your current strategy already involves manual deletion, you can use the new --manual feature to use the console to update the cert to avoid downtime. The updated certify behavior will do a (reordered) version of your steps automatically.

chromakey commented 7 years ago

Sweet. I'll take it for a test drive in the next couple days.

chromakey commented 7 years ago

Haven't looked into this yet, but looks like we may have a path to success

https://aws.amazon.com/about-aws/whats-new/2017/03/amazon-api-gateway-integrates-with-aws-certificate-manager-acm/

Miserlou commented 7 years ago

Yep! Currently open for discussion here: https://github.com/Miserlou/Zappa/issues/710