aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

(certificatemanager): DnsValidatedCertificate timeout while waiting for certificate approval #2914

Closed KnisterPeter closed 1 year ago

KnisterPeter commented 5 years ago

Describe the bug: Creating certificates via Certificate Manager with Route 53 DNS validation fails with a timeout. Error message:

Failed to create resource. Resource is not in the state certificateValidated

Expected behavior: The Lambda waiting for the approval should probably wait longer than the currently hardcoded 5 minutes.

Version:

NGL321 commented 5 years ago

Could you give some more steps to how you got to the error message?

KnisterPeter commented 5 years ago

Sure, I've used this code fragment:

    new certificatemanager.DnsValidatedCertificate(this, 'id', {
      domainName: 'some-name',
      hostedZone: zone
    })

And during cdk deploy the above error was thrown after some time. When I then looked into the Certificate Manager console, I saw that the requested certificate was indeed still pending validation.

Therefore I think it's a timing issue: the Lambda code for the DNS validation contains a wait of 5 minutes. If I'm right, this may be a bit too short.

https://github.com/awslabs/aws-cdk/blob/master/packages/%40aws-cdk/aws-certificatemanager/lambda-packages/dns_validated_certificate_handler/lib/index.js#L142

RomainMuller commented 5 years ago

The runtime for the whole execution may not exceed 15 minutes. The function currently waits up to 5 minutes for the DNS record to commit, then up to 5 minutes for the ACM validation to happen. That does not leave much margin.
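
As a rough illustration of the timing budget described here, the arithmetic can be sketched as follows. The helper name `remainingMarginMinutes` is hypothetical and not part of the CDK; the numbers are the ones quoted in this thread.

```typescript
// Hypothetical helper: minutes left of the Lambda time limit after
// a series of sequential waits.
function remainingMarginMinutes(limitMinutes: number, waitsMinutes: number[]): number {
  const totalWait = waitsMinutes.reduce((sum, w) => sum + w, 0);
  return limitMinutes - totalWait;
}

// 15-minute Lambda limit, up to 5 minutes for the DNS record change,
// up to 5 minutes for the ACM validation.
const margin = remainingMarginMinutes(15, [5, 5]);
console.log(`margin: ${margin} minutes`); // prints "margin: 5 minutes"
```

A negative result means the waits cannot fit into a single invocation at all, which is the case as soon as validation takes longer than about 10 minutes.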

KnisterPeter commented 5 years ago

@RomainMuller Thanks, that will probably help in a lot of situations. Unfortunately, Certificate Manager says that approving pending certificate requests can take 30 minutes or longer, so there is still a lot of room to fail. But I think this will help a lot.

hagihala commented 4 years ago

Lately, certificate validation often takes more than 10 minutes. In the worst case it took about 42 minutes, as far as I tested. It would be better if the waiter params could be specified in DnsValidatedCertificateProps.

John-Cass commented 4 years ago

Still a problem. Requested at 2020-01-16T10:33:04UTC, issued at 2020-01-16T10:46:21UTC. Can the delay duration be a variable so we can specify a value?

miaekim commented 4 years ago

@RomainMuller Can we increase the validation timeout value? When I tried to write the DNS record manually in the AWS Console, I got the following message.

The DNS record was written to your Route 53 hosted zone. It can take 30 minutes or longer for the changes to propagate and for AWS to validate the domain and issue the certificate.

ACM uses 72 hours as their validation timeout.

If ACM is not able to validate the domain name within 72 hours from the time it generates a CNAME value for you, ACM changes the certificate status to Validation timed out.

geofflittle commented 4 years ago

@RomainMuller I'm currently running into this issue / this should remain open.

Birowsky commented 4 years ago

Why is this closed? What's the consensus for solving this?

lemiesz commented 4 years ago

I don't understand how increasing the wait time to 9 minutes was a valid solution. That does not solve the problem at all.

davidsteed commented 4 years ago

This is still a problem. A better error message would help, something like "certificate request pending, please re-run once complete".

ronnypmuliawan commented 4 years ago

Any workaround for this? I was able to successfully deploy my CDK to a few customers before hitting this error today.

opentrail commented 4 years ago

I have this error too. Still not fixed in CDK 1.55.0

aleskozina commented 4 years ago

Hey guys, I found an anomaly:

    new DnsValidatedCertificate(this, 'id', {
      domainName: 'domainname',
      hostedZone: zoneObject,
      region: 'us-east-1',
      validation: CertificateValidation.fromDns(zoneObject)
    });

With this code, the CNAME record gets generated and added to the provided hosted zone. This works. But I compared the values of the automatically added CNAME record with the one that can be downloaded from the ACM GUI, and the NAME of the record is different: the generated record's name is missing the dot (.) at the end, while the downloaded one has it.

Just an observation. I can't actually test it myself, because the new GUI doesn't let me add a fully customizable name, and the button in the ACM GUI that adds the record automatically also trims the dot (.) away.

EDIT: When I switched to the old console, the dot appears in the CNAME, so my assumption is incorrect.

NGL321 commented 4 years ago

Looks like this is getting plenty of attention. I am reopening it

shivlaks commented 4 years ago

@opentrail - is it the same issue? can you help me with a minimal repro here? - I'd like to make sure that it's the same issue since we're reopening

    new certificatemanager.DnsValidatedCertificate(this, 'id', {
      domainName: 'some-name',
      hostedZone: zone
    })

I gave this snippet a shot in a couple of regions (us-east-1, us-west-2, eu-west-1, ca-central-1) and have not been able to reproduce the error conditions.

can you point me towards the snippet you're using and any region details?

opentrail commented 4 years ago

Hi Shiv,

It looks as though this is caused by missing NS records in Route53 for the domain in the cross account where we are adding alias/cname records.

Thanks,

Jonathan

knowsuchagency commented 4 years ago

@shivlaks I have a specific example you might try where I've been running into this issue https://github.com/knowsuchagency/airflow-cdk/pull/2

savnik commented 3 years ago

Any known work around on this issue? This is still an issue in aws-cdk version 1.64.1.

acdoussan commented 3 years ago

I had this same issue happen, and it turned out that my domain had a different set of name servers than the created hosted zone.

To fix it manually: You can update the name servers for a domain to match the hosted zone in the top right of the domain information on the R53 console (on the left menu click on "registered domains" then click on your domain in the list).

AWS docs for updating name servers here: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/domain-name-servers-glue-records.html

As for the CDK, the HostedZone construct should probably be updated to use the name servers that the domain is configured for so that multiple hosted zones can be created for the same domain.

It is also worth noting that I had transferred the domain from a different AWS account, and had no existing hosted zones. Not sure how the existing implementation determines what name servers to use for a hosted zone, but maybe this is why it is failing to use the correct ones?

ilko-rbi commented 3 years ago

I'm hitting this also again and interestingly it seems that it depends on the region - we are using eu-central-1 for everything besides the cognito certificates (they must be issued in us-east-1 for custom domains). In eu-central-1 the approval goes through in sec/mins - for us-east-1 it takes hours. I don't know what could be the follow up problems, but what if we add an option to skip the validation of the certificate issue status - is this possible at all?

oliverbenns commented 3 years ago

I had this issue too because my hosted zone is also created in the same deploy. It doesn't give me enough time to update the DNS values in my DNS registrar. Nor is there a prompt to do so.

I solved this by splitting the deploys out into 2 where I create the hosted zone, manually update the records and then later on deploy.

There is also an override method but I am unable to get the ns record for the first arg (a method on the hosted zone returns the values but not the s3 record), not entirely sure if this is implemented as the Go sdk is experimental: https://github.com/aws/aws-cdk-go/blob/awscdk/v1.102.0-devpreview/awscdk/awsroute53/awsroute53.go#L7416

Fitmavincent commented 3 years ago

What is the solution for this issue now? The ACM timeout causes a rollback of the whole CDK stack that I'm deploying, and I need to add a cert to the CloudFrontConfig.

Tebza17 commented 3 years ago

Here is a potential solution: Make sure that your Hosted Zone (the one you are writing the CNAME record-set to) is registered. Meaning: when you type in the "zone name" (i.e vincent.subdomain.domain.co.za) of the Hosted Zone in NSLookup it should return the 4 name servers. If it does not, then you cannot validate a certificate with that domain name (hosted zone)

papiro commented 3 years ago

@Tebza17 if it's not registered, how does one go about registering it?

Tebza17 commented 3 years ago

@papiro This AWS page is how.

The answer you want is in there

abend-arg commented 2 years ago

For those running into this problem, use the Certificate construct instead. It allows you to achieve the very same thing without a time limit. Something like this:

        const certificate = new acm.Certificate(this, `${PREFIX}LandingPageAcmCertificate`, {
            domainName: SITE_DOMAIN,
            subjectAlternativeNames: [`www.${SITE_DOMAIN}`],
            validation: acm.CertificateValidation.fromDns(rootHostedZone)
        });

njlynch commented 2 years ago

For those experiencing this issue:

Unless you absolutely need cross-region certificate issuance (e.g., requesting a us-east-1 certificate from another region for CloudFront), then converting to use the Certificate construct (as @AbendGithub notes above) is your best bet. The Certificate construct does not have the same time-out constraints as DnsValidatedCertificate and uses CloudFormation's internal workflow system for provisioning and validating.

If you must use DnsValidatedCertificate, give yourself the best possible chance of success by creating and deploying your Route53 HostedZone first, validating the domain with tools like dig, nslookup, etc., and only then adding the certificate to the deployment. See https://docs.aws.amazon.com/acm/latest/userguide/troubleshooting-DNS-validation.html for a list of common DNS validation troubleshooting tips. In particular, if something like % dig yourhostname.example.com does not return the 4 name servers associated with your hosted zone prior to starting the deployment, your certificate will never validate.
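
The name server comparison mentioned above can be sketched in a few lines. This is only an illustration: `normalizeNs` and `missingNameServers` are hypothetical helpers, and the name server values are placeholders rather than real delegation data.

```typescript
// Normalize a name server host name: lower-case, no trailing dot.
function normalizeNs(host: string): string {
  return host.trim().toLowerCase().replace(/\.$/, "");
}

// Returns the hosted zone's name servers that do not show up in the
// publicly resolved NS answer (for example the output of `dig NS`).
function missingNameServers(zoneNs: string[], publicNs: string[]): string[] {
  const published = new Set(publicNs.map(normalizeNs));
  return zoneNs.map(normalizeNs).filter((ns) => !published.has(ns));
}

// Placeholder values, assumed for illustration only.
const zoneNs = ["ns-1.awsdns-01.org.", "ns-2.awsdns-02.co.uk."];
const publicNs = ["NS-1.AWSDNS-01.ORG", "ns-9.example-registrar.net"];

// A non-empty result suggests the domain is delegated elsewhere.
console.log(missingNameServers(zoneNs, publicNs));
```

In Node.js the public answer could be fetched with `dns.promises.resolveNs(hostname)` instead of being pasted from `dig`. The normalization step matters because some tools print a trailing dot and others do not.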

BillyBunn commented 2 years ago

@njlynch Unfortunately I'm experiencing the same timeout issue, even with the Certificate construct. I've tried using both.

DnsValidatedCertificate timed out after a few minutes with

CREATE_FAILED | AWS::CloudFormation::CustomResource 
Received response status [FAILED] from custom resource. Message returned: Resource is not in the state certificateValidated
... stacktrace

Certificate timed out after a few hours with

CREATE_FAILED | AWS::CertificateManager::Certificate 
Certificate is in PENDING_VALIDATION status
... stacktrace

Also, both ways are unable to delete the failed stack because of DNS record sets created in the same deployment that pointed at a CloudFront alias (probably should be a separate issue).

DELETE_FAILED | AWS::Route53::HostedZone
The specified hosted zone contains non-required resource record sets  and so cannot be deleted.

Ran into this trying to deploy a static site (S3 bucket, CloudFront distribution, Route53 hosted zone, ACM certificate) with a domain registered already with Route53. I have noticed also what @acdoussan mentioned—the name servers for the registered domain do not match the hosted zone NS records made by PublicHostedZone.

Anything obvious that is causing this? My code:

    const websiteBucket = new s3.Bucket(this, "WebsiteBucket", {
      autoDeleteObjects: true,
      publicReadAccess: false,
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    const websiteHostedZone = new route53.PublicHostedZone(this, "WebsiteHostedZone", {
      zoneName: 'domain-name.com',
    });

    // Have also tried `DnsValidatedCertificate`
    const websiteCertificate = new certificateManager.Certificate(this, "WebsiteCertificate", {
      domainName: 'domain-name.com',
      subjectAlternativeNames: ['www.domain-name.com'],
      validation: certificateManager.CertificateValidation.fromDns(websiteHostedZone),
    });

    const websiteBucketDistribution = new cloudfront.Distribution(this, "WebsiteBucketDistribution", {
      certificate: websiteCertificate,
      defaultBehavior: {
        origin: new origins.S3Origin(websiteBucket),
        viewerProtocolPolicy: cloudfront.ViewerProtocolPolicy.REDIRECT_TO_HTTPS,
      },
      defaultRootObject: "index.html",
      domainNames: ['domain-name.com'],
    });

    new route53.ARecord(this, "WebsiteARecord", {
      target: route53.RecordTarget.fromAlias(new targets.CloudFrontTarget(websiteBucketDistribution)),
      recordName: 'domain-name.com',
      zone: websiteHostedZone,
    });

    new route53.AaaaRecord(this, "WebsiteAAAARecord", {
      target: route53.RecordTarget.fromAlias(new targets.CloudFrontTarget(websiteBucketDistribution)),
      recordName: 'domain-name.com',
      zone: websiteHostedZone,
    });

Edit: Can recreate with simply this

    const websiteHostedZone = new route53.PublicHostedZone(this, "WebsiteHostedZone", {
      zoneName: 'domain-name.com',
    });

    // Have also tried `DnsValidatedCertificate`
    const websiteCertificate = new certificateManager.Certificate(this, "WebsiteCertificate", {
      domainName: 'domain-name.com',
      subjectAlternativeNames: ['www.domain-name.com'],
      validation: certificateManager.CertificateValidation.fromDns(websiteHostedZone),
    });

ekeyser commented 2 years ago

I notice that when the zones are for domains that have not been purchased (lack NS registrar records) this happens. I suppose that makes some sort of sense since we're talking about domain ownership. I was doing testing and didn't want to buy a domain just for testing some cdk/cloudformation code. Maybe this note will help someone. Just sayin'.

dpistole commented 2 years ago

@BillyBunn, might be a long shot, but I switched to Certificate and my deploy started hanging as well. I never let it time out but I noticed in my gmail spam folder I had a bunch of emails from AWS re: Certificate Approval with a link that I had to click to approve the certificate. I marked them not as spam and tried again; clicking the approve link seemed to do the trick.

I switched back to the DNS validated cert afterward, and that one seems to work if I wait for the hostedZone to get created, then use its name servers to update the name servers section under registered domains via the UI. The deploy hangs while I do that but then seems to finish up.

lehotskysamuel commented 2 years ago

I'm sorry but I believe this can only be properly fixed by Amazon internal team.

The problem is that DnsValidatedCertificate works by creating a custom resource with lambda that adds those records and then waits for validation. But since this is a lambda, there is a max run time of 15 minutes. Yet based on comments above, validating certificates may take hours on us-east-1. I've been currently waiting on validation for 49 minutes and it's still not validated.

As to why we have to use the DnsValidatedCertificate: We are a team in Europe, with our main region being Ireland: eu-west-1. There are many certificates that require certs placed in N. Virginia: us-east-1. That rules out the regular acm.Certificate class because that class will only deploy to the main region.

We also don't want a separate stack that deploys into us-east-1 because then you cannot export certificate ARN and import it into another stack. Fn::importValue only works within the same region.

Workarounds: The only workaround right now is to deploy it in a separate stack into us-east-1, then have a second stack that exports certificate values which are hard-coded as strings (manual step) and then have a third stack which actually uses those values.
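
Because this workaround relies on a hand-copied ARN string, a small guard can catch a certificate pasted from the wrong region before it reaches a deployment. The sketch below is an assumption-laden illustration: `certificateRegion` is a hypothetical helper, and the ARN uses a made-up account ID and certificate ID.

```typescript
// Hypothetical guard: extract the region from an ACM certificate ARN.
// ARN layout: arn:partition:service:region:account-id:resource
function certificateRegion(arn: string): string {
  const parts = arn.split(":");
  if (parts.length < 6 || parts[0] !== "arn" || parts[2] !== "acm") {
    throw new Error(`Not an ACM certificate ARN: ${arn}`);
  }
  return parts[3];
}

// Made-up ARN, for illustration only.
const hardCodedArn =
  "arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012";

if (certificateRegion(hardCodedArn) !== "us-east-1") {
  throw new Error("Certificate must live in us-east-1");
}
console.log("certificate region looks right");
```

In a CDK app the checked string would then typically be passed to `acm.Certificate.fromCertificateArn` in the stack that consumes the certificate.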

One other workaround is to retry stack deployment early in the morning when it seems to get validated in time - but that is highly unreliable.

Solutions: Well ideally you could internally push for making certificate validations faster in that region and guarantee validations under 15 minutes. Or implement an API to do cross-region certificate creations, so CloudFormation would support this scenario natively (without the lambda). Or don't force us to deploy certificates to a specific region (us-east-1), then we could all happily use the acm.Certificate class.

I've never really used CustomResource, so don't know much about that. But is there a way to run something else than a lambda that might run for longer?

If you can't do any of that, you could at least make the stack deployments idempotent. The problem is that the custom resource Lambda fails and triggers a rollback, which orphans the certificate, and a new re-deployment doesn't use the original cert that might already be validated. There would be no problem if I could: deploy a stack, wait for it to fail due to the Lambda timeout, wait until the certificate is validated, re-deploy, and have it pick up the original certificate and complete successfully.

Does it really need to fail and trigger rollback? How come the main acm.Certificate within one region works?

At the very least this issue should be documented on the cdk page for DnsValidatedCertificate construct.

skrud-dt commented 1 year ago

I think this should be fixable by using CustomResourceProvider instead of just a straight CustomResource. Then the waiting could happen using isComplete instead of inline right after creating the certificate. Can CDK constructs use CustomResourceProvider?

(I may be able to take a stab at this)
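
A minimal sketch of what the isComplete side of that idea might look like, assuming the CDK custom-resource provider framework, where an isComplete handler returns an object with an `IsComplete` flag and is polled repeatedly rather than blocking inside one Lambda invocation. The function name `isCompleteForStatus` is hypothetical, and looking up the certificate status (for example via ACM's DescribeCertificate) is left out.

```typescript
// Response shape an isComplete handler returns to the provider framework.
interface IsCompleteResponse {
  IsComplete: boolean;
}

// Hypothetical mapping from an ACM certificate status to a polling decision.
function isCompleteForStatus(status: string): IsCompleteResponse {
  switch (status) {
    case "ISSUED":
      // Validation finished; the resource can be reported as created.
      return { IsComplete: true };
    case "PENDING_VALIDATION":
      // Not done yet; the framework calls the handler again later.
      return { IsComplete: false };
    default:
      // FAILED, VALIDATION_TIMED_OUT, REVOKED, ...: fail the resource.
      throw new Error(`Certificate ended in status ${status}`);
  }
}

console.log(isCompleteForStatus("PENDING_VALIDATION")); // { IsComplete: false }
```

With this split, each invocation returns quickly, so the 15-minute Lambda limit applies per poll rather than to the whole validation wait.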

github-actions[bot] commented 1 year ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.