aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

[certificate manager] DnsValidatedCertificate deployment fails in cn-north-1 region happened again #16041

Closed qiongwan-Andy closed 3 years ago

qiongwan-Andy commented 3 years ago

Same As: https://github.com/aws/aws-cdk/issues/8697

This is :bug: Bug Report

qiongwan-Andy commented 3 years ago

Hi team,

We are currently unable to use CDK to create a certificate and upsert its validation record into Route 53 in cn-north-1. I noticed that the same error was reported last year in #8697. That issue has been closed and was presumably resolved, but unfortunately the same error has appeared again.

        let route53Endpoint = "route53.amazonaws.com.cn";
        const acmCert = new certmgr.DnsValidatedCertificate(this, 'ACMCertificate', {
            domainName: `*.${rootDNS}`,
            hostedZone: hostedZone,
            route53Endpoint: route53Endpoint
        });

log:


2021-08-13T16:59:41.202+08:00 | START RequestId: 052893c8-5852-494c-9a84-21adf29f7be3 Version: $LATEST
2021-08-13T16:59:41.358+08:00 | 2021-08-13T08:59:41.321Z 052893c8-5852-494c-9a84-21adf29f7be3 INFO Requesting certificate for *.poplar-bjs-preprod.lychee.aws.a2z.org.cn
2021-08-13T16:59:42.178+08:00 | 2021-08-13T08:59:42.161Z 052893c8-5852-494c-9a84-21adf29f7be3 INFO Certificate ARN: arn:aws-cn:acm:cn-north-1:440641178861:certificate/9635a2f6-9a67-4d8a-9fef-94d1b925a771
2021-08-13T16:59:42.178+08:00 | 2021-08-13T08:59:42.161Z 052893c8-5852-494c-9a84-21adf29f7be3 INFO Waiting for ACM to provide DNS records for validation...
2021-08-13T16:59:45.555+08:00 | 2021-08-13T08:59:45.555Z 052893c8-5852-494c-9a84-21adf29f7be3 INFO Upserting 1 DNS records into zone Z03414709LCT0FB1R4SM:
2021-08-13T16:59:45.555+08:00 | 2021-08-13T08:59:45.555Z 052893c8-5852-494c-9a84-21adf29f7be3 INFO _af7b9d230b81cedf93ebe84689b187f1.poplar-bjs-preprod.lychee.aws.a2z.org.cn. CNAME _43044bddaa039e4b761588469116dad4.kfpsnxvjp.acm-validations.amazonaws.cn.
2021-08-13T16:59:46.100+08:00 | 2021-08-13T08:59:46.100Z 052893c8-5852-494c-9a84-21adf29f7be3 INFO Caught error SignatureDoesNotMatch: Credential should be scoped to a valid region, not 'cn-north-1'. . Uploading FAILED message to S3.
2021-08-13T16:59:46.239+08:00 | END RequestId: 052893c8-5852-494c-9a84-21adf29f7be3

According to the log, the Lambda function throws an exception when calling route53.changeResourceRecordSets, so the validation record is never upserted into the hosted zone. This issue only occurs in BJS (cn-north-1); ZHY (cn-northwest-1) is totally fine.
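
The error message refers to the SigV4 credential scope, which embeds the region a request was signed for. Below is a minimal sketch of how that scope string is put together, assuming (this is not confirmed anywhere in this thread) that Route 53 in the aws-cn partition only accepts requests signed for cn-northwest-1. `credentialScope` and `ASSUMED_ROUTE53_CN_SIGNING_REGION` are hypothetical names for illustration, not SDK APIs.

```javascript
'use strict';

// Assumption for illustration only: the signing region Route 53 accepts in
// the aws-cn partition. The logs above only say that 'cn-north-1' is rejected.
const ASSUMED_ROUTE53_CN_SIGNING_REGION = 'cn-northwest-1';

/**
 * Build a SigV4 credential scope: <YYYYMMDD>/<region>/<service>/aws4_request.
 * Hypothetical helper; the SDK does the equivalent internally when signing.
 */
function credentialScope(date, region, service) {
  const yyyymmdd = date.toISOString().slice(0, 10).replace(/-/g, '');
  return [yyyymmdd, region, service, 'aws4_request'].join('/');
}

// Timestamp of the failing call, taken from the log above.
const when = new Date('2021-08-13T08:59:46.100Z');

// Scope a client would produce if it signs with the Lambda's own region.
const scopeFromLambdaRegion = credentialScope(when, 'cn-north-1', 'route53');

// Scope the service would accept under the assumption stated above.
const scopeExpected = credentialScope(when, ASSUMED_ROUTE53_CN_SIGNING_REGION, 'route53');

console.log(scopeFromLambdaRegion);
console.log(scopeExpected);
```

If that assumption holds, the two strings differ only in the region segment, which would match a SignatureDoesNotMatch error that names 'cn-north-1'.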

Env 1 (failed):

CDKBuild 3.x
monocdk 1.115
nodejs 14.x
@amaz/pipeline 3.0.0
TypeScript 4.3.5

Env 2 (failed):

CDKBuild 1.45
aws-cdk 1.45
nodejs 12.x
@amaz/pipeline 1.45.0
TypeScript 3.x.x

I also noticed https://github.com/aws/aws-sdk-js/pull/3274; maybe this is the reason why cn-northwest-1 succeeded but cn-north-1 failed.

qiongwan-Andy commented 3 years ago

Moreover, my understanding is that route53Endpoint should be "route53.amazonaws.com.cn"; please correct me if I'm wrong.

qiongwan-Andy commented 3 years ago

Lambda function generated:

'use strict';

const aws = require('aws-sdk');

const defaultSleep = function(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
};

// These are used for test purposes only
let defaultResponseURL;
let waiter;
let sleep = defaultSleep;
let random = Math.random;
let maxAttempts = 10;

/**
 * Upload a CloudFormation response object to S3.
 *
 * @param {object} event the Lambda event payload received by the handler function
 * @param {object} context the Lambda context received by the handler function
 * @param {string} responseStatus the response status, either 'SUCCESS' or 'FAILED'
 * @param {string} physicalResourceId CloudFormation physical resource ID
 * @param {object} [responseData] arbitrary response data object
 * @param {string} [reason] reason for failure, if any, to convey to the user
 * @returns {Promise} Promise that is resolved on success, or rejected on connection error or HTTP error response
 */
let report = function(event, context, responseStatus, physicalResourceId, responseData, reason) {
  return new Promise((resolve, reject) => {
    const https = require('https');
    const { URL } = require('url');

    var responseBody = JSON.stringify({
      Status: responseStatus,
      Reason: reason,
      PhysicalResourceId: physicalResourceId || context.logStreamName,
      StackId: event.StackId,
      RequestId: event.RequestId,
      LogicalResourceId: event.LogicalResourceId,
      Data: responseData
    });

    const parsedUrl = new URL(event.ResponseURL || defaultResponseURL);
    const options = {
      hostname: parsedUrl.hostname,
      port: 443,
      path: parsedUrl.pathname + parsedUrl.search,
      method: 'PUT',
      headers: {
        'Content-Type': '',
        'Content-Length': Buffer.byteLength(responseBody, 'utf8')
      }
    };

    https.request(options)
      .on('error', reject)
      .on('response', res => {
        res.resume();
        if (res.statusCode >= 400) {
          reject(new Error(`Server returned error ${res.statusCode}: ${res.statusMessage}`));
        } else {
          resolve();
        }
      })
      .end(responseBody, 'utf8');
  });
};

/**
 * Requests a public certificate from AWS Certificate Manager, using DNS validation.
 * The hosted zone ID must refer to a **public** Route53-managed DNS zone that is authoritative
 * for the suffix of the certificate's Common Name (CN).  For example, if the CN is
 * `*.example.com`, the hosted zone ID must point to a Route 53 zone authoritative
 * for `example.com`.
 *
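 * @param {string} requestId the CloudFormation request ID
 * @param {string} domainName the Common Name (CN) field for the requested certificate
 * @param {string[]} [subjectAlternativeNames] additional names to include in the certificate
 * @param {string} hostedZoneId the Route53 Hosted Zone ID
 * @param {string} region the region in which the ACM client is created
 * @param {string} [route53Endpoint] optional endpoint override for the Route53 client
 * @returns {Promise<string>} Validated certificate ARN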
 */
const requestCertificate = async function(requestId, domainName, subjectAlternativeNames, hostedZoneId, region, route53Endpoint) {
  const crypto = require('crypto');
  const acm = new aws.ACM({ region });
  const route53 = route53Endpoint ? new aws.Route53({endpoint: route53Endpoint}) : new aws.Route53();
  if (waiter) {
    // Used by the test suite, since waiters aren't mockable yet
    route53.waitFor = acm.waitFor = waiter;
  }

  console.log(`Requesting certificate for ${domainName}`);

  const reqCertResponse = await acm.requestCertificate({
    DomainName: domainName,
    SubjectAlternativeNames: subjectAlternativeNames,
    IdempotencyToken: crypto.createHash('sha256').update(requestId).digest('hex').substr(0, 32),
    ValidationMethod: 'DNS'
  }).promise();

  console.log(`Certificate ARN: ${reqCertResponse.CertificateArn}`);

  console.log('Waiting for ACM to provide DNS records for validation...');

  let records;
  for (let attempt = 0; attempt < maxAttempts && !records; attempt++) {
    const { Certificate } = await acm.describeCertificate({
      CertificateArn: reqCertResponse.CertificateArn
    }).promise();
    const options = Certificate.DomainValidationOptions || [];
    if (options.length > 0 && options[0].ResourceRecord) {
      // some alternative names will produce the same validation record
      // as the main domain (eg. example.com + *.example.com)
      // filtering duplicates to avoid errors with adding the same record
      // to the route53 zone twice
      const unique = options
        .map((val) => val.ResourceRecord)
        .reduce((acc, cur) => {
          acc[cur.Name] = cur;
          return acc;
        }, {});
      records = Object.keys(unique).sort().map(key => unique[key]);
    } else {
      // Exponential backoff with jitter based on 200ms base
      // component of backoff fixed to ensure minimum total wait time on
      // slow targets.
      const base = Math.pow(2, attempt);
      await sleep(random() * base * 50 + base * 150);
    }
  }
  if (!records) {
    throw new Error(`Response from describeCertificate did not contain DomainValidationOptions after ${maxAttempts} attempts.`)
  }

  console.log(`Upserting ${records.length} DNS records into zone ${hostedZoneId}:`);

  const changeBatch = await route53.changeResourceRecordSets({
    ChangeBatch: {
      Changes: records.map((record) => {
        console.log(`${record.Name} ${record.Type} ${record.Value}`)
        return {
          Action: 'UPSERT',
          ResourceRecordSet: {
            Name: record.Name,
            Type: record.Type,
            TTL: 60,
            ResourceRecords: [{
              Value: record.Value
            }]
          }
        };
      }),
    },
    HostedZoneId: hostedZoneId
  }).promise();

  console.log('Waiting for DNS records to commit...');
  await route53.waitFor('resourceRecordSetsChanged', {
    // Wait up to 5 minutes
    $waiter: {
      delay: 30,
      maxAttempts: 10
    },
    Id: changeBatch.ChangeInfo.Id
  }).promise();

  console.log('Waiting for validation...');
  await acm.waitFor('certificateValidated', {
    // Wait up to 9 minutes and 30 seconds
    $waiter: {
      delay: 30,
      maxAttempts: 19
    },
    CertificateArn: reqCertResponse.CertificateArn
  }).promise();

  return reqCertResponse.CertificateArn;
};

/**
 * Deletes a certificate from AWS Certificate Manager (ACM) by its ARN.
 * If the certificate does not exist, the function will return normally.
 *
 * @param {string} arn The certificate ARN
 */
const deleteCertificate = async function(arn, region) {
  const acm = new aws.ACM({ region });

  try {
    console.log(`Waiting for certificate ${arn} to become unused`);

    let inUseByResources;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      const { Certificate } = await acm.describeCertificate({
        CertificateArn: arn
      }).promise();

      inUseByResources = Certificate.InUseBy || [];

      if (inUseByResources.length) {
        // Exponential backoff with jitter based on 200ms base
        // component of backoff fixed to ensure minimum total wait time on
        // slow targets.
        const base = Math.pow(2, attempt);
        await sleep(random() * base * 50 + base * 150);
      } else {
        break
      }
    }

    if (inUseByResources.length) {
      throw new Error(`Response from describeCertificate did not contain an empty InUseBy list after ${maxAttempts} attempts.`)
    }

    console.log(`Deleting certificate ${arn}`);

    await acm.deleteCertificate({
      CertificateArn: arn
    }).promise();
  } catch (err) {
    if (err.name !== 'ResourceNotFoundException') {
      throw err;
    }
  }
};

/**
 * Main handler, invoked by Lambda
 */
exports.certificateRequestHandler = async function(event, context) {
  var responseData = {};
  var physicalResourceId;
  var certificateArn;

  try {
    switch (event.RequestType) {
      case 'Create':
      case 'Update':
        certificateArn = await requestCertificate(
          event.RequestId,
          event.ResourceProperties.DomainName,
          event.ResourceProperties.SubjectAlternativeNames,
          event.ResourceProperties.HostedZoneId,
          event.ResourceProperties.Region,
          event.ResourceProperties.Route53Endpoint,
        );
        responseData.Arn = physicalResourceId = certificateArn;
        break;
      case 'Delete':
        physicalResourceId = event.PhysicalResourceId;
        // If the resource didn't create correctly, the physical resource ID won't be the
        // certificate ARN, so don't try to delete it in that case.
        if (physicalResourceId.startsWith('arn:')) {
          await deleteCertificate(physicalResourceId, event.ResourceProperties.Region);
        }
        break;
      default:
        throw new Error(`Unsupported request type ${event.RequestType}`);
    }

    console.log(`Uploading SUCCESS response to S3...`);
    await report(event, context, 'SUCCESS', physicalResourceId, responseData);
    console.log('Done.');
  } catch (err) {
    console.log(`Caught error ${err}. Uploading FAILED message to S3.`);
    await report(event, context, 'FAILED', physicalResourceId, null, err.message);
  }
};

/**
 * @private
 */
exports.withReporter = function(reporter) {
  report = reporter;
};

/**
 * @private
 */
exports.withDefaultResponseURL = function(url) {
  defaultResponseURL = url;
};

/**
 * @private
 */
exports.withWaiter = function(w) {
  waiter = w;
};

/**
 * @private
 */
exports.resetWaiter = function() {
  waiter = undefined;
};

/**
 * @private
 */
exports.withSleep = function(s) {
  sleep = s;
}

/**
 * @private
 */
exports.resetSleep = function() {
  sleep = defaultSleep;
}

/**
 * @private
 */
exports.withRandom = function(r) {
  random = r;
}

/**
 * @private
 */
exports.resetRandom = function() {
  random = Math.random;
}

/**
 * @private
 */
exports.withMaxAttempts = function(ma) {
  maxAttempts = ma;
}

/**
 * @private
 */
exports.resetMaxAttempts = function() {
  maxAttempts = 10;
}
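
For reference, both retry loops in the handler above sleep for `random() * base * 50 + base * 150` milliseconds, with `base = 2^attempt`. The sketch below reproduces that formula to show how long the loops can wait in total with the default `maxAttempts` of 10. `backoffDelayMs` and `totalWaitBoundsMs` are hypothetical helper names, not part of the generated Lambda.

```javascript
'use strict';

/**
 * Delay for one attempt, using the same formula as the handler:
 * a fixed component of 150ms * 2^attempt plus up to 50ms * 2^attempt of jitter.
 * `random` is injectable so the bounds can be computed deterministically.
 */
function backoffDelayMs(attempt, random = Math.random) {
  const base = Math.pow(2, attempt);
  return random() * base * 50 + base * 150;
}

/**
 * Lower and upper bound on the total time slept if every attempt has to wait.
 * The upper bound is exclusive in practice, because Math.random() is < 1.
 */
function totalWaitBoundsMs(maxAttempts) {
  let min = 0;
  let max = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    min += backoffDelayMs(attempt, () => 0);
    max += backoffDelayMs(attempt, () => 1);
  }
  return { min, max };
}

const bounds = totalWaitBoundsMs(10);
console.log(`total wait between ${bounds.min} ms and ${bounds.max} ms`);
```

With 10 attempts, that works out to roughly 2.5 to 3.4 minutes before the loop gives up.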

njlynch commented 3 years ago

I am able to reproduce the failure. I've also managed to deploy in a way that doesn't produce the failure, but I can't actually verify that the certificate would be validated properly, as I don't have a valid public domain in Route53 in the cn partition.

Two suggestions:

  1. Remove the route53Endpoint prop from the DnsValidatedCertificate. I believe that now that the related aws-sdk-js change has been pushed, providing this argument is no longer necessary (I could be wrong about this). For me, removing the prop led to the records being correctly upserted into the zone.
  2. Switch from DnsValidatedCertificate to Certificate. Unless you're providing a region property and requesting the certificate cross-region, the DnsValidatedCertificate adds zero value.

Give one (or both) of those a shot, and let me know if it fixes it.

qiongwan-Andy commented 3 years ago

Hi njlynch@

Thanks for your help. I can successfully upsert the validation record after removing the route53Endpoint prop.

However, the deployment then failed with a timeout while waiting for the certificate to be validated in ACM.

        const acmCert = new certmgr.DnsValidatedCertificate(this, 'ACMCertificate', {
            domainName: `*.${rootDNS}`,
            hostedZone: hostedZone,
            // route53Endpoint: route53Endpoint
        });

Logs:


2021-08-25T11:22:33.178+08:00 | START RequestId: 0b290e1d-cea9-4970-b6ba-08371854c66f Version: $LATEST
2021-08-25T11:22:33.440+08:00 | 2021-08-25T03:22:33.439Z 0b290e1d-cea9-4970-b6ba-08371854c66f INFO Requesting certificate for *.poplar-bjs-preprod-test.lychee.aws.a2z.org.cn
2021-08-25T11:22:34.619+08:00 | 2021-08-25T03:22:34.619Z 0b290e1d-cea9-4970-b6ba-08371854c66f INFO Certificate ARN: arn:aws-cn:acm:cn-north-1:573043591598:certificate/65481376-fa25-4991-836a-a2fec1b24035
2021-08-25T11:22:34.619+08:00 | 2021-08-25T03:22:34.619Z 0b290e1d-cea9-4970-b6ba-08371854c66f INFO Waiting for ACM to provide DNS records for validation...
2021-08-25T11:22:38.699+08:00 | 2021-08-25T03:22:38.699Z 0b290e1d-cea9-4970-b6ba-08371854c66f INFO Upserting 1 DNS records into zone Z03530063FK6GO7DAADLW:
2021-08-25T11:22:38.699+08:00 | 2021-08-25T03:22:38.699Z 0b290e1d-cea9-4970-b6ba-08371854c66f INFO _65ac211a31069466ecbae01747baaa7d.poplar-bjs-preprod-test.lychee.aws.a2z.org.cn. CNAME _b1546f3171fa8ea412ad8732d57b45da.kfpsnxvjp.acm-validations.amazonaws.cn.
2021-08-25T11:22:39.298+08:00 | 2021-08-25T03:22:39.298Z 0b290e1d-cea9-4970-b6ba-08371854c66f INFO Waiting for DNS records to commit...
2021-08-25T11:23:10.117+08:00 | 2021-08-25T03:23:10.117Z 0b290e1d-cea9-4970-b6ba-08371854c66f INFO Waiting for validation...
2021-08-25T11:32:44.983+08:00 | 2021-08-25T03:32:44.983Z 0b290e1d-cea9-4970-b6ba-08371854c66f INFO Caught error ResourceNotReady: Resource is not in the state certificateValidated. Uploading FAILED message to S3.
2021-08-25T11:32:45.084+08:00 | END RequestId: 0b290e1d-cea9-4970-b6ba-08371854c66f
2021-08-25T11:32:45.084+08:00 | REPORT RequestId: 0b290e1d-cea9-4970-b6ba-08371854c66f Duration: 611898.60 ms Billed Duration: 611899 ms Memory Size: 128 MB Max Memory Used: 91 MB Init Duration: 379.56 ms
2021-08-25T11:33:00.976+08:00 | START RequestId: 0161f73d-7cc4-4bbc-abfb-42eb182b1def Version: $LATEST
2021-08-25T11:33:00.981+08:00 | 2021-08-25T03:33:00.981Z 0161f73d-7cc4-4bbc-abfb-42eb182b1def INFO Uploading SUCCESS response to S3...
2021-08-25T11:33:01.043+08:00 | 2021-08-25T03:33:01.043Z 0161f73d-7cc4-4bbc-abfb-42eb182b1def INFO Done.
2021-08-25T11:33:01.045+08:00 | END RequestId: 0161f73d-7cc4-4bbc-abfb-42eb182b1def
2021-08-25T11:33:01.045+08:00 | REPORT RequestId: 0161f73d-7cc4-4bbc-abfb-42eb182b1def Duration: 64.37 ms Billed Duration: 65 ms Memory S

Judging by the Lambda function, 9 minutes 30 seconds does not seem to be long enough for the validation process:

  console.log('Waiting for validation...');
  await acm.waitFor('certificateValidated', {
    // Wait up to 9 minutes and 30 seconds
    $waiter: {
      delay: 30,
      maxAttempts: 19
    },
    CertificateArn: reqCertResponse.CertificateArn
  }).promise();
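
As a rough cross-check, the waiter budget implied by `delay` and `maxAttempts` can be compared with the gap between the "Waiting for validation..." line and the ResourceNotReady line in the log above. This is only a sketch: `waiterBudgetSeconds` is a hypothetical helper, and it assumes the waiter polls `maxAttempts` times with `delay` seconds per attempt, as the code comment suggests.

```javascript
'use strict';

/**
 * Approximate total time a waiter will spend polling, in seconds.
 * Assumes one `delay`-second pause per attempt (see the lead-in).
 */
function waiterBudgetSeconds({ delay, maxAttempts }) {
  return delay * maxAttempts;
}

// Settings copied from the handler.
const validationBudget = waiterBudgetSeconds({ delay: 30, maxAttempts: 19 });
const dnsCommitBudget = waiterBudgetSeconds({ delay: 30, maxAttempts: 10 });

// Timestamps copied from the log above.
const waitStarted = Date.parse('2021-08-25T03:23:10.117Z');
const waitFailed = Date.parse('2021-08-25T03:32:44.983Z');
const observedMs = waitFailed - waitStarted;

console.log(`validation budget: ${validationBudget} s`);
console.log(`DNS commit budget: ${dnsCommitBudget} s`);
console.log(`observed wait before ResourceNotReady: ${(observedMs / 1000).toFixed(1)} s`);
```

The observed wait is only a few seconds longer than the 570 second budget, which suggests the waiter simply ran out of attempts while the certificate was still pending.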

qiongwan-Andy commented 3 years ago

About 2 hours have passed, and the cert is still pending validation.

njlynch commented 3 years ago

Was the cert ever validated?

If not, I'd recommend walking through some of the troubleshooting steps from the Certificate Manager documentation: https://docs.aws.amazon.com/acm/latest/userguide/certificate-validation.html

If you're not already, another thing to consider is creating the Hosted Zone first, setting up the name servers with your domain registrar, and then deploying the certificate. Creating a brand-new hosted zone and validating in the same CloudFormation deployment is almost never going to work given DNS propagation times.
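
One way to check the delegation before deploying is to compare the name servers assigned to the hosted zone with the NS records that public DNS returns for the domain. Below is a minimal sketch using Node's built-in `dns` module. `missingNameServers` and `checkDelegation` are hypothetical helpers, and the list of zone name servers is something you would copy from the hosted zone yourself.

```javascript
'use strict';

const dns = require('dns').promises;

// Compare names case-insensitively and ignore a trailing dot.
function normalize(ns) {
  return ns.toLowerCase().replace(/\.$/, '');
}

/**
 * Return the hosted zone name servers that are NOT present in the
 * NS records published for the domain. An empty array means the
 * delegation covers every name server of the zone.
 */
function missingNameServers(zoneNameServers, publicNameServers) {
  const published = new Set(publicNameServers.map(normalize));
  return zoneNameServers.map(normalize).filter((ns) => !published.has(ns));
}

/**
 * Look up the NS records for `zoneName` with the system resolver and
 * report which of the expected name servers are missing.
 */
async function checkDelegation(zoneName, zoneNameServers) {
  const publicNameServers = await dns.resolveNs(zoneName);
  return missingNameServers(zoneNameServers, publicNameServers);
}

module.exports = { missingNameServers, checkDelegation };
```

Calling `checkDelegation('example.com', [...])` with the zone's own name servers should resolve to an empty array once the registrar change has propagated. A non-empty result, or a lookup error, would suggest that validation is likely to stall.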

Lastly, if the issue is just needing a longer timeout, I will once again suggest migrating to Certificate, which uses an entirely different engine for waiting for the certificate to validate.

github-actions[bot] commented 3 years ago

This issue has not received a response in a while. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.