aws-quickstart / cdk-eks-blueprints

AWS Quick Start Team
Apache License 2.0
447 stars 198 forks source link

(external-secrets addon): allow specifying `wait` property when deploying the helm chart so the `webhook` pod can stabilize #509

Closed Mr-istov closed 1 year ago

Mr-istov commented 1 year ago

Describe the bug

When deploying the ExternalSecrets addon together with a ClusterSecretStore or any resource that is of type ExternalSecrets CRD for the first time the deployment fails with the following error:

Received response status [FAILED] from custom resource. Message returned: Error: b'Error from server (InternalError): error when creating "/tmp/manifest.yaml": Internal error occurred:
failed calling webhook "validate.clustersecretstore.external-secrets.io": Post "https://external-secrets-webhook.external-secrets.svc:443/validate-external-secrets-io-v1beta1-clusterse
cretstore?timeout=5s": no endpoints available for service "external-secrets-webhook"\n'

I do have a dependency between the ClusterSecretStore and the ExternalSecrets addon, but the error persists. It seems other users had the same issue: https://stackoverflow.com/questions/73711481/in-cdk-can-i-wait-until-a-helm-installed-operator-is-running-before-applying-a

The error happens only when deploying the ExternalSecrets for the first time together with some custom resource that requires it to be up and running. If the ExternalSecrets is deployed from a previous deployment, the next deployments can add custom resources.

Expected Behavior

ExternalSecrets deployed and running, together with the custom resource ClusterSecretStore added.

Current Behavior

Cloudformation deployment fails with the error from above.

Reproduction Steps

Create a new stack, add the addon and create a custom resource, then deploy the stack.

Possible Solution

Allow specifying the wait property when deploying the ExternalSecrets addon, so it can be set to true: https://github.com/aws-quickstart/cdk-eks-blueprints/blob/main/lib/addons/external-secrets/index.ts#L93

Additional Information/Context

I have tried modifying the code locally by including the wait property and setting it to true as suggested from StackOverflow it worked.

CDK CLI Version

2.43.0 (build 487870a)

EKS Blueprints Version

1.3.0

Node.js Version

v18.8.0

Environment details (OS name and version, etc.)

macOS 12.6

Other information

cc @pflorek Adding @pflorek as cc here since I can see that he created the ExternalSecrets addon and PR #480

elamaran11 commented 1 year ago

@Mr-istov Could you please share us your code snippet of your stack for us to take a look and recreate the Scenario?

Mr-istov commented 1 year ago

@Mr-istov Could you please share us your code snippet of your stack for us to take a look and recreate the Scenario?

Hey @elamaran11, sorry for the late response. Yes, it's not a problem, I was just testing out the @aws-quickstart/cdk-eks-blueprints module, so I have an example test stack. This is the code:

lib/stack.ts:

import { StackProps, aws_eks, aws_ec2 } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as blueprints from "@aws-quickstart/eks-blueprints";

export class ExternalSecretsStoresAddOn implements blueprints.ClusterAddOn {
  @blueprints.utils.dependable("ExternalsSecretsAddOn")
  deploy(clusterInfo: blueprints.ClusterInfo): void | Promise<Construct> {
    const cluster = clusterInfo.cluster;

    const manifest = [
      {
        apiVersion: "external-secrets.io/v1beta1",
        kind: "ClusterSecretStore",
        metadata: {
          name: "secrets-manager-provider",
        },
        spec: {
          provider: {
            aws: {
              service: "SecretsManager",
              region: cluster.stack.region,
            },
          },
        },
      },
    ];

    const manifestNode = new aws_eks.KubernetesManifest(
      cluster.stack,
      "secrets-manager-provider-manifest",
      {
        cluster,
        manifest,
        overwrite: true,
      }
    );

    return Promise.resolve(manifestNode);
  }
}

export class CdkEksTempStack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    const externalSecretsAddon = new blueprints.addons.ExternalsSecretsAddOn();
    const externalSecretsStores = new ExternalSecretsStoresAddOn();

    const addons: Array<blueprints.ClusterAddOn> = [
      externalSecretsAddon,
      externalSecretsStores,
    ];

    const managedNodeGroup = new blueprints.MngClusterProvider({
      id: "main-nodegroup",
      version: aws_eks.KubernetesVersion.of("1.22"),
      instanceTypes: [new aws_ec2.InstanceType("t3.large")],
    });

    blueprints.EksBlueprint.builder()
      .addOns(...addons)
      .clusterProvider(managedNodeGroup)
      .build(scope, id, props);
  }
}

bin/main.ts:

#!/usr/bin/env node
import "source-map-support/register";
import * as cdk from "aws-cdk-lib";
import { CdkEksTempStack } from "../lib/stack";

const app = new cdk.App();
new CdkEksTempStack(app, "CdkEksTempStack");
Mr-istov commented 1 year ago

After running npm run cdk deploy on the above Stack for the first time, this is the error that arises in CloudFormation:

Screenshot 2022-09-27 at 13 32 26

pflorek commented 1 year ago

@Mr-istov Thank you! Yes, in the other clusters I had to configure the chart to wait for the ready state.

I've opened the PR for the fix.

Mmh, it seems the props for wait and timeout are not to be meant as user props in HelmAddOnUserProps which are in HelmChartDeployment and internally used... ... I guess in the meantime it's only deployable by two steps. First deploy ESO, then the stores and secrets.

Mr-istov commented 1 year ago

@Mr-istov Thank you! Yes, in the other clusters I had to configure the chart to wait for the ready state.

I've opened the PR for the fix.

Mmh, it seems the props for wait and timeout are not to be meant as user props in HelmAddOnUserProps which are in HelmChartDeployment and internally used... ... I guess in the meantime it's only deployable by two steps. First deploy ESO, then the stores and secrets.

Thank you @pflorek for opening a PR to address this. Yeah, it's not a problem for now to deploy it in two steps until your PR is merged.