awslabs / data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
https://awslabs.github.io/data-on-eks/
Apache License 2.0
601 stars, 205 forks

failed calling webhook "mservice.elbv2.k8s.aws" #458

Open mayurbhagia opened 6 months ago

mayurbhagia commented 6 months ago

I'm installing Spark Operator with YuniKorn from Cloud9 in my AWS account, and install.sh ends with the two errors below:

  Error: 2 errors occurred:
    Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": dial tcp 100.64.184.123:9443: connect: connection refused
    Internal error occurred: failed calling webhook "mservice.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-service?timeout=10s": dial tcp 100.64.184.123:9443: connect: connection refused

vara-bonthu commented 6 months ago

I think it's a timing issue. Running terraform apply or install.sh again should fix it.

Please feel free to update the troubleshooting guide at https://github.com/awslabs/data-on-eks/blob/main/website/docs/blueprints/troubleshooting/troubleshooting.md if this approach resolves the issue.
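If it helps confirm the timing theory before rerunning, here is a minimal sketch that polls until the webhook Service named in the error message has at least one endpoint address. The function name wait_for_lbc_webhook and the KUBECTL, WEBHOOK_TIMEOUT, and WEBHOOK_INTERVAL variables are made up for this example and are not part of the blueprint. A populated endpoint list only suggests the controller pod is ready; it does not guarantee the webhook call will succeed.

```shell
# Sketch: wait until aws-load-balancer-webhook-service has a ready endpoint.
# KUBECTL, WEBHOOK_TIMEOUT, WEBHOOK_INTERVAL are hypothetical knobs for this example.
wait_for_lbc_webhook() {
  kubectl_bin="${KUBECTL:-kubectl}"
  deadline="${WEBHOOK_TIMEOUT:-300}"   # total seconds to wait
  interval="${WEBHOOK_INTERVAL:-5}"    # seconds between polls
  waited=0
  while :; do
    # Prints the IPs of ready endpoints, or nothing if none exist yet.
    ips="$("$kubectl_bin" -n kube-system get endpoints aws-load-balancer-webhook-service \
      -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null || true)"
    if [ -n "$ips" ]; then
      echo "webhook endpoints ready: $ips"
      return 0
    fi
    if [ "$waited" -ge "$deadline" ]; then
      echo "webhook not ready after ${waited}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Something like `wait_for_lbc_webhook && ./install.sh` would then rerun the install only once the endpoint shows up.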

raykrueger commented 6 months ago

Currently, this consistently requires two executions of install.sh.
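Until there is a proper fix, the second run can be scripted with a small retry wrapper. This is only a sketch of the workaround, not something install.sh does today. The retry function and the MAX_ATTEMPTS and RETRY_DELAY variables are hypothetical names chosen for illustration.

```shell
# Sketch: rerun a command until it succeeds or attempts run out.
# MAX_ATTEMPTS and RETRY_DELAY are hypothetical knobs for this example.
retry() {
  retry_max="${MAX_ATTEMPTS:-3}"
  retry_delay="${RETRY_DELAY:-30}"
  retry_n=1
  until "$@"; do
    if [ "$retry_n" -ge "$retry_max" ]; then
      echo "giving up after ${retry_n} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${retry_n} failed; retrying in ${retry_delay}s" >&2
    sleep "$retry_delay"
    retry_n=$((retry_n + 1))
  done
}
```

For example, `MAX_ATTEMPTS=2 retry ./install.sh` would cover the two executions described above.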

raykrueger commented 6 months ago

I'm betting we need to bump up that 10s timeout, but currently we'd be blocked on... https://github.com/kubernetes-sigs/aws-load-balancer-controller/issues/2711

askulkarni2 commented 6 months ago

This is due to a mutating webhook introduced for LBC v2.5+. Per the docs...

The AWS LBC provides a mutating webhook for service resources to set the spec.loadBalancerClass field for service of type LoadBalancer on create. This makes the AWS LBC the default controller for service of type LoadBalancer. You can disable this feature and revert to set Cloud Controller Manager (in-tree controller) as the default by setting the helm chart value enableServiceMutatorWebhook to false with --set enableServiceMutatorWebhook=false. You will no longer be able to provision new Classic Load Balancer (CLB) from your kubernetes service unless you disable this feature. Existing CLB will continue to work fine.

If you do not need the webhook, you can disable it as shown here:

  # Turn off mutation webhook for services to avoid ordering issue
  enable_aws_load_balancer_controller = true
  aws_load_balancer_controller = {
    set = [{
      name  = "enableServiceMutatorWebhook"
      value = "false"
    }]
  }

Ref: https://github.com/aws-ia/terraform-aws-eks-blueprints-addons/blob/257677adeed1be54326637cf919cf24df6ad7c06/tests/complete/main.tf#L120-L125

We should add this to our blueprints. I'll mark it as a bug for tracking.