TykTechnologies / tyk-operator

Tyk Operator for Kubernetes
https://tyk.io

[TT-4682] Intermittent "connection refused" for webhooks on AWS EKS #398

Closed blagerweij closed 2 years ago

blagerweij commented 2 years ago

We have deployed tyk-gateway and the tyk-operator on Amazon EKS. However, when adding a new API definition, we get intermittent errors. Sometimes creating the new apidefinition CRD works, but often the webhooks report errors:

➜  tyk-playground git:(main) ✗ kubectl apply -f api-def.yaml
Error from server (InternalError): error when creating "api-def.yaml": Internal error occurred: failed calling webhook "mapidefinition.kb.io": Post "https://tyk-operator-webhook-service.default.svc:443/mutate-tyk-tyk-io-v1alpha1-apidefinition?timeout=10s": dial tcp 10.201.112.161:9443: connect: connection refused
➜  tyk-playground git:(main) ✗ kubectl apply -f api-def.yaml
apidefinition.tyk.tyk.io/askari-api created

The error is intermittent: creation succeeds about 30% of the time and fails about 70% of the time. We have 3 nodes, so I suspect there might be a correlation.

Expected Behavior

Creating an API Definition CRD should succeed

Current Behavior

The webhooks are intermittently failing

Steps to Reproduce

On AWS EKS, installed using the following script:

# add helm repos
helm repo add tyk-helm https://helm.tyk.io/public/helm/charts/ || echo "tyk repo already added"
helm repo add bitnami https://charts.bitnami.com/bitnami || echo "bitnami repo already added"
helm repo add jetstack https://charts.jetstack.io
helm repo update

# install redis (used by tyk)
helm install tyk-redis bitnami/redis
# --decode works with both GNU and macOS base64 (-D is macOS-only)
REDIS_PASSWORD=$(kubectl get secret tyk-redis -o jsonpath='{.data.redis-password}' | base64 --decode)

# install cert-manager (required by tyk-operator)
# kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.4.0/cert-manager.yaml
helm install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version v1.7.0 --set installCRDs=true --wait

# install tyk community edition (headless)
helm install tyk-ce tyk-helm/tyk-headless --set redis.addrs=tyk-redis-master:6379 --set redis.pass=$REDIS_PASSWORD --set gateway.tls=true

# install the CRD for the tyk-operator
kubectl apply -f https://raw.githubusercontent.com/TykTechnologies/tyk-operator/master/helm/crds/crds.yaml

# configuration for the tyk-operator is stored in a secret
kubectl create secret generic tyk-operator-conf --from-literal=TYK_AUTH=CHANGEME --from-literal=TYK_MODE=ce --from-literal=TYK_URL=https://gateway-svc-tyk-ce-tyk-headless:443 --from-literal=TYK_TLS_INSECURE_SKIP_VERIFY=true

# install the tyk operator
helm install tyk-operator tyk-helm/tyk-operator

Your Environment

AWS EKS version v1.21.5-eks-bc4871b cert-manager-v1.7.0

buraksekili commented 2 years ago

Thank you @blagerweij for raising this! We are investigating this issue right now.

As a temporary workaround, you can disable the webhooks if that is feasible on your side.

asoorm commented 2 years ago

Hi @blagerweij, could you try running tyk-ce as a Deployment rather than a DaemonSet, and then ensure that the gateway Deployment is scaled to 1? The reason is that our open source gateway offering currently supports only a single Gateway.

For scaling gateways & HA, we would recommend a paid license, as the Tyk Dashboard control plane is the component which is used to orchestrate APIs across one or more gateway clusters.

Let me know if this solves your problem, or if we need to keep digging into EKS.

blagerweij commented 2 years ago

We were able to track down the root cause of this issue: in addition to the tyk-operator, we also have a few other controllers running in the same namespace. Because the service for the webhooks selects on a generic label (which two other controllers also carry), requests to the validating and mutating webhooks reach not only the tyk operator but also the other two controllers. These other controllers don't expose the https target port, so the webhooks fail.

  labels:
    control-plane: controller-manager
    pod-template-hash: 586948b668

And for the service:

  selector:
    control-plane: controller-manager

We're going to try to run tyk in a separate isolated namespace, to see if that will resolve the issue.
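
One way to confirm this kind of selector overlap is to compare the pods that carry the label with the addresses behind the webhook service. The sketch below assumes the `default` namespace (taken from the webhook URL in the error message above) unless another one is passed; the two function names are hypothetical helpers, not part of tyk-operator.

```shell
#!/bin/sh
# Hypothetical helpers for checking a Service selector overlap.
# Assumption: the operator runs in "default" unless a namespace is given as $1.

# List every pod that carries the generic kubebuilder label.
list_selector_matches() {
  kubectl get pods -n "${1:-default}" -l control-plane=controller-manager -o wide
}

# Show the pod IPs that currently back the webhook Service.
show_webhook_endpoints() {
  kubectl get endpoints tyk-operator-webhook-service -n "${1:-default}" -o wide
}
```

If `list_selector_matches` returns more than one pod and `show_webhook_endpoints` lists all of their IPs, webhook calls are being spread across pods that do not listen on port 9443. Assuming one replica per controller, only one of the three matching pods is the tyk operator, so roughly one call in three would succeed, which is in line with the success rate of about 30% reported earlier in this thread.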

caroltyk commented 2 years ago

Hi @blagerweij, thank you for the update. I'm closing this ticket since it no longer appears to be an issue. Please let us know if that is not the case.

Cheers.

blagerweij commented 2 years ago

Hi @caroltyk, are there any plans to improve the tyk-operator with regard to the selector? Currently the service selector matches any pod with the label 'control-plane: controller-manager'. Any project built with kubebuilder will have that label, so it would be nice to add a tyk-specific label, so that the webhooks work even when the tyk-operator is deployed in the same namespace as another operator. IMHO that would be relatively easy to add, no?

caroltyk commented 2 years ago

Hi @blagerweij, that makes sense. Thanks for the suggestion. I'll take it back to the team.