digitalocean / DOKS

Managed Kubernetes designed for simple and cost effective container orchestration.
https://www.digitalocean.com/products/kubernetes/
Apache License 2.0

cert-manager and metrics-server broken in 1.16 upgrade? #18

Closed jmreicha closed 4 years ago

jmreicha commented 5 years ago

After upgrading from 1.15.x to 1.16.0, custom APIs appear to be broken. For example, running kubectl get apiservice shows these APIs as unavailable.

NAME                                   SERVICE                             AVAILABLE                      AGE
...
v1beta1.metrics.k8s.io                 kube-system/metrics-server          False (FailedDiscoveryCheck)   27s
...
v1beta1.webhook.cert-manager.io        cert-manager/cert-manager-webhook   False (FailedDiscoveryCheck)   9h
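
To pick the failing entries out of a long listing, a small filter can help. This is only a sketch: the function name unavailable_apiservices is made up for illustration, and it assumes the default column layout shown above (NAME, SERVICE, AVAILABLE, AGE).

```shell
# Hypothetical helper: read `kubectl get apiservice` output on stdin and
# print the NAME of every row whose AVAILABLE column starts with "False".
unavailable_apiservices() {
  # NR > 1 skips the header row; $3 is the AVAILABLE column.
  awk 'NR > 1 && $3 ~ /^False/ { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get apiservice | unavailable_apiservices
```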

Checking these APIs reveals more info.

kubectl describe apiservice v1beta1.webhook.cert-manager.io
Name:         v1beta1.webhook.cert-manager.io
Namespace:
Labels:       app=webhook
              app.kubernetes.io/instance=cert-manager
              app.kubernetes.io/managed-by=Tiller
              app.kubernetes.io/name=webhook
              helm.sh/chart=cert-manager-v0.11.0
Annotations:  cert-manager.io/inject-ca-from-secret: cert-manager/cert-manager-webhook-tls
              kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"apiregistration.k8s.io/v1beta1","kind":"APIService","metadata":{"annotations":{"cert-manager.io/inject-ca-from-secret":"cer...
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2019-11-17T17:21:51Z
  Resource Version:    11640563
  Self Link:           /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.webhook.cert-manager.io
  UID:                 f93ef948-c0bc-42de-a2b7-9ae3248d15e2
Spec:
  Ca Bundle:               <certificate>
  Group:                   webhook.cert-manager.io
  Group Priority Minimum:  1000
  Service:
    Name:            cert-manager-webhook
    Namespace:       cert-manager
    Port:            443
  Version:           v1beta1
  Version Priority:  15
Status:
  Conditions:
    Last Transition Time:  2019-11-17T17:21:51Z
    Message:               failing or missing response from https://10.245.173.138:443/apis/webhook.cert-manager.io/v1beta1: Get https://10.245.173.138:443/apis/webhook.cert-manager.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Events:                    <none>

kubectl describe apiservice v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"apiregistration.k8s.io/v1beta1","kind":"APIService","metadata":{"annotations":{},"name":"v1beta1.metrics.k8s.io"},"spec":{"...
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2019-11-18T02:32:40Z
  Resource Version:    11674638
  Self Link:           /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.metrics.k8s.io
  UID:                 9fe336bf-f7a9-40bd-b6fe-ca3616bc28d3
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            metrics-server
    Namespace:       kube-system
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2019-11-18T02:32:40Z
    Message:               failing or missing response from https://10.245.47.213:443/apis/metrics.k8s.io/v1beta1: Get https://10.245.47.213:443/apis/metrics.k8s.io/v1beta1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    Reason:                FailedDiscoveryCheck
    Status:                False
    Type:                  Available
Events:                    <none>
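
The part of this output that matters is the message on the Available condition. A shorter way to fetch just that field might look like the following; the helper name apiservice_message is hypothetical, and the JSONPath expression assumes the condition type is spelled "Available" as in the output above.

```shell
# Hypothetical helper: print only the message of the "Available" condition
# for one APIService.
# $1 = APIService name, e.g. v1beta1.metrics.k8s.io
apiservice_message() {
  kubectl get apiservice "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
}

# Usage (needs cluster access):
#   apiservice_message v1beta1.metrics.k8s.io
```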

The services above exist in the cluster, so I'm not sure what is happening. Any thoughts or ideas on how to fix this?

timoreimann commented 5 years ago

I'd try some/all of the following for one of the applications:

  1. Check the logs of the application pod(s)
  2. Check which endpoint(s) the service maps to
  3. Check if you can reach the endpoint(s) via kubectl port-forward
  4. Check if you can reach the endpoint(s) from a running pod

Also feel free to submit a support ticket so that we can take a closer look.
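
The four checks above could be sketched roughly as the shell helpers below. All function names are hypothetical, local port 8443 is an arbitrary choice, and the curlimages/curl image is an assumption (any image that ships curl should do).

```shell
# Hypothetical helpers. NS = namespace, SVC = Service name,
# SEL = label selector of the pods behind the Service.

# 1. Logs of the application pod(s).  Args: NS SEL
app_logs() { kubectl logs -n "$1" -l "$2" --tail=50; }

# 2. Endpoint(s) the Service maps to.  Args: NS SVC
svc_endpoints() { kubectl get endpoints -n "$1" "$2"; }

# 3. Forward local port 8443 to the Service's port 443.  Args: NS SVC
#    This blocks until interrupted; from another terminal run:
#      curl -k https://localhost:8443/
svc_forward() { kubectl port-forward -n "$1" "svc/$2" 8443:443; }

# 4. Request the Service from a throwaway pod in the cluster.  Args: NS SVC
#    The pod is created in the current namespace and removed afterwards.
svc_curl() {
  kubectl run "curl-test-$$" --rm -i --restart=Never \
    --image=curlimages/curl --command -- curl -sk "https://$2.$1/"
}

# Usage (needs cluster access):
#   app_logs kube-system k8s-app=metrics-server
#   svc_endpoints cert-manager cert-manager-webhook
```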

jmreicha commented 5 years ago

1). Nothing conclusive in the cert-manager logs; they only show what look like timeout/connection errors.

E1118 16:33:21.799398       1 controller.go:131] cert-manager/controller/ingress-shim "msg"="re-queuing item  due to error processing" "error"="Internal error occurred: failed calling webhook \"webhook.cert-manager.io\": the server is currently unable to handle the request" "key"="default/test"

2).

metrics-server

Name:              metrics-server
Namespace:         kube-system
Labels:            kubernetes.io/cluster-service=true
                   kubernetes.io/name=Metrics-server
Annotations:       kubectl.kubernetes.io/last-applied-configuration:
                     {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"kubernetes.io/cluster-service":"true","kubernetes.io/name":"Me...
Selector:          k8s-app=metrics-server
Type:              ClusterIP
IP:                10.245.47.213
Port:              <unset>  443/TCP
TargetPort:        main-port/TCP
Endpoints:         10.244.2.116:4443
Session Affinity:  None
Events:            <none>

cert-manager

Name:              cert-manager-webhook
Namespace:         cert-manager
Labels:            app=webhook
                   app.kubernetes.io/instance=cert-manager
                   app.kubernetes.io/managed-by=Tiller
                   app.kubernetes.io/name=webhook
                   helm.sh/chart=cert-manager-v0.11.0
Annotations:       kubectl.kubernetes.io/last-applied-configuration:
                     {"apiVersion":"v1","kind":"Service","metadata":{"annotations":{},"labels":{"app":"webhook","app.kubernetes.io/instance":"cert-manager","ap...
Selector:          app.kubernetes.io/instance=cert-manager,app.kubernetes.io/managed-by=Tiller,app.kubernetes.io/name=webhook,app=webhook
Type:              ClusterIP
IP:                10.245.173.138
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.244.3.27:6443
Session Affinity:  None
Events:            <none>

3 and 4). The endpoints do respond, but I can't get a successful connection: every request returns a 403. Below is the response from inside the cluster.

curl -k https://cert-manager-webhook.cert-manager
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {

  },
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
  "reason": "Forbidden",
  "details": {

  },
  "code": 403
}

What port is the API server configured to listen on? I wonder if the port mappings are incorrect or if there is something blocking requests to the API server?

jmreicha commented 5 years ago

Adding hostNetwork: true to the deployment spec "fixes" the issue but I'm not really sure why this would be needed.
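
For reference, the same change can be applied as a patch instead of editing the manifest. This is a sketch only: the helper name enable_host_network is hypothetical, and the namespace and Deployment name are left as arguments because the exact Deployment is not named here.

```shell
# Hypothetical helper: set hostNetwork: true on a Deployment's pod template.
# $1 = namespace, $2 = Deployment name (both supplied by the caller).
enable_host_network() {
  kubectl patch deployment -n "$1" "$2" --type merge \
    -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'
}

# Usage (needs cluster access; this restarts the Deployment's pods):
#   enable_host_network <namespace> <deployment>
```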

The firewall rules in the DO console seem to indicate that the 10.0.0.0/8 network should be allowed, so I would have guessed that would include things in the Kubernetes cluster?

timoreimann commented 4 years ago

Hey @jmreicha. Sorry, this one fell off my radar.

Is this still an issue for you? If so, I'd suggest filing a DO support ticket. That should kick off a process that is better suited to addressing customer support requests in a reliable, short-term manner.

jmreicha commented 4 years ago

@timoreimann 👋

Still an issue. I already have a support ticket open; they said it was better to use this issue 😄

timoreimann commented 4 years ago

@jmreicha did you manage to resolve the issue with our support, or am I misreading our internal communication?

jmreicha commented 4 years ago

@timoreimann Yep we got it sorted.

Just a note in case anybody else comes across this issue: changing the cert-manager validating and mutating webhooks to failurePolicy: Ignore and restarting the control plane seems to fix it.
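
A rough sketch of the failurePolicy change as a patch follows. The helper name set_failure_policy_ignore is hypothetical, the object name is left as an argument because it depends on the installed chart, and only the first webhook entry (index 0) is patched; configurations with more entries need one patch per index.

```shell
# Hypothetical helper: set failurePolicy to Ignore on the first webhook
# entry of a webhook configuration.
# $1 = validatingwebhookconfiguration or mutatingwebhookconfiguration
# $2 = object name (list candidates with: kubectl get "$1")
set_failure_policy_ignore() {
  # The JSON Patch "add" op also overwrites an existing value.
  kubectl patch "$1" "$2" --type json \
    -p '[{"op":"add","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
}

# Usage (needs cluster access):
#   set_failure_policy_ignore validatingwebhookconfiguration <name>
#   set_failure_policy_ignore mutatingwebhookconfiguration <name>
```

Keep in mind that with failurePolicy: Ignore, requests are admitted without being checked whenever the webhook cannot be reached.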

timoreimann commented 4 years ago

Thanks for the note explaining how you got this fixed, appreciated.