GoogleCloudPlatform / cloud-sql-proxy-operator

A Kubernetes Operator to automatically configure secure connections to Cloud SQL
Apache License 2.0

Error creating AuthProxyWorkload - failed calling webhook - context deadline exceeded #403

Closed: col closed this issue 1 year ago

col commented 1 year ago

I'm getting an error when trying to create an AuthProxyWorkload via kubectl apply -f authproxyworkload.yaml

authproxyworkload.yaml content

apiVersion: cloudsql.cloud.google.com/v1
kind: AuthProxyWorkload
metadata:
  name: my-service-authproxyworkload
spec:
  workloadSelector:
    kind: "Deployment"
    name: "my-service"
  instances:
    - connectionString: "my-project:asia-southeast1:my-sql-instance"
      port: 5432
      hostEnvName: "DATABASE_HOST"
      autoIAMAuthN: true
      privateIP: true

Expected Behaviour

The AuthProxyWorkload resource should be created, and a container running the cloud-sql-proxy should be added to the deployment.

Actual Behaviour

Error:

Error from server (InternalError): error when creating "authproxyworkload.yaml": Internal error occurred: failed calling webhook "mauthproxyworkload.kb.io": failed to call webhook: Post "https://cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443/mutate-cloudsql-cloud-google-com-v1-authproxyworkload?timeout=10s": context deadline exceeded

I'm a bit stuck at this point as I don't know how to debug this any further.

Steps to Reproduce the Problem

  1. Install Cloud SQL Proxy Operator
  2. Confirm cloud-sql-proxy-operator-controller-manager is running
  3. Assign Cloud SQL Client role to my Google Service Account
  4. Create iam.workloadIdentityUser for my K8s service account
  5. Apply the authproxyworkload.yaml file (see above)
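For completeness, steps 3 and 4 can be sketched with gcloud. Every identifier below (project, service account names, namespace) is a placeholder, not a value taken from this issue:

```shell
# Hedged sketch: all identifiers are placeholders.
# Step 3: grant the Cloud SQL Client role to the Google Service Account (GSA).
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-gsa@my-project.iam.gserviceaccount.com" \
    --role="roles/cloudsql.client"

# Step 4: allow the Kubernetes Service Account (KSA) to impersonate the GSA
# via Workload Identity.
gcloud iam service-accounts add-iam-policy-binding \
    my-gsa@my-project.iam.gserviceaccount.com \
    --role="roles/iam.workloadIdentityUser" \
    --member="serviceAccount:my-project.svc.id.goog[my-namespace/my-ksa]"
```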

Specifications

col commented 1 year ago

It's probably also worth noting that when I manually add the sidecar to the deployment, it works as expected.

...
        - name: cloud-sql-proxy
          # See: https://github.com/GoogleCloudPlatform/cloudsql-proxy
          image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.6.0
          args:
            - "--private-ip"
            - "--run-connection-test"
            - "--auto-iam-authn"
            - "--port=5432"
            - {{ .Values.cloudsql.connectionString }}
          securityContext:
            runAsNonRoot: true
...            
col commented 1 year ago

Here are the logs from the cloud-sql-proxy-operator-controller-manager pod

2023-08-08T23:57:44Z    INFO    Starting Cloud SQL Proxy Operator
2023-08-08T23:57:44Z    INFO    Version: 1.0.2 Build: 6609bc66b4505c870387b7eb957651e2dc2fd063
2023-08-08T23:57:44Z    INFO    Runtime: go1.20.6 linux/amd64
2023-08-08T23:57:44Z    INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
2023-08-08T23:57:44Z    INFO    setup   Configuring reconcilers...
2023-08-08T23:57:44Z    INFO    controller-runtime.builder  Registering a mutating webhook  {"GVK": "cloudsql.cloud.google.com/v1, Kind=AuthProxyWorkload", "path": "/mutate-cloudsql-cloud-google-com-v1-authproxyworkload"}
2023-08-08T23:57:44Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/mutate-cloudsql-cloud-google-com-v1-authproxyworkload"}
2023-08-08T23:57:44Z    INFO    controller-runtime.builder  Registering a validating webhook    {"GVK": "cloudsql.cloud.google.com/v1, Kind=AuthProxyWorkload", "path": "/validate-cloudsql-cloud-google-com-v1-authproxyworkload"}
2023-08-08T23:57:44Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/validate-cloudsql-cloud-google-com-v1-authproxyworkload"}
2023-08-08T23:57:44Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/mutate-pods"}
2023-08-08T23:57:44Z    INFO    setup   Configuring reconcilers complete.
2023-08-08T23:57:44Z    INFO    setup   starting manager
2023-08-08T23:57:44Z    INFO    controller-runtime.webhook.webhooks Starting webhook server
2023-08-08T23:57:44Z    INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
2023-08-08T23:57:44Z    INFO    starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0808 23:57:44.476431       1 leaderelection.go:245] attempting to acquire leader lease cloud-sql-proxy-operator-system/76941ffa.cloud.google.com...
2023-08-08T23:57:44Z    INFO    controller-runtime.certwatcher  Updated current TLS certificate
2023-08-08T23:57:44Z    INFO    controller-runtime.webhook  Serving webhook server  {"host": "", "port": 9443}
2023-08-08T23:57:44Z    INFO    controller-runtime.certwatcher  Starting certificate watcher
I0808 23:57:44.499655       1 leaderelection.go:255] successfully acquired lease cloud-sql-proxy-operator-system/76941ffa.cloud.google.com
2023-08-08T23:57:44Z    DEBUG   events  cloud-sql-proxy-operator-controller-manager-7dbd494547-8nvcn_84c162e1-d880-445f-97d2-e260e5d467de became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"cloud-sql-proxy-operator-system","name":"76941ffa.cloud.google.com","uid":"e5ecc44d-7467-461f-a156-2327d6398ee2","apiVersion":"coordination.k8s.io/v1","resourceVersion":"417723155"}, "reason": "LeaderElection"}
2023-08-08T23:57:44Z    INFO    Starting EventSource    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "source": "kind source: *v1.Pod"}
2023-08-08T23:57:44Z    INFO    Starting EventSource    {"controller": "authproxyworkload", "controllerGroup": "cloudsql.cloud.google.com", "controllerKind": "AuthProxyWorkload", "source": "kind source: *v1.AuthProxyWorkload"}
2023-08-08T23:57:44Z    INFO    Starting Controller {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod"}
2023-08-08T23:57:44Z    INFO    Starting Controller {"controller": "authproxyworkload", "controllerGroup": "cloudsql.cloud.google.com", "controllerKind": "AuthProxyWorkload"}
2023-08-08T23:57:44Z    INFO    Starting workers    {"controller": "authproxyworkload", "controllerGroup": "cloudsql.cloud.google.com", "controllerKind": "AuthProxyWorkload", "worker count": 1}
2023-08-08T23:57:44Z    INFO    Starting workers    {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "worker count": 1}
col commented 1 year ago

Ok, I'm starting to narrow this down. It looks like an issue with the certificate for the webhook service.

When I try to access the service from the target deployment I get:

~ $ wget https://cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443
Connecting to cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443 (192.168.125.63:443)
488BDC6C357F0000:error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1889:
ssl_client: SSL_connect
wget: error getting response: Connection reset by peer

and with curl I get:

$ curl https://cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443
curl: (60) SSL certificate problem: self signed certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

verbose version:

$ curl -v https://cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443
*   Trying 192.168.125.63:443...
* Connected to cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc (192.168.125.63) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS alert, unknown CA (560):
* SSL certificate problem: self signed certificate
* Closing connection 0
curl: (60) SSL certificate problem: self signed certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

So is the operator supposed to be using self-signed certs? Am I barking up the wrong tree here?
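(Side note for anyone debugging the same thing: a self-signed cert here isn't necessarily a problem, since the API server doesn't use the system trust store when calling a webhook; it trusts whatever CA is in the webhook configuration's clientConfig.caBundle, which is why curl/wget from a pod fails verification. A rough way to inspect that bundle; the resource name below is an assumption based on a default install, so list first:)

```shell
# Hedged sketch: the exact configuration name may differ; list them first.
kubectl get mutatingwebhookconfigurations

# Decode the CA the API server trusts for this webhook (the name used here
# is an assumption, not confirmed from this cluster):
kubectl get mutatingwebhookconfiguration \
    cloud-sql-proxy-operator-mutating-webhook-configuration \
    -o jsonpath='{.webhooks[0].clientConfig.caBundle}' \
    | base64 -d | openssl x509 -noout -subject -issuer -dates
```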

enocom commented 1 year ago

Sorry for the slow response here @col. I'm a bit surprised to see the https in there, as the operator isn't provisioning any valid certs as far as I know.

The primary maintainer @hessjcg is out of office at the moment and will have a better understanding of this. Meanwhile, let me see if we're misconfiguring something.

col commented 1 year ago

After digging a bit further it looked to me that using self-signed certs is expected and not actually the problem. I now believe the webhook request is connecting but failing to complete successfully, hence the "context deadline exceeded" error. That's pretty much where my knowledge on this stops though. I have no idea what this webhook actually does or how to debug it further.
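(One more data point that might help someone: you can probe the webhook endpoint directly and ignore TLS verification, since the goal is only to test reachability. The URL below is the one from the error message; the AdmissionReview body is a minimal hypothetical payload, not a real request. Run this from a pod in the cluster — but keep in mind the failing caller is the API server on the control plane, which may take a different network path:)

```shell
# Hedged sketch: -k skips TLS verification, which is fine for a
# reachability test; a fast response (even an error) means the network
# path works, while a 10s hang matches "context deadline exceeded".
curl -sk -m 10 -X POST \
  -H "Content-Type: application/json" \
  -d '{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview","request":{"uid":"test"}}' \
  https://cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443/mutate-cloudsql-cloud-google-com-v1-authproxyworkload
```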

hessjcg commented 1 year ago

Hi @col, I'm the primary maintainer. Thanks for reporting this. You are helping to make the operator easier to use.

Here are a few possibilities to rule out:

It looks like the problem happens somewhere in the k8s networking configuration. Maybe the Kubernetes API server can't resolve the DNS name of the webhook service. Maybe there are network security policies that prevent the API server from establishing a connection to the webhook, like in this old cert manager issue.

col commented 1 year ago

Thanks for the response @hessjcg

We're not using Calico or any other network security policy system.

We're also not using any service mesh like Istio, Kong or linkerd etc.

I'll have a read through the cert manager issue and see if anything correlates to my issue

col commented 1 year ago

Quick update - I was not able to make any progress on this issue and have since given up. We're now relying on manually configured Cloud SQL Proxy sidecars in some cases and direct connections with certs in others.

hessjcg commented 1 year ago

Hi Col.

I'm sorry to hear that you didn't find a solution. I'd like to try to get to the bottom of this in case other users are also having the same problem. I'm wondering if there are other possible reasons why the kubernetes API server would time out attempting to call the operator's webhook service.

Does your GKE cluster run on a private VPC? Are there multiple node pools on this cluster? Are there additional firewall rules or security constraints on the VMs in the GKE node pool?

Thanks, -Jonathan

serdarkkts commented 1 year ago

It's unfortunate that we haven't been able to pinpoint the exact cause of the deployment issue @col had. However, it's worth considering the possibility that it could be related to firewall configurations within the GKE private cluster.

By default, firewall rules restrict the cluster control plane to initiating TCP connections to Nodes and Pods only on ports 443 (HTTPS) and 10250 (kubelet), as described here. The webhook Service uses port 443, but the api-server talks directly to the pods, which listen on port 9443 for the webhook. So you will need to either add a firewall rule that allows the control plane to reach port 9443/TCP on the worker nodes, or change the existing rule that allows access to ports 443/TCP and 10250/TCP so that it also allows port 9443/TCP.
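As a hedged sketch of that fix (the rule name, network, control plane CIDR, and node tag below are placeholders, not values from this thread; the real ones can be read off the auto-created master rule):

```shell
# Find the auto-created control-plane rule (usually gke-<cluster>-<hash>-master)
# to get the real source range and target tags for your cluster.
gcloud compute firewall-rules list \
    --filter="name~gke-.*-master" \
    --format="table(name,network,sourceRanges.list())"

# Allow the control plane to reach the webhook port on the nodes.
# All values below are placeholders.
gcloud compute firewall-rules create allow-apiserver-to-webhook-9443 \
    --network=my-vpc \
    --allow=tcp:9443 \
    --source-ranges=172.16.0.0/28 \
    --target-tags=gke-my-cluster-node
```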

I recommend adding a warning to the Installation step that links to the dedicated GKE page, guiding users on how to create the necessary firewall rules for GKE private clusters, @hessjcg

enocom commented 1 year ago

Thanks @serdarkkts. This is super helpful. We'll add something to the README.

enocom commented 1 year ago

I'm going to mark this as closed as the firewall point above seems like a compelling explanation. Feel free to re-open if we've broken something here. Meanwhile, I filed #427 so people don't trip over this in the future.