Closed col closed 1 year ago
It's probably also worth noting that when I configure the sidecar manually into the deployment it's working as expected.
...
- name: cloud-sql-proxy
# See: https://github.com/GoogleCloudPlatform/cloudsql-proxy
image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.6.0
args:
- "--private-ip"
- "--run-connection-test"
- "--auto-iam-authn"
- "--port=5432"
- {{ .Values.cloudsql.connectionString }}
securityContext:
runAsNonRoot: true
...
Here's the logs from the cloud-sql-proxy-operator-controller-manager
pod
2023-08-08T23:57:44Z INFO Starting Cloud SQL Proxy Operator
2023-08-08T23:57:44Z INFO Version: 1.0.2 Build: 6609bc66b4505c870387b7eb957651e2dc2fd063
2023-08-08T23:57:44Z INFO Runtime: go1.20.6 linux/amd64
2023-08-08T23:57:44Z INFO controller-runtime.metrics Metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2023-08-08T23:57:44Z INFO setup Configuring reconcilers...
2023-08-08T23:57:44Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "cloudsql.cloud.google.com/v1, Kind=AuthProxyWorkload", "path": "/mutate-cloudsql-cloud-google-com-v1-authproxyworkload"}
2023-08-08T23:57:44Z INFO controller-runtime.webhook Registering webhook {"path": "/mutate-cloudsql-cloud-google-com-v1-authproxyworkload"}
2023-08-08T23:57:44Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "cloudsql.cloud.google.com/v1, Kind=AuthProxyWorkload", "path": "/validate-cloudsql-cloud-google-com-v1-authproxyworkload"}
2023-08-08T23:57:44Z INFO controller-runtime.webhook Registering webhook {"path": "/validate-cloudsql-cloud-google-com-v1-authproxyworkload"}
2023-08-08T23:57:44Z INFO controller-runtime.webhook Registering webhook {"path": "/mutate-pods"}
2023-08-08T23:57:44Z INFO setup Configuring reconcilers complete.
2023-08-08T23:57:44Z INFO setup starting manager
2023-08-08T23:57:44Z INFO controller-runtime.webhook.webhooks Starting webhook server
2023-08-08T23:57:44Z INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
2023-08-08T23:57:44Z INFO starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0808 23:57:44.476431 1 leaderelection.go:245] attempting to acquire leader lease cloud-sql-proxy-operator-system/76941ffa.cloud.google.com...
2023-08-08T23:57:44Z INFO controller-runtime.certwatcher Updated current TLS certificate
2023-08-08T23:57:44Z INFO controller-runtime.webhook Serving webhook server {"host": "", "port": 9443}
2023-08-08T23:57:44Z INFO controller-runtime.certwatcher Starting certificate watcher
I0808 23:57:44.499655 1 leaderelection.go:255] successfully acquired lease cloud-sql-proxy-operator-system/76941ffa.cloud.google.com
2023-08-08T23:57:44Z DEBUG events cloud-sql-proxy-operator-controller-manager-7dbd494547-8nvcn_84c162e1-d880-445f-97d2-e260e5d467de became leader {"type": "Normal", "object": {"kind":"Lease","namespace":"cloud-sql-proxy-operator-system","name":"76941ffa.cloud.google.com","uid":"e5ecc44d-7467-461f-a156-2327d6398ee2","apiVersion":"coordination.k8s.io/v1","resourceVersion":"417723155"}, "reason": "LeaderElection"}
2023-08-08T23:57:44Z INFO Starting EventSource {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "source": "kind source: *v1.Pod"}
2023-08-08T23:57:44Z INFO Starting EventSource {"controller": "authproxyworkload", "controllerGroup": "cloudsql.cloud.google.com", "controllerKind": "AuthProxyWorkload", "source": "kind source: *v1.AuthProxyWorkload"}
2023-08-08T23:57:44Z INFO Starting Controller {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod"}
2023-08-08T23:57:44Z INFO Starting Controller {"controller": "authproxyworkload", "controllerGroup": "cloudsql.cloud.google.com", "controllerKind": "AuthProxyWorkload"}
2023-08-08T23:57:44Z INFO Starting workers {"controller": "authproxyworkload", "controllerGroup": "cloudsql.cloud.google.com", "controllerKind": "AuthProxyWorkload", "worker count": 1}
2023-08-08T23:57:44Z INFO Starting workers {"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "worker count": 1}
Ok, I'm starting to narrow this down. It looks like an issue with the certificate for the webhook service.
When I try to access the service from the target deployment I get:
~ $ wget https://cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443
Connecting to cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443 (192.168.125.63:443)
488BDC6C357F0000:error:0A000086:SSL routines:tls_post_process_server_certificate:certificate verify failed:ssl/statem/statem_clnt.c:1889:
ssl_client: SSL_connect
wget: error getting response: Connection reset by peer
and with curl I get:
$ curl https://cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443
curl: (60) SSL certificate problem: self signed certificate
More details here: https://curl.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
verbose version:
$ curl -v https://cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc:443
* Trying 192.168.125.63:443...
* Connected to cloud-sql-proxy-operator-webhook-service.cloud-sql-proxy-operator-system.svc (192.168.125.63) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
* CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS alert, unknown CA (560):
* SSL certificate problem: self signed certificate
* Closing connection 0
curl: (60) SSL certificate problem: self signed certificate
More details here: https://curl.se/docs/sslcerts.html
curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
So is the operator supposed to be using self-signed certs? Am I barking up the wrong tree here?
Sorry for the slow response here @col. I'm a bit surprised to see the https
in there, as the operator isn't provisioning any valid certs as far as I know.
The primary maintainer @hessjcg is out of office presently and he'll have a better understanding here. Nonetheless, let me see if we're misconfiguring something here.
After digging a bit further it looked to me that using self-signed certs is expected and not actually the problem. I now believe the webhook request is connecting but failing to complete successfully, hence the "context deadline exceeded" error. That's pretty much where my knowledge on this stops though. I have no idea what this webhook actually does or how to debug it further.
Hi @col, I'm the primary maintainer. Thanks for reporting this. You are helping to make the operator easier to use.
Here are a few things we can rule out:
/mutate-cloudsql-cloud-google-com-v1-authproxyworkload
webhook does not have the ability to cause a request to timeout. It simply sets default values and returns. cloud-sql-proxy-operator-webhook-service
service. (You're correct, self-signed certs are normal.)It looks like the problem happens somewhere in the k8s networking configuration. Maybe the k8s manager can't resolve the DNS of the webhook. Maybe there are network security policies that prevent k8s manager from establishing a connection to the webhook like in this old cert manager issue.
Thanks for the response @hessjcg
We're not using Calico or any other network security policy system.
We're also not using any service mesh like Istio, Kong or linkerd etc.
I'll have a read through the cert manager issue and see if anything correlates to my issue
Quick update - I was not able to make any progress on this issue and have since given up. We're now relying on some manually configured CloudSQL Proxy sidecars in some cases and direct connections with certs in other cases.
Hi Col.
I'm sorry to hear that you didn't find a solution. I'd like to try to get to the bottom of this in case other users are also having the same problem. I'm wondering if there are other possible reasons why the kubernetes API server would time out attempting to call the operator's webhook service.
Does your GKE cluster run on a private VPC? Are there multiple node pools on this cluster? Are there additional firewall rules or security constraints on the VMs in the GKE node pool?
Thanks, -Jonathan
It's unfortunate that we haven't been able to pinpoint the exact cause of the deployment issue @col had. However, it's worth considering the possibility that it could be related to firewall configurations within the GKE private cluster.
By default, firewall rules restrict the cluster control plane to only initiate TCP connections to Nodes and Pods on ports 443 (HTTPS) and 10250 (Kubelet) as described here. The webhook service uses port 443, but the api-server talks directly to the pods, which listen on port 9443 for the webhook. So, you will need to either add a firewall rule that allows master nodes access to port 9443/TCP on worker nodes or change the existing rule that allows access to port 443/TCP and 10250/TCP to also allow access to port 9443/TCP.
I recommend adding a warning to the Installation step that includes a link to the dedicated GKE page, guiding users on how to create the necessary firewall rules for the GKE private clusters @hessjcg
Thanks @serdarkkts. This is super helpful. We'll add something to the README.
I'm going to mark this as closed as the firewall point above seems like a compelling explanation. Feel free to re-open if we've broken something here. Meanwhile, I filed #427 so people don't trip over this in the future.
I'm getting an error when trying to create an
AuthProxyWorkload
viakubectl apply -f authproxyworkload.yaml
authproxyworkload.yaml
contentExpected Behaviour
The
AuthProxyWorkload
resource to be created, and for a container to be added to the deployment which is running the cloud-sql-proxy.Actual Behaviour
Error:
I'm a bit stuck at this point as I don't know how to debug this any further.
Steps to Reproduce the Problem
cloud-sql-proxy-operator-controller-manager
is runningCloud SQL Client
role to my Google Service Accountiam.workloadIdentityUser
for my K8s service accountauthproxyworkload.yaml
file (see above)Specifications