GoogleCloudPlatform / flink-on-k8s-operator

[DEPRECATED] Kubernetes operator for managing the lifecycle of Apache Flink and Beam applications.

Connection refused applying FlinkCluster CR - yet pod is reported as ready and running #315

Open a-roberts opened 4 years ago

a-roberts commented 4 years ago

Hey all, we've been using the operator for a while and we've noticed some flakiness when applying the FlinkCluster CR. Specifically we see the following, and we're wondering whether it's because the operator pod isn't actually ready yet (i.e. the self-signed cert hasn't been created and made available in time).

(base) Adams-MBP:abp-flink-operator aroberts$ k apply -f network-policy-cluster-flink-1.11.yaml
networkpolicy.networking.k8s.io/network-job unchanged
networkpolicy.networking.k8s.io/network-policy-taskmanager unchanged
networkpolicy.networking.k8s.io/network-policy-jobmanager unchanged
Error from server (InternalError): error when creating "network-policy-cluster-flink-1.11.yaml": Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post https://flink-operator-webhook-service.abp.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=30s: dial tcp 10.254.20.21:443: connect: connection refused

(base) Adams-MBP:abp-flink-operator aroberts$ k apply -f network-policy-cluster-flink-1.11.yaml
networkpolicy.networking.k8s.io/network-job unchanged
networkpolicy.networking.k8s.io/network-policy-taskmanager unchanged
networkpolicy.networking.k8s.io/network-policy-jobmanager unchanged
Error from server (InternalError): error when creating "network-policy-cluster-flink-1.11.yaml": Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post https://flink-operator-webhook-service.abp.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=30s: dial tcp 10.254.20.21:443: connect: connection refused

(base) Adams-MBP:abp-flink-operator aroberts$ k apply -f network-policy-cluster-flink-1.11.yaml
networkpolicy.networking.k8s.io/network-job unchanged
networkpolicy.networking.k8s.io/network-policy-taskmanager unchanged
networkpolicy.networking.k8s.io/network-policy-jobmanager unchanged
Error from server (InternalError): error when creating "network-policy-cluster-flink-1.11.yaml": Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post https://flink-operator-webhook-service.abp.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=30s: dial tcp 10.254.20.21:443: connect: connection refused

(base) Adams-MBP:abp-flink-operator aroberts$ k apply -f network-policy-cluster-flink-1.11.yaml
networkpolicy.networking.k8s.io/network-job unchanged
networkpolicy.networking.k8s.io/network-policy-taskmanager unchanged
networkpolicy.networking.k8s.io/network-policy-jobmanager unchanged
Error from server (InternalError): error when creating "network-policy-cluster-flink-1.11.yaml": Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post https://flink-operator-webhook-service.abp.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=30s: net/http: TLS handshake timeout

If we then wait a while (around a minute; I also did another make deploy at this stage):

(base) Adams-MBP:abp-flink-operator aroberts$ k apply -f network-policy-cluster-flink-1.11.yaml 
flinkcluster.flinkoperator.k8s.io/tls-flink-cluster-1-11 created
networkpolicy.networking.k8s.io/network-job unchanged
networkpolicy.networking.k8s.io/network-policy-taskmanager unchanged
networkpolicy.networking.k8s.io/network-policy-jobmanager unchanged

One idea would be to enhance the existing readiness mechanism we have. I'm wondering if anyone else has experienced this and what a solution might be.
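In the meantime, a rough workaround sketch (not a fix; it assumes the operator Deployment is named flink-operator-controller-manager and is deployed into the abp namespace, which may differ in your setup) is to wait for the operator to report Available and then retry the apply:

# wait for the operator Deployment to become Available before applying the CR
kubectl -n abp wait --for=condition=Available deployment/flink-operator-controller-manager --timeout=120s

# the webhook endpoint can still lag slightly behind pod readiness, so retry the apply a few times
for i in 1 2 3 4 5; do
  kubectl apply -f network-policy-cluster-flink-1.11.yaml && break
  sleep 10
done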

Thanks!

@enriquel8

a-roberts commented 4 years ago

I've tried adding a readiness probe in Manager.yaml, but I'm still seeing this happen:

apiVersion: v1
kind: Namespace
metadata:
  labels:
    control-plane: controller-manager
  name: system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
  labels:
    control-plane: controller-manager
    app: flink-operator
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
      app: flink-operator
  replicas: 1
  template:
    metadata:
      labels:
        control-plane: controller-manager
        app: flink-operator
    spec:
      containers:
      - name: flink-operator
        readinessProbe:
          failureThreshold: 30
          httpGet:
            path: /mutate-flinkoperator-k8s-io-v1beta1-flinkcluster
            port: webhook-server
            scheme: HTTPS
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 30
        image: flink-operator:latest
        command:
        - /flink-operator
        args:
        - --enable-leader-election
        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 20Mi
      terminationGracePeriodSeconds: 10

Still no good.
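A rough way to check whether that probe is actually passing (assuming the operator is deployed into the abp namespace, as in the webhook config below, and carries the app: flink-operator label from the manifest above):

# is the operator pod reported Ready with the probe in place?
kubectl -n abp get pods -l app=flink-operator
# any readiness probe failures in the pod events?
kubectl -n abp describe pods -l app=flink-operator | grep -i -A3 readiness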

The MutatingWebhookConfiguration has:

      service:
        name: flink-operator-webhook-service
        namespace: abp
        path: /mutate-flinkoperator-k8s-io-v1beta1-flinkcluster
        port: 443

which is where my path comes from. The caBundle in there matches what's in the webhook-server-cert secret.
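Roughly how that comparison can be done (the webhook configuration name has to be looked up first, and the secret key may be ca.crt or tls.crt depending on how the cert was generated):

# list the mutating webhook configurations to get the exact name
kubectl get mutatingwebhookconfigurations
# caBundle registered for the webhook (substitute the name from the listing above)
kubectl get mutatingwebhookconfiguration <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | md5sum
# CA held in the webhook cert secret, for comparison
kubectl -n abp get secret webhook-server-cert -o jsonpath='{.data.ca\.crt}' | md5sum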

If I exec into another pod in the cluster with curl and use the service IP:

bash-4.4$ curl -k https://172.30.28.117:443
404 page not found
bash-4.4$ curl -k https://172.30.28.117:443
404 page not found
bash-4.4$ curl -k https://172.30.28.117:443
404 page not found
bash-4.4$ curl -k https://172.30.28.117:443
404 page not found
bash-4.4$ 
bash-4.4$ curl -k https://172.30.28.117:443
curl: (7) Failed to connect to 172.30.28.117 port 443: Connection refused
bash-4.4$ 

That service IP comes from:

flink-operator-webhook-service                                ClusterIP   172.30.28.117    <none>        443/TCP                               16d

And with -vvv:

bash-4.4$ curl -k -vvv https://172.30.28.117:443
* Rebuilt URL to: https://172.30.28.117:443/
*   Trying 172.30.28.117...
* TCP_NODELAY set
* connect to 172.30.28.117 port 443 failed: Connection refused
* Failed to connect to 172.30.28.117 port 443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 172.30.28.117 port 443: Connection refused

bash-4.4$ curl -k -vvv https://172.30.28.117:443
* Rebuilt URL to: https://172.30.28.117:443/
*   Trying 172.30.28.117...
* TCP_NODELAY set
* Connected to 172.30.28.117 (172.30.28.117) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=flink-operator-webhook-service.abp.svc
*  start date: Sep  3 13:31:23 2020 GMT
*  expire date: Sep 25 03:08:10 2020 GMT
*  issuer: CN=kube-csr-signer_@1598411290
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* Using Stream ID: 1 (easy handle 0x556f430f9720)
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET / HTTP/2
> Host: 172.30.28.117
> User-Agent: curl/7.61.1
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/2 404 
< content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
< content-length: 19
< date: Thu, 03 Sep 2020 14:42:16 GMT
< 
* TLSv1.3 (IN), TLS app data, [no content] (0):
404 page not found
* Connection #0 to host 172.30.28.117 left intact

Why does it decide to just not use the cert? Or does whatever's on port 443 decide it doesn't want to talk TLS?

Curling from the pod itself (https://localhost:443) always works; it's curling from another pod that's the problem.
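In case it helps with debugging, the next things worth looking at (the service and namespace names are the ones from this thread):

# does the webhook Service have a ready endpoint behind it at the moment the refusals happen?
kubectl -n abp get endpoints flink-operator-webhook-service
# does the Service targetPort match the port the webhook server actually listens on inside the pod?
kubectl -n abp get service flink-operator-webhook-service -o yaml | grep -A4 'ports:'

(If I understand it correctly, the 404 over a completed TLS handshake at least means something is up and serving TLS; the webhook server only registers the /mutate-... path and expects POSTed admission requests, so a GET to / returning 404 is probably expected. The intermittent connection refused is the part that looks wrong.)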

codeOnFood commented 2 years ago

Any updates on this?