Open a-roberts opened 4 years ago
Have tried adding a readiness probe in Manager.yaml but still seeing this happening:
apiVersion: v1
kind: Namespace
metadata:
  labels:
    control-plane: controller-manager
  name: system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
  labels:
    control-plane: controller-manager
    app: flink-operator
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
      app: flink-operator
  replicas: 1
  template:
    metadata:
      labels:
        control-plane: controller-manager
        app: flink-operator
    spec:
      containers:
      - name: flink-operator
        readinessProbe:
          failureThreshold: 30
          httpGet:
            path: /mutate-flinkoperator-k8s-io-v1beta1-flinkcluster
            port: webhook-server
            scheme: HTTPS
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 30
        image: flink-operator:latest
        command:
        - /flink-operator
        args:
        - --enable-leader-election
        resources:
          limits:
            cpu: 100m
            memory: 30Mi
          requests:
            cpu: 100m
            memory: 20Mi
      terminationGracePeriodSeconds: 10
Still no good.
The MutatingWebhookConfiguration has:
service:
  name: flink-operator-webhook-service
  namespace: abp
  path: /mutate-flinkoperator-k8s-io-v1beta1-flinkcluster
  port: 443
which is where my probe path comes from. The caBundle in there matches what is in the webhook-server-cert secret.
If I exec into another pod in the cluster that has curl, and use the service IP:
bash-4.4$ curl -k https://172.30.28.117:443
404 page not found
bash-4.4$ curl -k https://172.30.28.117:443
404 page not found
bash-4.4$ curl -k https://172.30.28.117:443
404 page not found
bash-4.4$ curl -k https://172.30.28.117:443
404 page not found
bash-4.4$
bash-4.4$ curl -k https://172.30.28.117:443
curl: (7) Failed to connect to 172.30.28.117 port 443: Connection refused
bash-4.4$
The IP is from this service:
flink-operator-webhook-service ClusterIP 172.30.28.117 <none> 443/TCP 16d
With -vvv:
bash-4.4$ curl -k -vvv https://172.30.28.117:443
* Rebuilt URL to: https://172.30.28.117:443/
* Trying 172.30.28.117...
* TCP_NODELAY set
* connect to 172.30.28.117 port 443 failed: Connection refused
* Failed to connect to 172.30.28.117 port 443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 172.30.28.117 port 443: Connection refused
bash-4.4$ curl -k -vvv https://172.30.28.117:443
* Rebuilt URL to: https://172.30.28.117:443/
* Trying 172.30.28.117...
* TCP_NODELAY set
* Connected to 172.30.28.117 (172.30.28.117) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
* subject: CN=flink-operator-webhook-service.abp.svc
* start date: Sep 3 13:31:23 2020 GMT
* expire date: Sep 25 03:08:10 2020 GMT
* issuer: CN=kube-csr-signer_@1598411290
* SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* Using Stream ID: 1 (easy handle 0x556f430f9720)
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> GET / HTTP/2
> Host: 172.30.28.117
> User-Agent: curl/7.61.1
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
* TLSv1.3 (OUT), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/2 404
< content-type: text/plain; charset=utf-8
< x-content-type-options: nosniff
< content-length: 19
< date: Thu, 03 Sep 2020 14:42:16 GMT
<
* TLSv1.3 (IN), TLS app data, [no content] (0):
404 page not found
* Connection #0 to host 172.30.28.117 left intact
Why does it sometimes decide not to use the cert? Or does whatever is on port 443 sometimes decide it doesn't want to talk TLS?
Curling from the operator pod itself (https://localhost:443) always works; the problem only shows up from another pod.
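To put a number on how often the connection is refused, something like the following could be run from the other pod. This is only a sketch: probe_n is a hypothetical helper name, not part of the operator, and the curl line in the comment reuses the service IP from the output above. It relies on curl exiting 0 for the 404 response and non-zero (7) for "Connection refused", so only the refused attempts count as failures.

```shell
# probe_n ATTEMPTS CMD...: run CMD ATTEMPTS times, print how many runs
# succeeded and failed, and return 0 only if none of them failed.
# The body is a subshell so the counters do not leak into the caller.
probe_n() (
  attempts=$1
  shift
  ok=0
  fail=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      fail=$((fail + 1))
    fi
    i=$((i + 1))
  done
  echo "ok=${ok} fail=${fail}"
  [ "$fail" -eq 0 ]
)

# Example (assumed invocation, not run here): 20 attempts against the service IP.
# probe_n 20 curl -ks --max-time 2 -o /dev/null https://172.30.28.117:443
```

A non-zero fail count while the localhost check keeps passing would at least confirm that the refusals happen on the path through the service rather than in the webhook server itself.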
Any updates on this?
Hey all, we've been using the operator for a while and we've noticed some flakiness when applying the FlinkCluster CR. Specifically we see this, and we're wondering if perhaps it's due to the pod not actually being ready (and having the self-signed cert created and available in time).
If we then wait a while (around a minute, and I actually did another make deploy at this stage):
One idea would be to enhance the existing readiness mechanism we have - I'm wondering if anyone else has experienced this and what a solution may be.
Thanks!
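Until the readiness mechanism itself is improved, one possible workaround is to gate the apply step on the webhook answering several times in a row. The sketch below is an assumption, not something the operator ships: wait_stable and WAIT_INTERVAL are hypothetical names, the host in the commented example is the CN from the serving certificate shown in this thread, and flinkcluster.yaml stands in for whatever CR file is being applied.

```shell
# wait_stable NEEDED MAX CMD...: run CMD up to MAX times and succeed as soon
# as it has passed NEEDED times in a row; any failure resets the streak.
# WAIT_INTERVAL (seconds between attempts) defaults to 1.
wait_stable() (
  needed=$1
  max=$2
  shift 2
  streak=0
  tries=0
  while [ "$tries" -lt "$max" ]; do
    if "$@" >/dev/null 2>&1; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$needed" ]; then
        exit 0
      fi
    else
      streak=0
    fi
    tries=$((tries + 1))
    sleep "${WAIT_INTERVAL:-1}"
  done
  exit 1
)

# Example (assumed invocation, not run here): require 5 consecutive good
# responses within 60 attempts before applying the CR.
# wait_stable 5 60 curl -ks --max-time 2 -o /dev/null \
#   https://flink-operator-webhook-service.abp.svc:443 \
#   && kubectl apply -f flinkcluster.yaml
```

Requiring a streak rather than a single success matters here because a single good response can be followed immediately by a refused connection, as the curl output earlier in the thread shows.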
@enriquel8