Open jpbetz opened 4 years ago
@matte21 Can you provide any details about what certs etcd-operator generated in this case? Do you know what dns names were in the Subject Alternative Names?
The etcd cluster members used static TLS:
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: the-etcd-cluster
  namespace: example-com
spec:
  size: 3
  version: "3.3.11"
  TLS:
    static:
      member:
        peerSecret: etcd-peer
        serverSecret: etcd-server
      operatorSecret: etcd-client
...
The certificates and the secrets carrying them were created by a deployment script. The openssl cnf files for the three certs follow; a sketch of how they might have been assembled into the secrets is included after the Client section.
Peer:
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = *.the-etcd-cluster.example-com.svc
DNS.2 = *.the-etcd-cluster.example-com.svc.cluster.local
Server:
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = *.the-etcd-cluster.example-com.svc
DNS.2 = the-etcd-cluster-client.example-com.svc
DNS.3 = the-etcd-cluster-client
DNS.4 = localhost
Client:
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth, serverAuth
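For context, a minimal sketch of how a deployment script might turn these cnf files into the secrets referenced by the EtcdCluster spec. The CA files, output file names, and secret data keys are assumptions (the key names follow what etcd-operator's static TLS documentation describes), not copied from the actual script:

# Assumes an existing CA (ca.crt/ca.key) and the cnf files above saved as peer.cnf, server.cnf, client.cnf.
openssl req -new -newkey rsa:2048 -nodes -subj "/CN=the-etcd-cluster" \
  -config peer.cnf -keyout peer.key -out peer.csr
openssl x509 -req -in peer.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -extensions v3_req -extfile peer.cnf -days 365 -out peer.crt
# (repeat the same two commands with server.cnf -> server.crt/server.key
#  and client.cnf -> etcd-client.crt/etcd-client.key)

# Secret data keys assumed to follow etcd-operator's static TLS conventions:
kubectl -n example-com create secret generic etcd-peer \
  --from-file=peer.crt --from-file=peer.key --from-file=peer-ca.crt=ca.crt
kubectl -n example-com create secret generic etcd-server \
  --from-file=server.crt --from-file=server.key --from-file=server-ca.crt=ca.crt
kubectl -n example-com create secret generic etcd-client \
  --from-file=etcd-client.crt --from-file=etcd-client.key --from-file=etcd-client-ca.crt=ca.crt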
Additionally: after the second member of the etcd cluster got stuck in a failed state, the etcd-operator logs contained:
time="2020-02-11T13:57:33Z" level=info msg="Start reconciling" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster time="2020-02-11T13:57:33Z" level=info msg="running members: the-etcd-cluster-bxbxwtwvpf" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster time="2020-02-11T13:57:33Z" level=info msg="cluster membership: the-etcd-cluster-rnq8c9wtzz,the-etcd-cluster-bxbxwtwvpf" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster time="2020-02-11T13:57:33Z" level=info msg="Finish reconciling" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster time="2020-02-11T13:57:33Z" level=error msg="failed to reconcile: lost quorum" cluster-name=the-etcd-cluster cluster-namespace=example-com pkg=cluster
On @MikeSpreitzer 's prompt, here is a list of issues in which the observed behavior was similar to the one described in this issue:
https://github.com/coreos/etcd-operator/issues/1330
https://github.com/coreos/etcd-operator/issues/1300
https://github.com/coreos/etcd-operator/issues/1962
https://github.com/etcd-io/etcd/issues/8803
https://github.com/etcd-io/etcd/issues/8268
https://github.com/kubernetes/kops/issues/6024
Is there a solution for this issue?
@myazid not that I know of. Are you experiencing it? If so, could you post more details?
I saw this error in the logs:
2020-03-22 23:32:33.713411 I | embed: rejected connection from "10.244.1.13:53494" (error "tls: \"10.244.1.13\" does not match any of DNSNames [\"*.orchestrator-etcd-cluster.default.svc\" \"*.orchestrator-etcd-cluster.default.svc.cluster.local\"]", ServerName "orchestrator-etcd-cluster-2q52srppbq.orchestrator-etcd-cluster.default.svc", IPAddresses [], DNSNames ["*.orchestrator-etcd-cluster.default.svc" "*.orchestrator-etcd-cluster.default.svc.cluster.local"])
etcd did a reverse lookup by IP, so I tested the lookup by IP:
nslookup 10.244.1.3
3.1.244.10.in-addr.arpa name = 10-244-1-3.orchestrator-etcd-cluster-client.default.svc.cluster.local.
The resolved domain is not in the DNSNames.
The server TLS cert defines CN: *.orchestrator-etcd-cluster.default.svc and SANs: *.orchestrator-etcd-cluster.default.svc.cluster.local, *.orchestrator-etcd-cluster-client.default.svc, *.orchestrator-etcd-cluster-client.default.svc.cluster.local, orchestrator-etcd-cluster-client.default.svc.cluster.local, localhost
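To confirm exactly which SANs ended up in the deployed server cert (versus the name etcd derives from the reverse lookup), something like the following can help; the secret name and the server.crt data key are placeholders for whatever the deployment actually uses:

# Secret name and data key are assumptions; adjust to your deployment.
kubectl -n default get secret orchestrator-etcd-server-tls \
  -o jsonpath='{.data.server\.crt}' | base64 -d \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'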
etcd-operator: 0.9.4 etcd: v3.4.5
Running on K8s v1.17.0
Observation: I was experiencing the same problem. However, when I configured PVCs, the etcd pod took significant time (~50s) to start after it had been spawned, and in that case the DNS name mismatch did not occur.
Hypothesis: could it be that, if the newly spawned pod is too quick to connect to the existing pod, the DNS request is resolved by a cache that has not yet picked up the recently added IP address/DNS name?
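If that hypothesis is right, one rough way to probe it is to watch how long the new member's DNS record takes to become resolvable after the pod is created (DNS caching in the cluster, e.g. the cache plugin in a default CoreDNS Corefile, could plausibly account for a delay on that order). A sketch with placeholder names; the member pod name, cluster name, and namespace below are all hypothetical:

# All names below are placeholders; substitute the member that fails to join.
kubectl -n example-com run dns-probe --rm -it --restart=Never --image=busybox:1.36 -- \
  sh -c 'until nslookup the-etcd-cluster-xxxxxxxxxx.the-etcd-cluster.example-com.svc; do sleep 1; done; echo resolved'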
Reported by @matte21 on https://github.com/kubernetes/kubernetes/issues/81508#issuecomment-590646553:
cc @MikeSpreitzer @hexfusion