Thanks Ben, I think we need to discuss what behavior we want to fix. Clearly this is a regression in 3.5 (and 3.4?), but implicitly blocking the bootstrap config on resolving a DNS entry in an equality check is also not a great idea.
I would rather have config.go explicitly look up the important URLs and fail with a meaningful error message when they can't be resolved (akin to the init container the hypershift team added in https://github.com/openshift/hypershift/pull/1985). Then validate whether the URLs and their IP configs are consistent and equal, potentially with a cmd flag to turn off that check.
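A minimal sketch of the kind of explicit lookup being proposed here (the verifyURLsResolvable helper below is hypothetical, not part of etcd): resolve each configured URL up front and fail with a descriptive error, instead of surfacing a TLS mismatch later.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/url"
	"time"
)

// verifyURLsResolvable is a hypothetical pre-flight check: it resolves the
// host of every important URL and returns a descriptive error on failure,
// instead of silently continuing (or blocking) during bootstrap.
func verifyURLsResolvable(ctx context.Context, urls []string) error {
	var resolver net.Resolver
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil {
			return fmt.Errorf("invalid URL %q: %w", raw, err)
		}
		if _, err := resolver.LookupHost(ctx, u.Hostname()); err != nil {
			return fmt.Errorf("cannot resolve %q (from %q): %w", u.Hostname(), raw, err)
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	urls := []string{"https://etcd-0.etcd-discovery.example.svc:2380"}
	if err := verifyURLsResolvable(ctx, urls); err != nil {
		// Fail bootstrap with a meaningful error rather than a TLS mismatch later.
		fmt.Println("bootstrap config check failed:", err)
	}
}
```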
we've seen troubles with wildcard dns names matching the server name early
In case I misunderstood anything, could you clarify this? How did "wildcard dns names matching the server name early" happen? Do you have detailed steps to reproduce this issue?
I only have fragments of the whole issue myself, I'll come back with a reproducer in a test case once I receive more information later today.
As far as I understood, the issue stems from the fact that the URLs somehow match this logic: https://github.com/etcd-io/etcd/pull/14573/files#diff-669913e187d4dc726dcb9e01a5666ae0562549545138d25d2c8fc2ceb3e23d30R151-R165
My assumption was that the wildcard is ignored, so the comparison matches only on the host (which is equal), which in turn means the DNS is no longer resolved. Previously that would've blocked for a while to resolve, thus "preventing" etcd from starting.
It seems not related to https://github.com/etcd-io/etcd/pull/13224. The function urlsEqual just compares two slices of URLs, and skips resolving DNS if they are deep equal. The logic seems simple and correct to me.
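For reference, a simplified sketch of that short-circuit (not the actual etcd code): sort and deep-compare the two URL slices first, and only fall back to DNS resolution when they differ.

```go
package main

import (
	"fmt"
	"net/url"
	"reflect"
	"sort"
)

// urlsEqualSketch illustrates the short-circuit discussed above: if the two
// URL slices are already deep-equal, DNS resolution is skipped entirely.
func urlsEqualSketch(a, b []url.URL) bool {
	if len(a) != len(b) {
		return false
	}
	sort.Slice(a, func(i, j int) bool { return a[i].String() < a[j].String() })
	sort.Slice(b, func(i, j int) bool { return b[i].String() < b[j].String() })
	if reflect.DeepEqual(a, b) {
		return true // equal as-is: no DNS lookup happens
	}
	// Only when the slices differ would the real code resolve the hostnames
	// to IPs and compare the resolved forms (omitted in this sketch).
	return false
}

func main() {
	u := url.URL{Scheme: "https", Host: "etcd-0.etcd-discovery.svc:2380"}
	fmt.Println(urlsEqualSketch([]url.URL{u}, []url.URL{u})) // true, without resolving DNS
}
```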
Anyway, please clarify the issue and provide more detailed info firstly. FYI. https://github.com/ahrtr/etcd-issues/blob/master/docs/troubleshooting/sanity_check_and_collect_basic_info.md
The logic seems simple and correct to me.
The logic is not the issue, it's the ordering that breaks an implicit assumption users had about the startup of etcd.
Here's how the team runs etcd inside k8s:
/usr/bin/etcd
--data-dir=/var/lib/data
--name=etcd-0
--initial-advertise-peer-urls=https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380
--listen-peer-urls=https://10.137.0.28:2380
--listen-client-urls=https://10.137.0.28:2379,https://localhost:2379
--advertise-client-urls=https://etcd-0.etcd-client.clusters-cewong-guest.svc:2379
--listen-metrics-urls=https://0.0.0.0:2382
--initial-cluster-token=etcd-cluster
--initial-cluster=etcd-0=https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380,etcd-1=https://etcd-1.etcd-discovery.clusters-cewong-guest.svc:2380,etcd-2=https://etcd-2.etcd-discovery.clusters-cewong-guest.svc:2380
--initial-cluster-state=new
--quota-backend-bytes=4294967296
--snapshot-count=10000
--peer-client-cert-auth=true
--peer-cert-file=/etc/etcd/tls/peer/peer.crt
--peer-key-file=/etc/etcd/tls/peer/peer.key
--peer-trusted-ca-file=/etc/etcd/tls/etcd-ca/ca.crt
--client-cert-auth=true
--cert-file=/etc/etcd/tls/server/server.crt
--key-file=/etc/etcd/tls/server/server.key
--trusted-ca-file=/etc/etcd/tls/etcd-ca/ca.crt
If you step through it with a debugger, you notice the code path in config.go that validates the peer URLs: https://github.com/etcd-io/etcd/blob/6200b22f79abb6e9c4e6126f72ded616d85546c4/server/config/config.go#L255
In the above case, the URLs that are compared turn out to be equal, which is correct. Previously the method netutil.URLStringsEqual would block until the IP resolves; since 3.5.6 it just continues to start the server.
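To make the behavior change concrete, here is a rough sketch of the kind of blocking resolution that used to gate startup (assumptions: a plain retry loop with a context deadline; this is not the actual netutil implementation, and the host/interval values are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolveWithRetry sketches the "block until the IP resolves" behavior
// described above: retry the lookup on an interval until it succeeds or
// the context expires.
func resolveWithRetry(ctx context.Context, host string, interval time.Duration) ([]string, error) {
	var resolver net.Resolver
	for {
		addrs, err := resolver.LookupHost(ctx, host)
		if err == nil {
			return addrs, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("gave up resolving %q: %w", host, err)
		case <-time.After(interval):
			// Keep retrying. Pre-3.5.6, reaching this path effectively delayed
			// etcd startup until DNS was ready; with the deep-equal short-circuit
			// it may never be reached.
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	fmt.Println(resolveWithRetry(ctx, "etcd-0.etcd-discovery.svc", time.Second))
}
```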
Starting without the DNS resolved then causes errors in listener_tls.go that look like this:
{"level":"warn","ts":"2022-12-13T11:29:51.254Z","caller":"embed/config_logging.go:160","msg":"rejected connection",
"remote-addr":"10.132.1.27:42760",
"server-name":"etcd-0.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
"ip-addresses":[],
"dns-names":[
"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local",
"127.0.0.1",
"::1"],
"error":"tls: \"10.132.1.27\" does not match any of DNSNames [\"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc\" \"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}
That error led us to believe it was an issue with the wildcard matching - apologies for the red herring. It's merely a difference in waiting for DNS resolution.
Here's my attempt to make this a bit more explicit during bootstrap: https://github.com/etcd-io/etcd/pull/15064
The startup would then fail with:
{"level":"warn","ts":"2023-01-05T16:33:32.901+0100","caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380","host":"etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380","retry-interval":"1s","error":"lookup etcd-0.etcd-discovery.clusters-cewong-guest.svc: no such host"}
Have you confirmed that https://github.com/etcd-io/etcd/pull/15064 can resolve your issue?
Based on the error message above in https://github.com/etcd-io/etcd/issues/15062#issuecomment-1372324491, the etcd server rejected the client connection because the client's hostname did not match the DNSNames in the certificate. FYI. https://etcd.io/docs/v3.5/op-guide/security/#notes-for-tls-authentication
Have you changed anything recently (e.g. the client certificate)? If not, does it perhaps mean that you just need to wait some time for the reverse lookup to work?
"error":"tls: \"10.132.1.27\" does not match any of DNSNames [\"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc\" \"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}
Did you see this error only for a short period after etcd started, or all the time?
@csrwng can you quickly pitch in? You validated it successfully with this PR yesterday.
@ahrtr The error occurs continuously for about 5 minutes and then eventually resolves itself. This looks to be a timing issue. If a new cluster member starts attempting to contact its peers without waiting ~2-3 secs for its name to be resolvable, then peers will start reporting the error above and will not accept traffic from the new member until after 5 minutes have passed. Having the check that ensures the name is resolvable before starting the member prevents the communication errors from the peer.
I have confirmed that this change does fix the issue.
The error occurs continuously for about 5 minutes and then eventually resolves itself. This looks to be a timing issue.
Do you mean etcd keeps being restarted (failing due to the error "does not match any of DNSNames", getting started again soon, then failing again...) for about 5 minutes? Have you figured out the root cause of why the error occurs continuously for about 5 minutes?
It looks like a DNS issue to me. Adding an init-container to make sure the service (e.g. etcd-0.etcd-discovery.clusters-cewong-guest.svc) is resolvable is a correct workaround to me; but I'd suggest figuring out the root cause of why the service isn't resolvable during the first ~5 minutes.
Just to double confirm... you only upgraded to etcd 3.5.6 and did not change anything else, then you ran into this issue? Have you changed anything else (e.g. client or peer certificates, or DNS settings)?
It looks like a DNS issue to me.
While I agree it's a DNS issue, the etcd behavior has changed with the patch release due to the backport. I'm not sure anyone else is relying on that, but it would still be good if the 3.5 releases could behave the same way.
Do you mean etcd keeps being restarted (failing due to the error "does not match any of DNSNames", getting started again soon, then failing again...) for about 5 minutes?
No, the error is reported by an established cluster member as a new member tries to join. And yes, the error occurs for about 5 minutes, even though the DNS name was unresolvable for only about 2-3 seconds.
Just to double confirm... you only upgraded to the etcd 3.5.6, and did not change anything else, then you ran into this issue?
Yes
I created a couple of diagrams to better explain what is happening:
Thanks @csrwng for the feedback. But I am confused.

Previously @tjungblu provided error message "does not match any of DNSNames", but you provided error message "failed to resolve URL Host". Are you talking about two different issues?

In https://github.com/etcd-io/etcd/pull/15064, (*ServerConfig) advertiseMatchesCluster() is modified, but it's only called in (*ServerConfig)VerifyBootstrap, which is only called in the case of no local WAL files and a new cluster (please ignore the etcdutl use case). Your case is related to a member joining an existing cluster, how can the PR fix your issue?

The two biggest concerns are:
1. Why is the hostname unresolved for about 2~3 seconds?
2. The hostname is only unresolved for 2~3 seconds, so why does it take about 5 minutes for the new member to join the cluster?

In order to avoid too much back and forth, could you please clarify the issue firstly, and follow collect-basic-info to collect & provide the required info?
Previously @tjungblu provided error message "does not match any of DNSNames", but you provided error message "failed to resolve URL Host". Are you talking about two different issues?
Sorry that is my bad, I used the wrong error message in my diagram. It is "does not match any of DNSNames".
Your case is related to a member joining an existing cluster, how can the PR fix your issue?
This is a cluster defined by a StatefulSet. All members start at roughly the same time. There is no local WAL file when members start, but how does it know if it's a new cluster or not?
Why is the hostname unresolved for about 2~3 seconds?
The dns name used for the member is that of a headless service associated with the statefulset. I imagine that after the pod starting, it takes some time for coredns to make that pod's IP resolvable via the service's name.
The hostname is only unresolved for 2~3 seconds, so why does it take about 5 minutes for the new member to join the cluster?
Very good question. This is something happening in etcd code. It seems that if a lookup fails, the result is cached for that amount of time, but I don't know exactly how.
could you please clarify the issue firstly, and follow collect-basic-info to collect & provide the required info?
Will do
but how does it know if it's a new cluster or not?
It depends on the CLI flag --initial-cluster-state. But if it has local WAL files, then the flag is ignored.
If it has local WAL files, you should see the log below: https://github.com/etcd-io/etcd/blob/816c2e2b8a00d8b47ed9013cf04a6b309cb1a28b/server/etcdserver/raft.go#L494-L497
If it doesn't have local WAL files, you should see the log below: https://github.com/etcd-io/etcd/blob/816c2e2b8a00d8b47ed9013cf04a6b309cb1a28b/server/etcdserver/raft.go#L529-L534
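A condensed sketch of the decision described above (the haveWAL/initialClusterState inputs are stand-ins; the real logic lives in the raft and bootstrap code linked above):

```go
package main

import "fmt"

// bootstrapMode sketches the decision described above: existing local WAL
// files win, and only without them does --initial-cluster-state matter.
func bootstrapMode(haveWAL bool, initialClusterState string) string {
	if haveWAL {
		// "restarting local member" path: the flag is ignored.
		return "restart existing member from WAL"
	}
	if initialClusterState == "new" {
		// "starting local member" path: this is where VerifyBootstrap and
		// advertiseMatchesCluster are exercised.
		return "bootstrap member of a new cluster"
	}
	return "join an existing cluster"
}

func main() {
	fmt.Println(bootstrapMode(false, "new"))      // new cluster
	fmt.Println(bootstrapMode(false, "existing")) // joining an existing cluster
	fmt.Println(bootstrapMode(true, "new"))       // WAL present: flag ignored
}
```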
any update on this?
I've had to turn my attention to more urgent issues. I will try to collect this info by end of next week.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
What happened?
See https://github.com/etcd-io/etcd/pull/13224#discussion_r1062216483
The correct logic should be: