Thanks Ben, I think we need to discuss what behavior we want to fix. Clearly this is a regression in 3.5 (and 3.4?), but implicitly blocking the bootstrap config on resolving a DNS entry in an equality check is also not a great idea.
I would rather have config.go explicitly look up the important URLs and fail with a meaningful error message when they can't be resolved (akin to the init container the hypershift team added in https://github.com/openshift/hypershift/pull/1985). Then validate whether the URLs and their IP configs are consistent and equal, potentially with a cmd flag to turn off that check.
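A minimal sketch of the kind of explicit lookup being proposed here (the verifyURLsResolvable helper below is hypothetical, not part of etcd): resolve each configured URL up front and fail with a descriptive error, instead of surfacing a TLS mismatch later.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/url"
	"time"
)

// verifyURLsResolvable is a hypothetical pre-flight check: it resolves the
// host of every important URL and returns a descriptive error on failure,
// instead of silently continuing (or blocking) during bootstrap.
func verifyURLsResolvable(ctx context.Context, urls []string) error {
	var resolver net.Resolver
	for _, raw := range urls {
		u, err := url.Parse(raw)
		if err != nil {
			return fmt.Errorf("invalid URL %q: %w", raw, err)
		}
		if _, err := resolver.LookupHost(ctx, u.Hostname()); err != nil {
			return fmt.Errorf("cannot resolve %q (from %q): %w", u.Hostname(), raw, err)
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	urls := []string{"https://etcd-0.etcd-discovery.example.svc:2380"}
	if err := verifyURLsResolvable(ctx, urls); err != nil {
		// Fail bootstrap with a meaningful error rather than a TLS mismatch later.
		fmt.Println("bootstrap config check failed:", err)
	}
}
```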
we've seen troubles with wildcard dns names matching the server name early
In case I misunderstood anything, could you clarify this? How did "wildcard dns names matching the server name early" happen? Do you have detailed steps to reproduce this issue?
I only have fragments of the whole issue myself, I'll come back with a reproducer in a test case once I receive more information later today.
As far as I understood, the issue stems from the fact that the URLs somehow match this logic: https://github.com/etcd-io/etcd/pull/14573/files#diff-669913e187d4dc726dcb9e01a5666ae0562549545138d25d2c8fc2ceb3e23d30R151-R165
My assumption was that the wildcard is ignored, so the comparison matches only on the host (which is equal), which in turn means the DNS is no longer resolved. Previously that would've blocked for a while to resolve, thus "preventing" etcd from starting.
It seems not related to https://github.com/etcd-io/etcd/pull/13224. The function urlsEqual just compares two slices of URLs, and skips resolving DNS if they are deep equal. The logic seems simple and correct to me.
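For reference, a simplified sketch of that short-circuit (not the actual etcd code): sort and deep-compare the two URL slices first, and only fall back to DNS resolution when they differ.

```go
package main

import (
	"fmt"
	"net/url"
	"reflect"
	"sort"
)

// urlsEqualSketch illustrates the short-circuit discussed above: if the two
// URL slices are already deep-equal, DNS resolution is skipped entirely.
func urlsEqualSketch(a, b []url.URL) bool {
	if len(a) != len(b) {
		return false
	}
	sort.Slice(a, func(i, j int) bool { return a[i].String() < a[j].String() })
	sort.Slice(b, func(i, j int) bool { return b[i].String() < b[j].String() })
	if reflect.DeepEqual(a, b) {
		return true // equal as-is: no DNS lookup happens
	}
	// Only when the slices differ would the real code resolve the hostnames
	// to IPs and compare the resolved forms (omitted in this sketch).
	return false
}

func main() {
	u := url.URL{Scheme: "https", Host: "etcd-0.etcd-discovery.svc:2380"}
	fmt.Println(urlsEqualSketch([]url.URL{u}, []url.URL{u})) // true, without resolving DNS
}
```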
Anyway, please clarify the issue and provide more detailed info firstly. FYI. https://github.com/ahrtr/etcd-issues/blob/master/docs/troubleshooting/sanity_check_and_collect_basic_info.md
The logic seems simple and correct to me.
The logic is not the issue, it's the ordering that breaks an implicit assumption users had about the startup of etcd.
Here's how the team runs etcd inside k8s:
/usr/bin/etcd
--data-dir=/var/lib/data
--name=etcd-0
--initial-advertise-peer-urls=https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380
--listen-peer-urls=https://10.137.0.28:2380
--listen-client-urls=https://10.137.0.28:2379,https://localhost:2379
--advertise-client-urls=https://etcd-0.etcd-client.clusters-cewong-guest.svc:2379
--listen-metrics-urls=https://0.0.0.0:2382
--initial-cluster-token=etcd-cluster
--initial-cluster=etcd-0=https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380,etcd-1=https://etcd-1.etcd-discovery.clusters-cewong-guest.svc:2380,etcd-2=https://etcd-2.etcd-discovery.clusters-cewong-guest.svc:2380
--initial-cluster-state=new
--quota-backend-bytes=4294967296
--snapshot-count=10000
--peer-client-cert-auth=true
--peer-cert-file=/etc/etcd/tls/peer/peer.crt
--peer-key-file=/etc/etcd/tls/peer/peer.key
--peer-trusted-ca-file=/etc/etcd/tls/etcd-ca/ca.crt
--client-cert-auth=true
--cert-file=/etc/etcd/tls/server/server.crt
--key-file=/etc/etcd/tls/server/server.key
--trusted-ca-file=/etc/etcd/tls/etcd-ca/ca.crt
If you step through it with a debugger, you notice the code path in config.go that validates the peer URLs: https://github.com/etcd-io/etcd/blob/6200b22f79abb6e9c4e6126f72ded616d85546c4/server/config/config.go#L255
In the above case, the URLs that are compared turn out to be equal, which is correct. Previously the method netutil.URLStringsEqual would block until the IP resolves; since 3.5.6 it just continues to start the server.
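To make the behavior change concrete, here is a rough sketch of the kind of blocking resolution that used to gate startup (assumptions: a plain retry loop with a context deadline; this is not the actual netutil implementation, and the host/interval values are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// resolveWithRetry sketches the "block until the IP resolves" behavior
// described above: retry the lookup on an interval until it succeeds or
// the context expires.
func resolveWithRetry(ctx context.Context, host string, interval time.Duration) ([]string, error) {
	var resolver net.Resolver
	for {
		addrs, err := resolver.LookupHost(ctx, host)
		if err == nil {
			return addrs, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("gave up resolving %q: %w", host, err)
		case <-time.After(interval):
			// Keep retrying. Pre-3.5.6, reaching this path effectively delayed
			// etcd startup until DNS was ready; with the deep-equal short-circuit
			// it may never be reached.
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	fmt.Println(resolveWithRetry(ctx, "etcd-0.etcd-discovery.svc", time.Second))
}
```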
Starting without the DNS resolved then causes errors in listener_tls.go that look like this:
{"level":"warn","ts":"2022-12-13T11:29:51.254Z","caller":"embed/config_logging.go:160","msg":"rejected connection",
"remote-addr":"10.132.1.27:42760",
"server-name":"etcd-0.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
"ip-addresses":[],
"dns-names":[
"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc",
"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local",
"127.0.0.1",
"::1"],
"error":"tls: \"10.132.1.27\" does not match any of DNSNames [\"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc\" \"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}
That error led us to believe it was an issue with the wildcard matching - apologies for the red herring. It's merely a difference in waiting for DNS resolution.
Here's my attempt to make this a bit more explicit during bootstrap: https://github.com/etcd-io/etcd/pull/15064
The startup would then fail with:
{"level":"warn","ts":"2023-01-05T16:33:32.901+0100","caller":"netutil/netutil.go:121","msg":"failed to resolve URL Host","url":"https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380","host":"etcd-0.etcd-discovery.clusters-cewong-guest.svc:2380","retry-interval":"1s","error":"lookup etcd-0.etcd-discovery.clusters-cewong-guest.svc: no such host"}
Have you confirmed that https://github.com/etcd-io/etcd/pull/15064 can resolve your issue?
Based on the error message above in https://github.com/etcd-io/etcd/issues/15062#issuecomment-1372324491, the etcd server rejected the client connection because the client's hostname did not match the DNSNames in the certificate. FYI. https://etcd.io/docs/v3.5/op-guide/security/#notes-for-tls-authentication
Have you changed anything recently (e.g. the client certificate)? If not, does it perhaps mean that you just need to wait some time for the reverse lookup to work?
"error":"tls: \"10.132.1.27\" does not match any of DNSNames [\"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc\" \"*.etcd-discovery.e2e-clusters-8f75l-example-54cx2.svc.cluster.local\" \"127.0.0.1\" \"::1\"]"}
Did you see this error only for a short period after etcd started, or all the time?
@csrwng can you quickly pitch in? You validated it successfully with this PR yesterday.
@ahrtr The error occurs continuously for about 5 minutes and then eventually resolves itself. This looks to be a timing issue. If a new cluster member starts attempting to contact its peers without waiting ~2-3 secs for its name to be resolvable, then peers will start reporting the error above and will not accept traffic from the new member until after 5 minutes have passed. Having the check that ensures the name is resolvable before starting the member prevents the communication errors from the peer.
I have confirmed that this change does fix the issue.
The error occurs continuously for about 5 minutes and then eventually resolves itself. This looks to be a timing issue.
Do you mean etcd keeps being restarted (failing due to the error "does not match any of DNSNames", getting started again soon, then failing again...) for about 5 minutes? Have you figured out the root cause of why the error occurs continuously for about 5 minutes?
It looks like a DNS issue to me. Adding an init-container to make sure the service (e.g. etcd-0.etcd-discovery.clusters-cewong-guest.svc) is resolvable is a correct workaround to me; but I'd suggest figuring out the root cause of why the service isn't resolvable during the first ~5 minutes.
Just to double confirm... you only upgraded to etcd 3.5.6 and did not change anything else, then you ran into this issue? Have you changed anything else (e.g. client or peer certificates, or DNS settings)?
It looks like a DNS issue to me.
While I agree it's a DNS issue, the etcd behavior has changed with the patch release due to the backport. I'm not sure anyone else is relying on that, but it would still be good if the 3.5 releases could behave the same way.
Do you mean etcd keeps being restarted (failing due to the error "does not match any of DNSNames", getting started again soon, then failing again...) for about 5 minutes?
No, the error is reported by an established cluster member as a new member tries to join. And yes, the error occurs for about 5 minutes, even though the DNS name was unresolvable for only about 2-3 seconds.
Just to double confirm... you only upgraded to the etcd 3.5.6, and did not change anything else, then you ran into this issue?
Yes
I created a couple of diagrams to better explain what is happening:
Thanks @csrwng for the feedback. But I am confused.

Previously @tjungblu provided error message "does not match any of DNSNames", but you provided error message "failed to resolve URL Host". Are you talking about two different issues?

In https://github.com/etcd-io/etcd/pull/15064, (*ServerConfig) advertiseMatchesCluster() is modified, but it's only called in (*ServerConfig)VerifyBootstrap, which is only called in the case of no local WAL files and a new cluster (please ignore the etcdutl use case). Your case is related to a member joining an existing cluster, how can the PR fix your issue?

The two biggest concerns are:
1. Why is the hostname unresolved for about 2~3 seconds?
2. The hostname is only unresolved for 2~3 seconds, so why does it take about 5 minutes for the new member to join the cluster?

In order to avoid too much back and forth, could you please clarify the issue firstly, and follow collect-basic-info to collect & provide the required info?
Previously @tjungblu provided error message "does not match any of DNSNames", but you provided error message "failed to resolve URL Host". Are you talking about two different issues?
Sorry that is my bad, I used the wrong error message in my diagram. It is "does not match any of DNSNames".
Your case is related to a member joining an existing cluster, how can the PR fix your issue?
This is a cluster defined by a StatefulSet. All members start at roughly the same time. There is no local WAL file when members start, but how does it know if it's a new cluster or not?
Why is the hostname unresolved for about 2~3 seconds?
The dns name used for the member is that of a headless service associated with the statefulset. I imagine that after the pod starting, it takes some time for coredns to make that pod's IP resolvable via the service's name.
The hostname is only unresolved for 2~3 seconds, so why does it take about 5 minutes for the new member to join the cluster?
Very good question. This is something happening in etcd code. It seems that if a lookup fails, the result is cached for that amount of time, but I don't know exactly how.
could you please clarify the issue firstly, and follow collect-basic-info to collect & provide the required info?
Will do
but how does it know if it's a new cluster or not?
It depends on the CLI flag --initial-cluster-state. But if it has local WAL files, then the flag is ignored.
If it has local WAL files, you should see the log below: https://github.com/etcd-io/etcd/blob/816c2e2b8a00d8b47ed9013cf04a6b309cb1a28b/server/etcdserver/raft.go#L494-L497
If it doesn't have local WAL files, you should see the log below: https://github.com/etcd-io/etcd/blob/816c2e2b8a00d8b47ed9013cf04a6b309cb1a28b/server/etcdserver/raft.go#L529-L534
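A condensed sketch of the decision described above (the haveWAL/initialClusterState inputs are stand-ins; the real logic lives in the raft and bootstrap code linked above):

```go
package main

import "fmt"

// bootstrapMode sketches the decision described above: existing local WAL
// files win, and only without them does --initial-cluster-state matter.
func bootstrapMode(haveWAL bool, initialClusterState string) string {
	if haveWAL {
		// "restarting local member" path: the flag is ignored.
		return "restart existing member from WAL"
	}
	if initialClusterState == "new" {
		// "starting local member" path: this is where VerifyBootstrap and
		// advertiseMatchesCluster are exercised.
		return "bootstrap member of a new cluster"
	}
	return "join an existing cluster"
}

func main() {
	fmt.Println(bootstrapMode(false, "new"))      // new cluster
	fmt.Println(bootstrapMode(false, "existing")) // joining an existing cluster
	fmt.Println(bootstrapMode(true, "new"))       // WAL present: flag ignored
}
```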
any update on this?
I've had to turn my attention to more urgent issues. I will try to collect this info by end of next week.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
What happened?
See https://github.com/etcd-io/etcd/pull/13224#discussion_r1062216483
The correct logic should be: