envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Do not redo healthchecks for endpoints with different priorities #35243

Open shulin-sq opened 1 month ago

shulin-sq commented 1 month ago

Title: Do not redo healthchecks for endpoints with different priorities

Description:


We use EDS to deliver endpoints that represent different priorities. Sometimes the priority of an endpoint switches as we change which set of IPs should be prioritized in load balancing. During this change I've noticed that when an endpoint changes priority it becomes unhealthy.

# initially
myapp_cluster::10.180.179.183:443::health_flags::healthy
myapp_cluster::10.180.179.183:443::priority::0

# upon update of priorities
myapp_cluster::10.178.43.7:443::health_flags::/failed_active_hc
myapp_cluster::10.178.43.7:443::priority::0
myapp_cluster::10.180.179.183:443::health_flags::/failed_active_hc
myapp_cluster::10.180.179.183:443::priority::1

# eventually
myapp_cluster::10.178.43.7:443::health_flags::healthy
myapp_cluster::10.178.43.7:443::priority::0
myapp_cluster::10.180.179.183:443::health_flags::healthy
myapp_cluster::10.180.179.183:443::priority::1

Since both IPs in the cluster become unhealthy, this creates a scenario where, if I do not use panic routing, clients see a no_healthy_upstream 503 error. This seems avoidable, since in this example the IP 10.180.179.183 was already being healthchecked prior to the update.

Is it possible to carry over 10.180.179.183's healthcheck state even though its priority has changed?
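
For concreteness, here is a minimal sketch of the EDS payloads involved (assuming the v3 xDS protos and their generated C++ bindings; the helper name makeAssignment is illustrative, not from this setup): the endpoint address stays the same across updates, and only the priority on its LocalityLbEndpoints entry changes.

```cpp
// Illustrative only: builds the ClusterLoadAssignment for "myapp_cluster"
// with 10.180.179.183:443 at a given priority (0 before the update, 1 after).
#include "envoy/config/endpoint/v3/endpoint.pb.h"

envoy::config::endpoint::v3::ClusterLoadAssignment
makeAssignment(uint32_t priority) {
  envoy::config::endpoint::v3::ClusterLoadAssignment cla;
  cla.set_cluster_name("myapp_cluster");

  // One locality group per priority; only the priority field differs
  // between the "before" and "after" snapshots.
  auto* locality = cla.add_endpoints();
  locality->set_priority(priority);

  auto* socket_address = locality->add_lb_endpoints()
                             ->mutable_endpoint()
                             ->mutable_address()
                             ->mutable_socket_address();
  socket_address->set_address("10.180.179.183");
  socket_address->set_port_value(443);
  return cla;
}
```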

A related issue might be https://github.com/envoyproxy/envoy/issues/4280.
Envoy version: 1.31.3

KBaichoo commented 1 month ago

This seems similar to an issue resolved by https://github.com/envoyproxy/envoy/pull/27900, which added preservation of health-checking information when an endpoint bounced between localities.

cc @adisuissa

adisuissa commented 1 month ago

To the best of my knowledge there is a test that validates this, so it should be working. It was added as part of #14483, so unless something changed it should work for your Envoy version.

I'm not sure if there's any other config knob that overrides the healthcheck status in this case. Further investigation is probably needed, and I suggest starting with an integration test that reproduces the bug.

shulin-sq commented 1 month ago

Thank you for the response and the links to the code. I don't see anything obvious that would cause this so I will work on a repro in a less complex environment.

shulin-sq commented 1 month ago

@adisuissa @KBaichoo

circling back to this

I am unfamiliar with how to write unit tests or functional tests, so while I work that out I thought I would use the Envoy that can be installed via brew to see if I can build a simple repro case.

Envoy config: https://gist.github.com/shulin-sq/60d723009689b8f31400f87fff890393
Control plane: https://github.com/shulin-sq/java-control-plane/commit/fa78081ef4f618f971ab5ce505dcb3f7455a2bfa (running the test control plane implementation io.envoyproxy.controlplane.server.TestMain with some modifications to flip the priority between 0 and 1)
Envoy version: 816188b86a0a52095b116b107f576324082c7c02/1.30.1/Distribution/RELEASE/BoringSSL

I have a video of the repro but it's too large to upload; in the screenshot below you can see that the healthcheck failed even though the control plane is only changing the priority, not the endpoint address.

[Screenshot 2024-08-06 at 3:41:48 PM: the endpoint's healthcheck fails after the priority flip]

I also went back about a year through the Envoy builds I have handy, as old as v1.25.4, and was able to reproduce the bug in our staging environment. Unfortunately I was unable to go back as far as when the fix was first introduced, so I cannot say whether it ever behaved the way I expected.

shulin-sq commented 1 month ago

Digging more into this, it seems that all_hosts here (https://github.com/envoyproxy/envoy/blob/ee567154320b11f54f2ad0453e9d620d910df4f4/source/common/upstream/upstream_impl.cc#L2272) is empty in the functional test I shared above, which means the previous endpoint is never found.

I'm wondering if this is a bug, or if I've set up the test incorrectly in a way that makes it look like a new cluster is added instead of just the endpoint changing.

[2024-08-09 22:12:30.241][82232][info][upstream] [source/common/upstream/cds_api_helper.cc:32] cds: add 1 cluster(s), remove 1 cluster(s)
[2024-08-09 22:12:30.242][82232][debug][upstream] [source/common/upstream/cds_api_helper.cc:54] cds: add/update cluster 'httpbin' skipped
[2024-08-09 22:12:30.242][82232][info][upstream] [source/common/upstream/cds_api_helper.cc:71] cds: added/updated 0 cluster(s), skipped 1 unmodified cluster(s)
[2024-08-09 22:12:30.243][82232][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 34.194.112.169:80
[2024-08-09 22:12:30.244][82232][debug][upstream] [source/extensions/clusters/eds/eds.cc:428] EDS hosts or locality weights changed for cluster: httpbin current hosts 0 priority 0
[2024-08-09 22:12:30.244][82327][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82329][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82328][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82338][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82339][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82337][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82232][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82330][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82340][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82331][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82334][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 1 removed 0
[2024-08-09 22:12:30.244][82232][debug][upstream] [source/extensions/clusters/eds/eds.cc:428] EDS hosts or locality weights changed for cluster: httpbin current hosts 1 priority 1
[2024-08-09 22:12:30.244][82327][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82329][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82328][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82330][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.244][82331][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82337][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82334][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82232][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82338][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82340][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
[2024-08-09 22:12:30.245][82339][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1550] membership update for TLS cluster httpbin added 0 removed 1
{"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"34.194.112.169","port_value":80,"resolver_name":"","ipv4_compat":false}},"cluster_name":"httpbin","add_healthy_event":{"first_check":true},"timestamp":"2024-08-09T22:12:25.243Z","locality":{"region":"","zone":"","sub_zone":""}}
{"health_checker_type":"HTTP","host":{"socket_address":{"protocol":"TCP","address":"34.194.112.169","port_value":80,"resolver_name":"","ipv4_compat":false}},"cluster_name":"httpbin","add_healthy_event":{"first_check":true},"timestamp":"2024-08-09T22:12:28.487Z","locality":{"region":"","zone":"","sub_zone":""}}

shulin-sq commented 1 month ago

OK, while debugging the code I was able to confirm that my repro is correct. The cluster churn is caused by the xds_cluster that I added as an example, which keeps trying to remove itself from the config.

So far it seems the issue is due to this block: https://github.com/envoyproxy/envoy/blob/5b80c5d9a9e63bf274963862f43ab5b9b1e57f50/source/common/upstream/upstream_impl.cc#L975-L996

[Screenshot 2024-08-12 at 9:01:11 PM]

This is because all_hosts is populated from the crossPriorityHostMap: https://github.com/envoyproxy/envoy/blob/5b80c5d9a9e63bf274963862f43ab5b9b1e57f50/source/common/upstream/upstream_impl.cc#L2269-L2272

crossPriorityHostMap does not keep priority metadata in its internal storage, so when we flip the priority of 34.194.112.169, the host is constantly being added to and removed from crossPriorityHostMap.
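
A standalone sketch of that reading (a simplified model with made-up types, not the actual Envoy code): if the host's entry is erased from the address-keyed map when it leaves priority 0 before the priority 1 update looks it up, the lookup misses and the healthcheck state is not carried over.

```cpp
// Simplified model, not Envoy code: an address-keyed map standing in for
// crossPriorityHostMap, and the carry-over lookup that consults it.
#include <iostream>
#include <string>
#include <unordered_map>

struct Host {
  std::string address;
  bool healthy = false;
};

int main() {
  // The map the healthcheck carry-over logic consults as all_hosts.
  std::unordered_map<std::string, Host> cross_priority_host_map;

  // Initial state: the endpoint is healthchecked and healthy at priority 0.
  cross_priority_host_map["34.194.112.169:80"] = Host{"34.194.112.169:80", true};

  // Priority flip delivered via EDS, applied one priority at a time:
  // the host is first dropped as a member of priority 0 ...
  cross_priority_host_map.erase("34.194.112.169:80");

  // ... and only re-added later for priority 1. In between, the priority 1
  // update looks the host up to carry over its health flags:
  if (cross_priority_host_map.find("34.194.112.169:80") ==
      cross_priority_host_map.end()) {
    // The lookup misses, so the host is treated as brand new and starts out
    // failing active HC until the next healthcheck passes.
    std::cout << "host not found; healthcheck state lost" << std::endl;
  }
  return 0;
}
```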

Before I attempt a PR, I wonder if anyone more familiar with this code could recommend how to introduce a fix. In languages I'm more familiar with, what I would do is keep crossPriorityHostMap as a mapping from priority to host addresses, and when fetching it, return a single set of addresses, as sketched below.
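
A rough sketch of that idea (hypothetical class and method names, not a proposed patch): keep entries keyed by priority and merge them when the map is fetched, so a host that is mid-flip is still found under its old priority and its healthcheck state can be carried over.

```cpp
// Hypothetical sketch of the idea above, not a proposed patch.
#include <cstdint>
#include <map>
#include <set>
#include <string>

class PriorityAwareHostMapSketch {
public:
  void add(uint32_t priority, const std::string& address) {
    hosts_by_priority_[priority].insert(address);
  }
  void remove(uint32_t priority, const std::string& address) {
    hosts_by_priority_[priority].erase(address);
  }
  // Fetching returns the union across priorities, so as long as the host is
  // still present under *some* priority the carry-over lookup succeeds.
  std::set<std::string> allAddresses() const {
    std::set<std::string> merged;
    for (const auto& entry : hosts_by_priority_) {
      merged.insert(entry.second.begin(), entry.second.end());
    }
    return merged;
  }

private:
  std::map<uint32_t, std::set<std::string>> hosts_by_priority_;
};
```

With this shape, flipping a host between priorities touches two per-priority sets, but the merged view never loses the address, which is the property the carry-over lookup needs.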

shulin-sq commented 3 weeks ago

Alright I think I have an idea on how to fix this and will submit a PR soon.