envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

The requests may fail when on demand CDS returns clusters #20873

Open lambdai opened 2 years ago

lambdai commented 2 years ago

Description:

When on-demand CDS returns an available cluster, the requests waiting for that cluster may still fail because its hosts have not yet been added to the cluster on the worker threads.

Detected by test case TcpProxyOdcdsIntegrationTest, SingleTcpClient: https://github.com/envoyproxy/envoy/blob/main/test/integration/tcp_proxy_odcds_integration_test.cc#L130

Background: Currently a cluster is fully functional only after two phases complete: the cluster is warmed up, and its host members are propagated to the worker threads.

The former makes it possible to obtain a ThreadLocalCluster by the cluster's name. The latter allows load balancing when the router needs an upstream connection.
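
To make the two phases concrete, here is a minimal self-contained sketch (not Envoy code; ToyClusterManager, ToyThreadLocalCluster, and chooseHost() are hypothetical stand-ins): warming makes the cluster visible by name, while load balancing only succeeds once membership has reached the worker.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <vector>

struct ToyHost { std::string address; };

// Phase 2 state: host membership as seen by one worker thread.
struct ToyThreadLocalCluster {
  std::vector<ToyHost> hosts;  // empty until the first membership update arrives

  // Load balancing fails while no host has been propagated yet.
  std::optional<ToyHost> chooseHost() const {
    if (hosts.empty()) return std::nullopt;  // -> "no healthy upstream host"
    return hosts.front();
  }
};

// Phase 1 state: which clusters are warmed and therefore visible by name.
struct ToyClusterManager {
  std::map<std::string, std::shared_ptr<ToyThreadLocalCluster>> warmed;

  // Lookup succeeds as soon as the cluster is warmed, even with zero hosts.
  ToyThreadLocalCluster* getThreadLocalCluster(const std::string& name) {
    auto it = warmed.find(name);
    return it == warmed.end() ? nullptr : it->second.get();
  }
};
```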

Prior to on-demand CDS, the two phases were only distinguishable by their error details, and few users needed to understand the concrete reason for a failure.

However, with on-demand CDS the expectation is a little different: the downstream filter is expected to wait until the cluster is fully ready.

Root Cause

From the main thread's perspective, the first host membership update and the resumption of the router filter are posted to the worker threads concurrently. The router filter may therefore be resumed before the first member is delivered, so the first batch of requests using on-demand CDS fails with "no healthy upstream host".
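
The unlucky ordering can be reproduced with a toy single-queue model (plain C++, no Envoy APIs): the "main thread" enqueues the filter resumption before the membership update, and the first request fails.

```cpp
#include <deque>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

int main() {
  std::vector<std::string> hosts;           // worker-local host membership
  std::deque<std::function<void()>> queue;  // worker dispatcher queue (toy)

  auto resume_router_filter = [&] {
    if (hosts.empty()) {
      std::cout << "request fails: no healthy upstream host\n";
    } else {
      std::cout << "request routed to " << hosts.front() << "\n";
    }
  };
  auto apply_host_update = [&] { hosts.push_back("10.0.0.1:443"); };

  // The unlucky ordering: the discovery callback resumes the filter before the
  // first membership update reaches this worker.
  queue.push_back(resume_router_filter);
  queue.push_back(apply_host_update);

  for (auto& task : queue) task();  // prints the failure; the host arrives too late
  return 0;
}
```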

Proposal: I am considering adding another API to the cluster manager, namely

NewHostCallback ThreadLocalCluster::waitForNewHost()

This new function can be seen as an extension of ClusterDiscoveryCallbackHandlePtr requestOnDemandClusterDiscovery() that addresses the issue.

The current requestOnDemandClusterDiscovery() would call this new waitForNewHost() and hide the details of the first host update.

This API could also be adopted when the cluster is not on-demand. There are known cases where all hosts are removed during a cluster update and the retry policy does not help.
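
A minimal sketch of how the proposed waitForNewHost() could behave, assuming a simple callback type and ignoring thread safety and handle lifetime; the real signature would be defined by the cluster manager and may differ:

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

using NewHostCallback = std::function<void()>;

class ThreadLocalClusterSketch {
public:
  // Run `cb` once the first host is known; run it immediately if one already is.
  void waitForNewHost(NewHostCallback cb) {
    if (!hosts_.empty()) {
      cb();
    } else {
      pending_.push_back(std::move(cb));
    }
  }

  // Called when a membership update reaches this worker; flushes the waiters.
  void onHostAdded(std::string host) {
    hosts_.push_back(std::move(host));
    for (auto& cb : pending_) cb();
    pending_.clear();
  }

private:
  std::vector<std::string> hosts_;
  std::vector<NewHostCallback> pending_;
};

int main() {
  ThreadLocalClusterSketch cluster;
  // The router filter registers instead of resuming right away...
  cluster.waitForNewHost([] { std::cout << "filter resumed, host available\n"; });
  // ...and is only resumed once the first host update lands on this worker.
  cluster.onHostAdded("10.0.0.1:443");
  return 0;
}
```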

Alternatives: Treat the unlucky sequence above as a known failure and improve the retry policy of each protocol to handle it. Currently TcpProxy and HCM fail fast on this condition.
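
For the alternative, a hedged sketch of what a protocol-level retry might look like; chooseHostWithRetry() and the snapshot callable are hypothetical, and real Envoy code would re-arm a dispatcher timer rather than block:

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Retry host selection a bounded number of times instead of failing fast,
// giving the first membership update a chance to reach this worker.
std::optional<std::string> chooseHostWithRetry(
    const std::function<std::vector<std::string>()>& snapshot_hosts,
    int max_attempts, std::chrono::milliseconds backoff) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    auto hosts = snapshot_hosts();          // re-read membership on every attempt
    if (!hosts.empty()) return hosts.front();
    std::this_thread::sleep_for(backoff);   // placeholder; real code would not block
  }
  return std::nullopt;  // still no host: surface the failure to the caller
}
```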

alyssawilk commented 2 years ago

cc @adisuissa @htuch

htuch commented 2 years ago

@krnowak