envoyproxy / gateway

Manages Envoy Proxy as a Standalone or Kubernetes-based Application Gateway
https://gateway.envoyproxy.io
Apache License 2.0
1.58k stars 341 forks source link

Handle EnvoyProxy Image version upgrades #1712

Closed arkodg closed 5 months ago

cnvergence commented 1 year ago

I am interested in picking this up :)

arkodg commented 1 year ago

thanks @cnvergence ! Thinking out loud, an outcome of this issue could be a E2E where all client requests are successful while 2 replicas of Envoy Proxy are undergoing a rolling restart.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

cnvergence commented 1 year ago

@arkodg coming back to this after a while, could you please point me to where should I start? I did check the upgrade and it seems like it is handled, but I may be wrong.

As for the E2E, I assume I should add a new scenario to the e2e test suite :)

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

arkodg commented 1 year ago

hey I know @chauhanshubham was looking into a similar test for control plane upgrades which would invariably also upgrade envoy proxy, should we just collapse those two e2e tests into one where we perform an upgrade with a last known EG minor version, and ensure that

  1. there is no config churn in the data plane during an upgrade
  2. there is no traffic drop in the data plane while this upgrade this happens (assuming we always have 2 replicas of control plane and data plane running)
github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

guydc commented 8 months ago

I'm concerned that a hitless in-place upgrade of envoy is not trivial.

A graceful termination of envoy may require:

It's also important to avoid race conditions where a new instance of envoy is receiving traffic before it was configured (e.g. due to order of component restart, failures in new control plane version, etc.).

Some prior art:

guydc commented 8 months ago

@arkodg

I executed a naive test:

The upgrade caused some client-facing failures during the test:

Error distribution:
  [8]   Get "http://172.18.255.200/": EOF
  [32]  Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
  [1]   Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
  [1]   Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer

It's probably possible to tune some of the parameters mentioned in my previous comment to achieve a hitless upgrade under certain test conditions (RPS, connection reuse, HTTP version, ...). But, I'm not sure that we can claim to have a hitless upgrade in general, based on such test.

So, I propose that for the GA scope, we focus on an upgrade test that ensures request convergence to successful execution after the upgrade. A limited hitless upgrade test can be a stretch-goal.

In the future, we can explore:

WDYT?

arkodg commented 8 months ago

hey @guydc I was hoping we could have some test for hitless upgrade in v1.0, with caveats, that can hopefully we removed over time post GA do agree, we can split this up, and make it a stretch goal for v1.0

arkodg commented 7 months ago

this should be fixed with #2633, keeping this open so that it can be validated with a e2e

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

arkodg commented 5 months ago

fixed with https://github.com/envoyproxy/gateway/pull/2862