Minimal gRPC service-to-service config example with health checking

envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy

https://www.envoyproxy.io

Apache License 2.0


Minimal gRPC service-to-service config example with health checking #865

Open egoldschmidt opened 7 years ago

egoldschmidt commented 7 years ago

Envoy is impressive, and its documentation is far-reaching, but the wealth of configuration options means it's difficult to know if you're setting things up right. In my case, all I want is automatic health-check-driven connection draining to ensure zero-downtime deploys. Yet after setting up an SDS server, health check listeners, retry policies, and more, I still see requests being forwarded to Envoy servers that are returning 503s.

Has anyone put together a minimal configuration example that provides what I'm looking for? I know that gRPC is a bit of a special case but I suspect many folks would love something simple to start with, as opposed to the extremely complex configs on the documentation website.

I should note that I'm running Envoy on both the client and server.

mattklein123 commented 7 years ago

We would like to have more examples, but unfortunately don't have resources to have someone work on refining all of the getting started/example documentation. Will leave this open to track us doing it at some point or perhaps someone contributing an example. In the meantime, we can help you debug on Gitter if you ask questions there.

fabianfett commented 7 years ago

@egoldschmidt Have you found anything? How did you solve your problem?

philbour commented 6 years ago

Has anyone got anywhere with this?

arkadyb commented 6 years ago

Were you able to get something to work? I'm trying to build a gRPC service mesh.

cmluciano commented 6 years ago

Any chance we can enhance the gRPC bridge example to satisfy this issue?

mattklein123 commented 6 years ago

@cmluciano optimally it would be a pure gRPC example IMO.

dio commented 6 years ago

I have a (too) simple setup example here: https://github.com/dio/simple-grpc. I will add xDS (probably CDS first) to the mix. After some cleanups, I'll try to submit a PR.

srikailash commented 6 years ago

@mattklein123, feel free to assign this to me.

dio commented 4 years ago

cc. @phlax

asaha123 commented 2 years ago

@phlax I am looking to work on this. If you have any suggestions, let me know. I will share my plan before I start work.

phlax commented 2 years ago

I would take a look at the example in @dio's repo.

Not sure if we want/need to have TLS. I think the main thing is showing the gRPC proxy and health check, but perhaps I'm wrong and it is necessary for some reason.

asaha123 commented 2 years ago

The example repo currently doesn't have a LICENSE, so I'm not sure how the code contribution would work. @dio, please advise when you get a chance. Of course, I can create my own examples.

To summarize, what should the sandbox demonstrate? Are they:

- Static configuration (i.e. considers CDS out of scope) for gRPC service-to-service communication
  - Will verify if TLS is required for some reason
- Active health checking support

asaha123 commented 2 years ago

I have started working on this with two example gRPC services, hello and world, custom-made for this sandbox.

client (grpcurl) -> envoy -> hello -> envoy -> world

Will create a PR when I am ready.

Got into a busy period at work, but I am going to keep working on it.

asaha123 commented 2 years ago

Sharing some in-progress work: https://github.com/atlassian-forks/envoy/tree/grpc_s2s/examples/grpc-s2s

I currently have a verification script: https://github.com/atlassian-forks/envoy/blob/314f274a62461a4fb1f52ad891363c0b0b79188f/examples/grpc-s2s/verify.sh. The comments roughly explain what is currently happening:

- Bring up 2 instances of the Hello and World gRPC services
- Mark one instance of each service unhealthy
- Ensure that the request is forwarded to the remaining instance

I will add more verification and see if we can manipulate panic_threshold at runtime and demonstrate the difference in behavior.
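
For readers following along, here is a rough sketch of where the panic threshold sits in a static cluster definition. It is not taken from the sandbox: the cluster name is hypothetical and the rest of the cluster is omitted. `common_lb_config.healthy_panic_threshold` is the v3 cluster field, and `upstream.healthy_panic_threshold` is the matching runtime key, which is probably what a runtime experiment like the one described above would adjust.

```yaml
# Fragment only; the cluster name "world" is an assumption for illustration.
clusters:
- name: world
  common_lb_config:
    healthy_panic_threshold:
      value: 0.0   # percentage; 0 disables panic mode, Envoy's default is 50
```

With panic mode disabled, Envoy should not fall back to routing to unhealthy hosts when too few hosts are healthy, which makes the effect of marking an instance unhealthy easier to observe.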

asaha123 commented 1 year ago

I created a draft PR: https://github.com/envoyproxy/envoy/pull/24674. There is an issue I am facing that I would like to fix before creating the documentation.

When a host is marked unhealthy, Envoy seems to update the cluster statistics to reflect that only after 60s. I have configured Envoy to flush on admin request, but that doesn't seem to help. So now I am not sure whether the statistics are simply not being updated or the host is actually not being evicted from the load balancer's upstream pool.

I will keep digging.

asaha123 commented 1 year ago

The 60s delay I noticed for eviction was due to the default value of no_traffic_interval (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/health_check.proto#config-core-v3-healthcheck). By setting it explicitly, I can now see host eviction almost immediately after the host becomes unhealthy: https://github.com/envoyproxy/envoy/pull/24674/commits/695de5cd3d528e9751b61ddc2611a753fcce0b69#diff-3af0ad67fbabd75de88150efd1552588adcf5a6c1cfe6bf20b9e8b29fedb0b93
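
As a minimal sketch of what such a setting can look like (this is not the configuration from the PR; the cluster name, hostname, port, and timing values are all assumptions), a static upstream cluster with active gRPC health checking might be written as follows. `no_traffic_interval` is the interval Envoy uses while a cluster has not yet had traffic routed to it, and it defaults to 60s. `grpc_health_check` assumes the upstream implements the standard `grpc.health.v1.Health` service.

```yaml
# Goes under static_resources. Names, address and timings are illustrative.
clusters:
- name: world                      # hypothetical cluster name
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  # gRPC needs HTTP/2 to the upstream.
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http2_protocol_options: {}
  health_checks:
  - timeout: 1s
    interval: 5s
    no_traffic_interval: 5s        # default is 60s; lowered so idle clusters are checked as often
    unhealthy_threshold: 1         # evict after a single failed check
    healthy_threshold: 1
    grpc_health_check: {}
  load_assignment:
    cluster_name: world
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: world       # hypothetical hostname
              port_value: 8081     # hypothetical port
```

Thresholds of 1 keep the demo responsive. A production setup would likely want higher thresholds to avoid flapping.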