aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service to service communication

Bug: Unable to create a persistent TCP connection through socket #359

Open robinvdvleuten opened 3 years ago

robinvdvleuten commented 3 years ago

SECURITY NOTICE: If you think you’ve found a potential security issue, please do not post it in the Issues. Instead, please follow the instructions here or email AWS security directly.

Summary

We are trying to set up a persistent TCP connection between two ECS tasks: one containing a worker and one containing a Faktory server (https://github.com/contribsys/faktory). We can reach the server task from within the worker task through App Mesh (Envoy sidecar), as verified with curl, which returns the expected greeting from the server (+HI {"v":2,"i":5211,"s":"365a858149c6e2d1"}). But the worker itself keeps getting EOF errors on its connection.
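For reference, the connectivity check looks roughly like this from inside the worker task (the hostname below is a placeholder, and 7419 is simply Faktory's default port, not necessarily our exact setup):

```bash
# Rough sketch of the curl check from inside the worker container.
# The virtual service hostname is a placeholder; 7419 is Faktory's default port.
curl -sv telnet://faktory.local:7419
# Prints the server greeting, e.g.
# +HI {"v":2,"i":5211,"s":"365a858149c6e2d1"}
```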

Steps to Reproduce

Start one task with a worker (https://github.com/contribsys/faktory_worker_go) and one with a Faktory server. Configure both with the Envoy sidecar as proxy.

Are you currently working around this issue?

Unfortunately, we cannot.

Additional context

Not that I am aware of.

lydell-manganti-blake commented 3 years ago

We are seeing a similar issue in our setup. We are using EKS (v1.18) and have implemented an ingress VirtualGateway in front of our APIs. Our application is Ruby, with the API in Elixir. We are seeing this error:

Excon::Error::Socket: EOFError (EOFError)

However, the health check returns 200 OK for both the app and the API.

karanvasnani commented 3 years ago

Hi @robinvdvleuten, is it possible for you to share your mesh configuration, specifically the Route spec for your backend virtual node? My current guess is that your setup is experiencing transient TCP connection errors which are not being retried and are instead reported back to the client as EOF errors. By default, we vend configuration to the Envoy that is resilient against these intermittent connection errors by attaching a default retry policy to every route. However, if a customer applies a custom retry policy, it overwrites the default one. Are you setting a custom retry policy? And if so, are you retrying on connection-error as documented here?
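To illustrate, a custom Route spec that still retries on transient connection errors would keep connection-error in tcpRetryEvents, roughly like this (mesh, router, route, and node names are placeholders, not taken from your setup):

```bash
# Sketch only: an HTTP route with a custom retry policy that still retries
# transient connection errors. All names and values are placeholders.
aws appmesh create-route \
  --mesh-name my-mesh \
  --virtual-router-name backend-router \
  --route-name backend-route \
  --spec '{
    "httpRoute": {
      "match": { "prefix": "/" },
      "action": {
        "weightedTargets": [ { "virtualNode": "backend-node", "weight": 1 } ]
      },
      "retryPolicy": {
        "maxRetries": 3,
        "perRetryTimeout": { "unit": "s", "value": 15 },
        "httpRetryEvents": [ "server-error", "gateway-error" ],
        "tcpRetryEvents": [ "connection-error" ]
      }
    }
  }'
```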

robinvdvleuten commented 3 years ago

@karanvasnani I just use the defaults provided by AWS App Mesh and am not using any routes, only virtual services and virtual nodes.

karanvasnani commented 3 years ago

@robinvdvleuten thanks for the update. Since you mentioned that it's a persistent TCP connection and that there are no Routes configured, I suspect the issue here could be the connection timeouts. By default, an idle timeout of 300 seconds is applied to the connection between the downstream Envoy and the upstream Envoy, as well as between the upstream Envoy and the application. You should be able to tell whether the timeout is the issue by looking at the Envoy debug logs or stats. If that's the case, you can change this default by setting up a Route and then adjusting the timeout on the Route as well as on the backend VirtualNode.
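For example, the idle timeout can be raised on both sides roughly like this (a sketch only: the names, port 7419, the hostname, and the one-hour value are placeholders, not a recommendation):

```bash
# Sketch: raise the idle timeout on a TCP route and on the backend virtual
# node's listener. Mesh/route/node names, port 7419, the hostname, and the
# 3600s value are placeholders.
aws appmesh update-route \
  --mesh-name my-mesh \
  --virtual-router-name backend-router \
  --route-name backend-route \
  --spec '{
    "tcpRoute": {
      "action": {
        "weightedTargets": [ { "virtualNode": "backend-node", "weight": 1 } ]
      },
      "timeout": { "idle": { "unit": "s", "value": 3600 } }
    }
  }'

aws appmesh update-virtual-node \
  --mesh-name my-mesh \
  --virtual-node-name backend-node \
  --spec '{
    "listeners": [ {
      "portMapping": { "port": 7419, "protocol": "tcp" },
      "timeout": { "tcp": { "idle": { "unit": "s", "value": 3600 } } }
    } ],
    "serviceDiscovery": { "dns": { "hostname": "faktory.local" } }
  }'
```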

lydell-manganti-blake commented 3 years ago

In my case with Excon::Error::Socket, we've identified the issue as a setting on the TargetGroup: client IP preservation, which is enabled by default. My setup is a Network Load Balancer with cross-zone load balancing in front of the VirtualGateway. We were getting connectivity issues depending on the Availability Zone: the IP address from AZ-a worked consistently, AZ-b did not work at all, and AZ-c worked about half of the time.

Once I disabled client IP preservation, we got consistent connectivity without the EOFError.
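For anyone else hitting this, the change amounts to flipping a single target group attribute (the ARN below is a placeholder):

```bash
# Sketch: disable client IP preservation on the NLB target group.
# The target group ARN is a placeholder.
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:region:123456789012:targetgroup/example/abc123 \
  --attributes Key=preserve_client_ip.enabled,Value=false
```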

robinvdvleuten commented 3 years ago

@karanvasnani the socket times out immediately and never receives any data, so I don't see how changing the 300-second timeout would help in this case. In some posts about Envoy I've read that Envoy expects the client socket to write/send immediately, but in our case the client waits for a greeting from the server.