emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

TLS handshake error: connection reset by peer or EOF #3004

Open: nagarajatantry opened this issue 4 years ago

nagarajatantry commented 4 years ago

Describe the bug: During a performance test, I configured both the Ambassador pods and my upstream service to scale up when CPU utilization exceeds 60%. When Ambassador and the upstream pods scale up at the same time, I start seeing 503 errors, and my upstream service (Go) logs the messages below. This does not happen when either Ambassador or the upstream service is pre-scaled.

2020/10/05 14:02:56 http: TLS handshake error from 100.122.153.167:56140: read tcp 100.99.240.5:9098->100.122.153.167:56140: read: connection reset by peer
2020/10/05 14:02:56 http: TLS handshake error from 100.122.153.167:58258: EOF
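The report does not say how autoscaling was configured. Assuming a Kubernetes HorizontalPodAutoscaler with a 60% average CPU utilization target, the documented HPA rule is desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), with no change when that ratio is within a tolerance of 1.0 (0.1 by default). The Go sketch below computes only that core rule; it deliberately omits the min/max replica bounds, stabilization windows, and pod-readiness handling the real controller applies. It illustrates how one load spike can add several pods to both Deployments in the same sync period, which is when the errors above appear. The function name `desiredReplicas` and the numbers used are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas sketches the core HorizontalPodAutoscaler rule:
// ceil(currentReplicas * currentUtil / targetUtil). If the ratio is within
// tolerance of 1.0, the replica count is left unchanged. Min/max bounds and
// stabilization windows are intentionally left out of this sketch.
func desiredReplicas(currentReplicas int, currentUtil, targetUtil, tolerance float64) int {
	ratio := currentUtil / targetUtil
	if math.Abs(ratio-1.0) <= tolerance {
		return currentReplicas
	}
	return int(math.Ceil(float64(currentReplicas) * ratio))
}

func main() {
	// Hypothetical load: Ambassador and the upstream service each run 4 pods
	// at 90% average CPU against a 60% target, so both scale from 4 to 6 pods
	// at the same time.
	ambassador := desiredReplicas(4, 90, 60, 0.1)
	upstream := desiredReplicas(4, 90, 60, 0.1)
	fmt.Println("ambassador:", ambassador, "upstream:", upstream)
}
```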

Expected behavior: Scale-up events complete without errors.

Additional context: I have tested with several different setups:

  1. AWS ALB --> Ambassador NodePort --> Ambassador Pods --> Upstream NodePort --> Upstream Service
  2. AWS NLB --> Ambassador Pods --> Upstream NodePort --> Upstream Service
  3. AWS ALB --> Upstream NodePort --> Upstream Service (no Ambassador)

In setups 1 and 2, I see upwards of 10k 503 errors (the count scales with the TPS), and the upstream logs show the same TLS handshake errors quoted above. I don't see this issue when Ambassador is not in the path (setup 3).

nagarajatantry commented 4 years ago

Any input on this?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

wissam-launchtrip commented 3 years ago

@tannaga How did you end up resolving this?

realfresh commented 1 year ago

I'm seeing a ton of these errors, about 10K error lines in the last 2 hours, and it's not even a production cluster. This is running on GKE.

Zebradil commented 1 year ago

I'm observing the same issue on GKE. Restarting pods helps, but the issue re-appears from time to time.

rishabhparikh commented 5 months ago

We're observing this on GKE too.

kflynn commented 5 months ago

Huh, @rishabhparikh and @Zebradil, what version of Emissary are you using?

Zebradil commented 5 months ago

Hi @kflynn, a year and a half ago we were evaluating Emissary-ingress and saw this issue. Since we decided to go with another solution, I don't have any further information on it.

kflynn commented 5 months ago

@Zebradil Thanks -- I meant to tag the folks who'd recently commented on this issue, and misread the year for you, mea culpa!