TLS handshake error: connection reset by peer or EOF

nagarajatantry commented 4 years ago

Describe the bug: During a performance test, I enabled autoscaling for both the Ambassador pods and my upstream service, triggered when CPU usage exceeds the 60% threshold. When scale-up events happen in the Ambassador and upstream pods at the same time, I start seeing 503 errors, along with the log messages below in my upstream service (Go). This does not happen when either Ambassador or the upstream service is pre-scaled.

2020/10/05 14:02:56 http: TLS handshake error from 100.122.153.167:56140: read tcp 100.99.240.5:9098->100.122.153.167:56140: read: connection reset by peer
2020/10/05 14:02:56 http: TLS handshake error from 100.122.153.167:58258: EOF

To Reproduce

1. Run an upstream service that exposes an endpoint over HTTPS: a Go service with a single POST endpoint. Nothing fancy; it sleeps for a few seconds and returns an empty body.
2. Install Ambassador as described in the documentation.
3. Apply the Ambassador Module and Mapping configuration shown below.

Expected behavior: Scale-up events complete without errors.

Versions (please complete the following information):

Ambassador: [1.7.3]
Kubernetes environment [in house]
Version [1.16]

Additional context: I have tested three different setups.

1. AWS ALB --> Ambassador NodePort --> Ambassador Pods --> Upstream NodePort --> Upstream Service
2. AWS NLB --> Ambassador Pods --> Upstream NodePort --> Upstream Service
3. AWS ALB --> Upstream NodePort --> Upstream Service (no Ambassador)

In setups 1 and 2, I see upwards of 10k 503 errors (proportional to the TPS) and the error messages below in the upstream logs. I don't see this issue when Ambassador is not in the path (setup 3).

2020/10/05 14:02:56 http: TLS handshake error from 100.122.153.167:56140: read tcp 100.99.240.5:9098->100.122.153.167:56140: read: connection reset by peer
2020/10/05 14:02:56 http: TLS handshake error from 100.122.153.167:58258: EOF
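
Go's `net/http` server logs `EOF` when the peer closes the connection before the TLS handshake completes, and `connection reset by peer` when the peer sends a TCP reset instead. To see how the two are distributed during a scale-up, the upstream log can be tallied with something like the sketch below. The function names `classify` and `countCauses` are made up for this example, and the matching relies only on the line format shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// classify returns "reset", "eof", or "other" for a TLS handshake error line,
// and "" for any line that is not a handshake error.
func classify(line string) string {
	const marker = "http: TLS handshake error from "
	i := strings.Index(line, marker)
	if i < 0 {
		return ""
	}
	rest := line[i+len(marker):]
	switch {
	case strings.HasSuffix(rest, "connection reset by peer"):
		return "reset"
	case strings.HasSuffix(rest, ": EOF"):
		return "eof"
	default:
		return "other"
	}
}

// countCauses tallies handshake errors by cause across a multi-line log.
func countCauses(logs string) map[string]int {
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		if c := classify(strings.TrimSpace(sc.Text())); c != "" {
			counts[c]++
		}
	}
	return counts
}

func main() {
	logs := `2020/10/05 14:02:56 http: TLS handshake error from 100.122.153.167:56140: read tcp 100.99.240.5:9098->100.122.153.167:56140: read: connection reset by peer
2020/10/05 14:02:56 http: TLS handshake error from 100.122.153.167:58258: EOF`
	fmt.Println(countCauses(logs)) // map[eof:1 reset:1]
}
```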

Module & mapping

---
apiVersion: getambassador.io/v2
kind: Module
metadata:
  name: ambassador
spec:
  config:
    use_proxy_proto: true
    diag_port: 8878
    diagnostics:
      enabled: true
    keepalive:
      time: 100
      interval: 10
      probes: 3
---
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  labels:
    app: jaeger
    env: dev
  name: otel-mapping
  namespace: otel
spec:
  circuit_breakers:
  - max_connections: 1000000000
    max_pending_requests: 1000000000
    max_requests: 1000000000
  cors:
    credentials: true
    headers:
    - Content-Type
    - Authorization
    - Accept
    - x-opentelemetry-outgoing-request
    max_age: "86400"
    methods:
    - POST
    - GET
    - OPTIONS
    origins:
    - '*'
  grpc: false
  load_balancer:
    header: sessionid
    policy: ring_hash
  retry_policy:
    retry_on: "5xx"
    num_retries: 2
  prefix: /v1/trace
  resolver: endpoint
  rewrite: /v1/trace
  service: https://otelsvc:9098

nagarajatantry commented 4 years ago

Any input on this?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

wissam-launchtrip commented 3 years ago

@tannaga How did you end up resolving this?

realfresh commented 1 year ago

I'm seeing a ton of these errors, about 10K error lines in the last 2 hours, and it's not even a production cluster. This is running on GKE.

Zebradil commented 1 year ago

I'm observing the same issue on GKE. Restarting pods helps, but the issue re-appears from time to time.

rishabhparikh commented 5 months ago

We're observing this on GKE too.

kflynn commented 5 months ago

Huh, @rishabhparikh and @Zebradil, what version of Emissary are you using?

Zebradil commented 5 months ago

Hi @kflynn, one and a half years ago we were evaluating Emissary-ingress and saw this issue. But since we decided to go with another solution, I don't have any additional information on this issue anymore.

kflynn commented 5 months ago

@Zebradil Thanks -- I meant to tag the folks who'd recently commented on this issue, and misread the year for you, mea culpa!

emissary-ingress / emissary

TLS handshake error: connection reset by peer or EOF #3004