contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

connection reset by peer #354

Closed: rakeshkarnati1 closed this issue 3 years ago

rakeshkarnati1 commented 3 years ago

Are you using an old version? Yes

Have you checked the changelogs to see if your issue has been fixed in a later version? Yes

https://github.com/contribsys/faktory/blob/master/Changes.md
https://github.com/contribsys/faktory/blob/master/Pro-Changes.md
https://github.com/contribsys/faktory/blob/master/Ent-Changes.md

mperham commented 3 years ago

> Are you using an old version? No

In fact, you are. 1.5.1 is the latest version. Please try the latest.

rakeshkarnati1 commented 3 years ago

That didn't help either.

2021/05/11 21:26:00.359638 dial tcp 172.18.76.56:7419: i/o timeout
2021/05/11 21:26:00.663683 dial tcp 172.18.77.38:7419: i/o timeout
2021-05-11T16:26:00-05:00 ERRO             dial tcp 172.18.77.38:7419: i/o timeout factory/factory.go:34
panic: dial tcp 172.18.77.38:7419: i/o timeout
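For reference, the `i/o timeout` above is a plain TCP connect timeout: the dial to port 7419 never completes, which usually points at network reachability (security groups, routing, VPN) rather than Faktory itself. A minimal probe that reproduces the same failure mode in Go, with the address as a placeholder:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Placeholder address; substitute the real Faktory host and port.
	addr := "172.18.77.38:7419"

	// A bare TCP dial with a deadline: if this times out, the problem is
	// reachability of the port, not the Faktory server or client library.
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		fmt.Println("unreachable:", err)
		return
	}
	conn.Close()
	fmt.Println("port is reachable")
}
```
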
mperham commented 3 years ago

I don’t know what’s wrong. There are several other io timeout issues, search and see if any cover your situation.

crogers-ori commented 2 years ago

Ignore all this, just saw a new "Bad connection: read tcp" log 😢

My last 10 logs have all been in this format:

Bad connection: read tcp FaktoryServer:7419->PrivateNLB:36844: read: connection reset by peer

I do not have any workers/clients alive right now, and the UI shows 0 connections. Maybe it is the TCP health checks?

old

Yesterday I started working on deploying Faktory Ent to our dev environment as an ECS service with TLS in front of it.

I saw these same `Bad connection: read tcp privateIP:7419->privateIP:24984: read: connection reset by peer` errors this morning. I had a Ruby and a Python worker running in Docker on my local machine, pointing at a deployed Faktory server on AWS ECS.

I was able to run jobs just fine, with no errors on the client or workers.

At first I thought it might be the health checks from the target groups, but they seemed fine. So, focusing on port 7419, I shut down the clients and workers on my local machine (running in their own containers), and the logs in AWS ECS stopped.

My guess is that something (my VPN?) is not allowing the return traffic for some keep-alive that the workers are sending.

If I were going to keep digging, I would look at the Ruby or Python worker code and see whether either of them has special keep-alive logic for TLS.

crogers-ori commented 2 years ago

A couple additional thoughts, because I'm concerned this is going to cause issues in production.

  1. My TLS connection is terminated at the NLB, but the NLB -> ECS container leg is plain TCP. Maybe the keep-alive at https://github.com/contribsys/faktory/blob/3d23fca667d9d459ead0a445cd9b8ab45b25cb08/client/client.go#L191 should be plain TCP as well? (See the sketch at the end of this comment.)

  2. I originally deployed my DEV version with no TLS and no load balancer, just a Route 53 entry straight to the private IP. Looking through days of those logs I don't see any "Bad connection: read tcp" errors. So I think the two significant differences are the AWS NLB and the TLS listener.

Maybe I'll remove the TLS in dev (staging) to see if that fixes it.
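For context on point 1: as far as I can tell, the keep-alive linked there is the standard operating-system TCP keep-alive set on the client socket, not anything TLS-specific. A rough sketch of what that kind of configuration looks like in Go (the address is a placeholder and the interval is only illustrative):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Placeholder address for the Faktory command port.
	conn, err := net.DialTimeout("tcp", "faktory.example.internal:7419", 5*time.Second)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()

	// TCP keep-alive is configured on the underlying *net.TCPConn,
	// independently of whether TLS is layered on top of the connection.
	if tcpConn, ok := conn.(*net.TCPConn); ok {
		tcpConn.SetKeepAlive(true)
		tcpConn.SetKeepAlivePeriod(30 * time.Second)
	}
}
```
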

crogers-ori commented 2 years ago

I replaced the TLS listener with a plain TCP listener on 7419, redeployed the service (new container), and the logs continued.

So for now I have removed the NLB completely and I'm just pointing at the private ip. There have not been any Bad Connection logs at all since I made that change.

I guess that leaves the NLB as the problem? Maybe the way the server reaches out to clients is not compatible with AWS NLBs?

I don't know, but it would be nice to be able to let AWS manage the TLS and take advantage of NLB/target groups with an ECS service for resiliency.

mperham commented 2 years ago

Thanks for debugging and keeping us informed. I don't have any good suggestions. Is there any documentation on limitations of TLS or keepalive support in NLBs?

crogers-ori commented 2 years ago

I didn't have any client/workers connected, so I think keepalive in this context is an outbound call from the Faktory server.

In fact, in my production cluster I have never even tested the connection string, but these "Bad connection: read tcp" logs show up every 5 minutes or so. That has to be the server itself making outbound connections, right?

So even though AWS defines some NLB keepalive specs here, https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout, I think they are using "client" to mean the thing sending inbound requests.

Based on your better understanding of the Faktory server architecture, do you know of any reason the Faktory server would be reaching outbound as the initiator (in a way that the NLB might not forward)? Is Faktory trying to find connected clients with a "ping", even if none are connected?

crogers-ori commented 2 years ago

When you think about health checks for a web server it makes sense to build a /healthcheck endpoint. Something simple that tells you the webserver is alive, routing works, and application code can run.

It occurred to me this morning that a TCP health check on port 7419 might have a similar effect, except that it triggers client/server communication that goes beyond a simple health check. On the target group side, the health check connection succeeds and the target is healthy. On the Faktory side, it looks like a client started the protocol handshake but never completed it.
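To make that concrete, an NLB TCP health check does little more than open a connection to the port and then drop it without ever speaking the Faktory protocol. A minimal sketch of that behavior in Go (the address is a placeholder):

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Mimic a bare TCP health check: connect, confirm the port accepted,
	// then drop the connection without sending any protocol data.
	conn, err := net.DialTimeout("tcp", "faktory.example.internal:7419", 3*time.Second)
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	// Abandoning the connection here (an NLB health check typically resets
	// it) is presumably what the server then logs as "Bad connection:
	// read tcp ...: connection reset by peer" while it is still waiting
	// for the client half of the handshake.
	conn.Close()
	fmt.Println("health check passed")
}
```
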

To test this theory I changed the target group that is forwarding to port 7419 to use an overriding port of 7420 for the TCP health check.

Super simple change, and my logs in production are dead quiet.

Sorry for all the confusion; hopefully this helps others configure optimal health checks for Faktory.

frankylee commented 2 years ago

> To test this theory I changed the target group that is forwarding to port 7419 to use an overriding port of 7420 for the TCP health check.

@crogers-operationalresults Can you explain this in greater detail? I am seeing similar logs and am hoping this could be a solution for me as well.

crogers-ori commented 2 years ago

@frankylee Because Faktory exposes multiple ports on the same container, I have 2 target groups connecting to that single ECS service. The first target group, 7419, is where I had to override the health check port to be 7420. The second target group, 7420, uses its own port (7420). So both target groups use 7420 for health checks.

Here is abbreviated output from `aws elbv2 describe-target-groups --names dev-faktory-ui dev-faktory-server | jq ".TargetGroups"`. Notice the combinations of Port and HealthCheckPort.

[
  {
    "TargetGroupName": "dev-faktory-server",
    "Protocol": "TCP",
    "Port": 7419,
    "HealthCheckProtocol": "TCP",
    "HealthCheckPort": "7420",
    "HealthCheckEnabled": true,
    "TargetType": "ip",
    "IpAddressType": "ipv4"
  },
  {
    "TargetGroupName": "dev-faktory-ui",
    "Protocol": "TCP",
    "Port": 7420,
    "HealthCheckProtocol": "TCP",
    "HealthCheckPort": "traffic-port",
    "HealthCheckEnabled": true,
    "TargetType": "ip",
    "IpAddressType": "ipv4"
  }
]
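For anyone who wants to apply the same health check override from code rather than the console, a rough sketch using aws-sdk-go-v2 might look like the following (the target group ARN is a placeholder and error handling is minimal):

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	elbv2 "github.com/aws/aws-sdk-go-v2/service/elasticloadbalancingv2"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := elbv2.NewFromConfig(cfg)

	// Point the 7419 (Faktory protocol) target group's health check at the
	// 7420 web UI port, matching the dev-faktory-server config shown above.
	// The ARN below is a placeholder.
	_, err = client.ModifyTargetGroup(context.TODO(), &elbv2.ModifyTargetGroupInput{
		TargetGroupArn:  aws.String("arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/dev-faktory-server/EXAMPLE"),
		HealthCheckPort: aws.String("7420"),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("health check port updated")
}
```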