Closed worm-emoji closed 2 years ago
Here is the last set of errors emitted by Faktory:
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:58005: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:48332: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:54176: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:23729: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:1559: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:38166: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:5503: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:20218: read: connection reset by peer
2021-10-05T23:50:15.557Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.6.66:33557: use of closed network connection
2021-10-05T23:50:15.557Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.78.217:47040: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.60.205:35858: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:1091: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.2.124:52490: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.28.14:49080: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.53.162:18402: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.4.0:61214: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.86.243:10521: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.25.235:1165: use of closed network connection
However, when I spin up new workers, they still can't connect. I need to restart Faktory.
Hmm, that's worrying. I/O timeout from the heartbeat can happen if you have a license for 100 connections but you are trying to use many hundreds more worker connections. But v1.5.4 should print a warning if you are using more connections than licensed.
I will spend some time trying to reproduce the issue. Let me know if you get any more data or leads...
Yeah, my main thought here is whether Faktory might be holding on to connections longer than needed. I was looking into some stuff last night; not sure how relevant this link still is, but I'm putting it here as a potential resource: https://www.zombiezen.com/blog/2018/01/canceling-io-in-go-capnproto/
I've found a bug which causes Faktory to lose track of connections from workers, leading Faktory to think that the worker has no more connections and thus deleting its heartbeat and disallowing new connections. I suspect this is the root of your issue. Fix coming today.
Is it possible for you to build master, to verify that it fixes the issue for you?
Yes – how do I go about building master and preserving faktory enterprise?
Ah right, that's going to be a problem. I can build you a binary. Can you run that binary directly or do you need a Docker image? ARM or x86_64?
I can run the binary directly. x86_64. I'll send you an email right now so you can deliver it to me :)
Ran the binary with the change (pretty confident it has the change because the workers now have blue labels indicating the worker library, which is different than what was shipped in the .deb). After a couple of hours, still seeing new clients get in a stuck state:
{"level":"info","time":1633562899,"message":"faktory_worker_go 1.5.0 PID 1 now ready to process jobs"}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562915}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562930}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562945}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562960}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562975}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562990}
The interesting thing about this change, however, is that the "use of closed network connection" logs are no longer present. Here's the entirety of the errors on this deployment of Faktory with this change:
E 2021-10-06T22:24:43.545Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:63759: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:1688: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:42386: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:32964: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:14348: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:35727: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:51147: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:44075: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:46036: read: connection reset by peer
I looked a bit closer at your commit that fixes this, and my unsolicited 0.02 is that I think you might not be able to just defer the close – you might need to read the value for an error and handle this case differently. Not an expert on Faktory or Golang I/O so feel free to ignore!
I've been processing 5 jobs/sec for 2 hours now with no errors or issues that I can see. My script creates an assortment of normal jobs, unique jobs, a cron job or two and batches. You might need to give more context or create a repro app if I can't reproduce the issue locally.
I can try to do that. I think the way to accelerate this behavior would be to make a worker job that panics.
Please open a new issue if you can repro. Closing for now...
Could you let me know what you did to solve this? I have the same issue now!
I couldn't figure it out; we moved off Faktory to solve it
After a period of time, the Faktory service becomes unreachable. All network I/O fails, but the EC2 instance itself is otherwise fine. I can also access Faktory's web UI totally fine, but all workers are missing.
Which Faktory package and version? Faktory Enterprise v1.5.4
Which Faktory worker package and version? faktory_worker_go v1.5.0
The workers print logs like this:
Faktory's logs are even totally ok while clients can't connect:
(New logs are being appended to this).
Would love to know things to try here :)