contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

After a few hours of uptime, Faktory becomes unreachable for clients #382

Closed – worm-emoji closed this 2 years ago

worm-emoji commented 3 years ago

After a period of time, the Faktory service becomes unreachable to clients: all network I/O to it fails, even though the EC2 instance itself is otherwise fine. I can still access Faktory's web UI without issue, but all workers are missing from it.

The workers print logs like this:

{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633504909}

Redis INFO
# Server
redis_version:6.0.5
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:0.0.0
process_id:1
run_id:280d674b855988b8a8a47666543d2266658073c7
tcp_port:6379
uptime_in_seconds:2518036
uptime_in_days:29
hz:10
configured_hz:10
lru_clock:6115538
executable:-
config_file:-

# Clients
connected_clients:6
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

# Memory
used_memory:298047160
used_memory_human:284.24M
used_memory_rss:396111872
used_memory_rss_human:377.76M
used_memory_peak:3046244416
used_memory_peak_human:2.84G
used_memory_peak_perc:9.78%
used_memory_overhead:6978400
used_memory_startup:4536512
used_memory_dataset:291068760
used_memory_dataset_perc:99.17%
allocator_allocated:298972056
allocator_active:333066240
allocator_resident:396783616
used_memory_lua:59392
used_memory_lua_human:58.00K
used_memory_scripts:3168
used_memory_scripts_human:3.09K
number_of_cached_scripts:4
maxmemory:10527885773
maxmemory_human:9.80G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.11
allocator_frag_bytes:34094184
allocator_rss_ratio:1.19
allocator_rss_bytes:63717376
rss_overhead_ratio:1.00
rss_overhead_bytes:-671744
mem_fragmentation_ratio:1.33
mem_fragmentation_bytes:98066136
mem_not_counted_for_evict:0
mem_replication_backlog:1048576
mem_clients_slaves:34484
mem_clients_normal:103452
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:897259438
rdb_bgsave_in_progress:0
rdb_last_save_time:1630987454
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:1933312
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# SSL
ssl_enabled:no
ssl_current_certificate_not_before_date:(null)
ssl_current_certificate_not_after_date:(null)
ssl_current_certificate_serial:(null)

# Stats
total_connections_received:53358
total_commands_processed:7129959442
instantaneous_ops_per_sec:58
total_net_input_bytes:3143691721414
total_net_output_bytes:527837873815
instantaneous_input_kbps:16.45
instantaneous_output_kbps:70.25
rejected_connections:0
sync_full:1
sync_partial_ok:2
sync_partial_err:1
expired_keys:16839951
expired_stale_perc:3.38
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:141774
evicted_keys:0
keyspace_hits:353490374
keyspace_misses:2442364821
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:10749
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0

# Replication
role:master
connected_slaves:2
slave0:ip=10.37.0.173,port=6379,state=online,offset=1038440497024,lag=1
slave1:ip=10.37.0.88,port=6379,state=online,offset=1038440497153,lag=1
master_replid:f7f160cdf905b7d080590b1da679b4ddc701118e
master_replid2:519f6e6531fbfd6e4755af24f9cf653c1f17b691
master_repl_offset:1038440522241
second_repl_offset:863271465602
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:1038439473666
repl_backlog_histlen:1048576

# CPU
used_cpu_sys:23460.471214
used_cpu_user:44697.677385
used_cpu_sys_children:1.289438
used_cpu_user_children:7.026939

# Modules

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=11505,expires=11155,avg_ttl=12538549

Disk Usage
> df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       249G   14G  235G   6% /
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           3.1G  848K  3.1G   1% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/loop0       34M   34M     0 100% /snap/amazon-ssm-agent/3552
/dev/loop1       25M   25M     0 100% /snap/amazon-ssm-agent/4046
/dev/loop2       56M   56M     0 100% /snap/core18/2128
/dev/loop3       62M   62M     0 100% /snap/core20/1081
/dev/loop4       33M   33M     0 100% /snap/snapd/13170
/dev/loop5       33M   33M     0 100% /snap/snapd/12883
/dev/loop6       68M   68M     0 100% /snap/lxd/21545
/dev/loop7       56M   56M     0 100% /snap/core18/1997
/dev/loop8       71M   71M     0 100% /snap/lxd/21029

Faktory's own logs look completely normal even while clients can't connect:

Oct 06 07:32:55 faktory[189443]: I 2021-10-06T07:32:55.453Z Retries processed 19 jobs
Oct 06 07:33:00 faktory[189443]: I 2021-10-06T07:33:00.451Z Retries processed 15 jobs
Oct 06 07:33:05 faktory[189443]: I 2021-10-06T07:33:05.456Z Retries processed 23 jobs

(New log lines are still being appended.)

Would love to know things to try here :)

worm-emoji commented 3 years ago

Here is the last set of errors emitted by Faktory:

2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:58005: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:48332: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:54176: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:23729: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:1559: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:38166: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:5503: read: connection reset by peer
2021-10-05T23:22:26.049Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.37.159:20218: read: connection reset by peer
2021-10-05T23:50:15.557Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.6.66:33557: use of closed network connection
2021-10-05T23:50:15.557Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.78.217:47040: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.60.205:35858: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:1091: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.2.124:52490: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.28.14:49080: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.53.162:18402: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.4.0:61214: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.86.243:10521: use of closed network connection
2021-10-05T23:50:30.553Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.25.235:1165: use of closed network connection

However, when I spin up new workers, they still can't connect. I need to restart Faktory.

mperham commented 3 years ago

Hmm, that's worrying. I/O timeout from the heartbeat can happen if you have a license for 100 connections but you are trying to use many hundreds more worker connections. But v1.5.4 should print a warning if you are using more connections than licensed.

I will spend some time trying to reproduce the issue. Let me know if you get any more data or leads...

worm-emoji commented 3 years ago

Yeah, my main thought here is that Faktory may be holding on to connections longer than needed. I was looking into some stuff last night; not sure how relevant this link still is, but putting it here as a potential resource: https://www.zombiezen.com/blog/2018/01/canceling-io-in-go-capnproto/
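
To make the idea from that post concrete, here is a minimal Go sketch (not Faktory code, all names invented) of the pattern it describes: a blocked net.Conn read can only be interrupted by a read deadline or by closing the connection from another goroutine.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// readWithCancel bounds a blocking Read with a deadline and also unblocks it
// when ctx is cancelled, by closing the connection from a watcher goroutine.
func readWithCancel(ctx context.Context, conn net.Conn, buf []byte) (int, error) {
	// A deadline caps how long a single Read may block.
	_ = conn.SetReadDeadline(time.Now().Add(30 * time.Second))

	done := make(chan struct{})
	go func() {
		select {
		case <-ctx.Done():
			conn.Close() // forces the blocked Read below to return an error
		case <-done:
		}
	}()
	defer close(done)

	return conn.Read(buf)
}

func main() {
	// net.Pipe gives an in-memory connection pair to exercise the helper.
	a, b := net.Pipe()
	defer b.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Nothing is ever written to b, so the read is released by cancellation.
	_, err := readWithCancel(ctx, a, make([]byte, 1))
	fmt.Println("read returned:", err)
}
```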

mperham commented 3 years ago

I've found a bug which causes Faktory to lose track of connections from workers, leading it to think a worker has no connections left, delete its heartbeat, and disallow new connections from it. I suspect this is the root of your issue. Fix coming today.
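
A hypothetical sketch of the kind of bookkeeping being described – invented names, not Faktory's actual code: each worker id has a connection count and a heartbeat entry, so if disconnects are over-counted, the heartbeat gets deleted while sockets are still live and further connections are rejected.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type workerState struct {
	connections int
	lastBeat    time.Time
}

type registry struct {
	mu      sync.Mutex
	workers map[string]*workerState
}

func (r *registry) connect(wid string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	ws, ok := r.workers[wid]
	if !ok {
		ws = &workerState{lastBeat: time.Now()}
		r.workers[wid] = ws
	}
	ws.connections++
}

func (r *registry) disconnect(wid string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	ws, ok := r.workers[wid]
	if !ok {
		return
	}
	ws.connections--
	if ws.connections <= 0 {
		// Dropping the heartbeat entry is fine when the last connection is
		// really gone; if disconnects are over-counted it happens too early.
		delete(r.workers, wid)
	}
}

func (r *registry) beat(wid string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	ws, ok := r.workers[wid]
	if !ok {
		return errors.New("unknown worker: heartbeat entry was deleted")
	}
	ws.lastBeat = time.Now()
	return nil
}

func main() {
	r := &registry{workers: map[string]*workerState{}}
	r.connect("wid-1")
	r.connect("wid-1") // two live sockets from the same worker

	// One socket drops, but the disconnect is (hypothetically) counted twice.
	r.disconnect("wid-1")
	r.disconnect("wid-1")

	// The surviving socket's heartbeat now fails and new connections are refused.
	if err := r.beat("wid-1"); err != nil {
		fmt.Println("BEAT failed:", err)
	}
}
```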

mperham commented 3 years ago

Is it possible for you to build master, to verify that it fixes the issue for you?

worm-emoji commented 3 years ago

Yes – how do I go about building master while preserving Faktory Enterprise?

mperham commented 3 years ago

Ah right, that's going to be a problem. I can build you a binary. Can you run that binary directly or do you need a Docker image? ARM or x86_64?

worm-emoji commented 3 years ago

I can run the binary directly. x86_64. I'll send you an email right now so you can deliver it to me :)

worm-emoji commented 3 years ago

Ran the binary with the change (I'm pretty confident it has the change because the workers now have blue labels indicating the worker library, which is different from what was shipped in the .deb). After a couple of hours, I'm still seeing new clients get into a stuck state:

{"level":"info","time":1633562899,"message":"faktory_worker_go 1.5.0 PID 1 now ready to process jobs"}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562915}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562930}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562945}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562960}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562975}
{"level":"error","payload":["heartbeat error: dial tcp 172.31.53.238:7419: i/o timeout"],"time":1633562990}

worm-emoji commented 3 years ago

The interesting thing about this change, however, is that the "use of closed network connection" logs are no longer present. Here's the entirety of the errors on this deployment of Faktory with the change:

E 2021-10-06T22:24:43.545Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:63759: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:1688: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:42386: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:32964: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:14348: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:35727: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:51147: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:44075: read: connection reset by peer
E 2021-10-06T23:28:18.816Z Unexpected socket error: read tcp 172.31.53.238:7419->192.168.62.50:46036: read: connection reset by peer

worm-emoji commented 3 years ago

I looked a bit closer at your commit that fixes this, and my unsolicited $0.02 is that I don't think you can just defer the close – you might need to read the returned value for an error and handle that case differently. I'm not an expert on Faktory or Go I/O, so feel free to ignore!
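
Illustrative only (not the actual Faktory commit): the general Go pattern being suggested is to inspect the read/close error and branch on the specific failure instead of discarding it, for example:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
)

// serve sketches the error-handling idea: rather than only deferring the
// close, inspect the Read error and treat a clean EOF, a connection we closed
// ourselves, and a reset-by-peer differently.
func serve(conn net.Conn) error {
	defer conn.Close() // a bare defer silently drops Close's own error

	buf := make([]byte, 4096)
	for {
		n, err := conn.Read(buf)
		if err != nil {
			switch {
			case errors.Is(err, io.EOF):
				return nil // peer shut down cleanly
			case errors.Is(err, net.ErrClosed):
				return nil // we closed it ourselves ("use of closed network connection")
			default:
				// e.g. "connection reset by peer" – may deserve different
				// bookkeeping than a clean close.
				return fmt.Errorf("unexpected socket error: %w", err)
			}
		}
		_ = buf[:n] // ... process the frame ...
	}
}

func main() {
	// Tiny demo with an in-memory connection pair.
	a, b := net.Pipe()
	go func() {
		b.Write([]byte("hi"))
		b.Close()
	}()
	fmt.Println(serve(a)) // prints <nil>: the peer closed cleanly
}
```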

mperham commented 3 years ago

I've been processing 5 jobs/sec for 2 hours now with no errors or issues that I can see. My script creates an assortment of normal jobs, unique jobs, a cron job or two and batches. You might need to give more context or create a repro app if I can't reproduce the issue locally.

worm-emoji commented 3 years ago

I can try to do that. I think the way to accelerate this behavior would be to make a worker job that panics.
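
A minimal repro sketch along those lines, assuming the faktory_worker_go v1.5 API where handlers take a context.Context (the job name and concurrency here are arbitrary):

```go
package main

import (
	"context"

	worker "github.com/contribsys/faktory_worker_go"
)

// alwaysPanics simulates a crashing job handler to stress the failure path.
func alwaysPanics(ctx context.Context, args ...interface{}) error {
	panic("simulated job crash")
}

func main() {
	mgr := worker.NewManager()
	mgr.Concurrency = 10
	mgr.Register("PanicJob", alwaysPanics)
	// Connects using FAKTORY_URL (default localhost:7419) and blocks.
	mgr.Run()
}
```

Feeding it a steady stream of these jobs should churn worker connections quickly – either the library recovers the panic and FAILs the job, or the process crashes and restarts, both of which exercise connection setup and teardown.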

mperham commented 2 years ago

Please open a new issue if you can repro. Closing for now...

jkhas8 commented 1 year ago

Could you let me know what you did to solve this? I have the same issue now!

worm-emoji commented 1 year ago

I couldn't figure it out; we moved off Faktory to solve it.