contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

Error read tcp i/o timeout in Kubernetes #375

Closed: aaroalan closed this issue 2 years ago

aaroalan commented 3 years ago

Server: 1.5.2 (test project) and 1.4.2 (production). Client: 1.5.3.

We have the Faktory server and workers running inside a Kubernetes cluster and never detected any issues, but after the last client update (1.5.3) an error started to pop up in the worker logs (I think after this commit). According to Changes.md this is caused by a slow or misconfigured network. I have tried different Kubernetes configs but am still getting the error.

error log example:

2021/08/31 19:04:35.057781 read tcp 10.1.0.57:42730->10.100.34.52:7419: i/o timeout
2021/08/31 19:04:35.259068 read tcp 10.1.0.57:42732->10.100.34.52:7419: i/o timeout
2021/08/31 19:04:35.259837 read tcp 10.1.0.57:42734->10.100.34.52:7419: i/o timeout
2021/08/31 19:04:35.422802 read tcp 10.1.0.57:42736->10.100.34.52:7419: i/o timeout
2021/08/31 19:04:35.711630 read tcp 10.1.0.57:42742->10.100.34.52:7419: i/o timeout
2021/08/31 19:04:35.711727 read tcp 10.1.0.57:42740->10.100.34.52:7419: i/o timeout

The workers seem to be connected because they appear in the dashboard, but they throw the error many times per second, which generates a ton of logs. Client 1.5.2 seems to work correctly (no error logs + jobs execute), and nothing in the network config changed.

I haven't found any Kubernetes or Faktory config change that fixes this. I understand it is not a "real" issue, just the way that it works, but the Kubernetes network also looks like it is working as expected, so I would appreciate any help.

I created a repo with the files required to reproduce it locally (Docker and Kubernetes are required to build/deploy): https://github.com/aaroalan/faktory_timeout

mperham commented 3 years ago

Based on the ports, it looks like maybe slow HTTP connections to the Web UI? It's hard to diagnose the underlying cause; perhaps this is an issue with Keep-Alives or proxied connections. I don't run Kubernetes, so it's hard for me to debug. I'd love to have any other Kubernetes experts chime in with thoughts.

aaroalan commented 3 years ago

Sorry, the logs that I originally added to the issue had port 80 because I took them from the latest test that I did (I have updated them), but the logs are from the worker container and the errors show port 7419.

mperham commented 3 years ago

Show me the contents of your /debug page.

aaroalan commented 3 years ago

This is running locally, using the project that I created to reproduce it:

[Screenshot of the /debug page, taken 2021-08-31]

Redis Info
# Server
redis_version:6.0.14
redis_git_sha1:ecf4164e
redis_git_dirty:0
redis_build_id:bd2e1423b53357c
redis_mode:standalone
os:Linux 5.10.47-linuxkit x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:10.2.1
process_id:14
run_id:165afd03b95c29852a1233af2eee172641a2a1f2
tcp_port:0
uptime_in_seconds:46
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:3053523
executable:/usr/bin/redis-server
config_file:/tmp/redis.conf
io_threads_active:0

# Clients
connected_clients:14
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:10
tracking_clients:0
clients_in_timeout_table:10

# Memory
used_memory:1060497
used_memory_human:1.01M
used_memory_rss:2666496
used_memory_rss_human:2.54M
used_memory_peak:1157955
used_memory_peak_human:1.10M
used_memory_peak_perc:91.58%
used_memory_overhead:1017972
used_memory_startup:779576
used_memory_dataset:42525
used_memory_dataset_perc:15.14%
allocator_allocated:1026299
allocator_active:2628608
allocator_resident:2628608
total_system_memory:5176152064
total_system_memory_human:4.82G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:2.56
allocator_frag_bytes:1602309
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.01
rss_overhead_bytes:37888
mem_fragmentation_ratio:2.60
mem_fragmentation_bytes:1640197
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:238284
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:2
rdb_bgsave_in_progress:0
rdb_last_save_time:1630443429
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:16
total_commands_processed:265
instantaneous_ops_per_sec:4
total_net_input_bytes:11619
total_net_output_bytes:8511
instantaneous_input_kbps:0.24
instantaneous_output_kbps:0.02
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:5
evicted_keys:0
keyspace_hits:0
keyspace_misses:31
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_reads_processed:268
total_writes_processed:255
io_threaded_reads_processed:0
io_threaded_writes_processed:0

# Replication
role:master
connected_slaves:0
master_replid:266cf708579c942375a97b38452c2b62ca1a6607
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:0.161970
used_cpu_user:0.061094
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000

# Modules

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=2,expires=0,avg_ttl=0

Disk Usage
> df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  58.4G     24.1G     31.3G  44% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                     2.4G         0      2.4G   0% /sys/fs/cgroup
/dev/vda1                58.4G     24.1G     31.3G  44% /dev/termination-log
/dev/vda1                58.4G     24.1G     31.3G  44% /etc/resolv.conf
/dev/vda1                58.4G     24.1G     31.3G  44% /etc/hostname
/dev/vda1                58.4G     24.1G     31.3G  44% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                     2.4G     12.0K      2.4G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                     2.4G         0      2.4G   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                    64.0M         0     64.0M   0% /proc/sched_debug
tmpfs                     2.4G         0      2.4G   0% /sys/firmware

mperham commented 3 years ago

Everything looks fine from Faktory's POV. I don't run k8s so I can't advise on that side. There must be some flag or setting that is triggering this behavior.

aaroalan commented 3 years ago

I updated the server to 1.5.3 and downgraded the clients to 1.5.2, with no change in the network config, and everything looks correct: 30 workers are processing about 400 jobs/second. So the network does not really look like the issue, apart from the logs I was getting.

I will continue investigating and will update the issue if I find a solution on the K8s side and/or touch base with you about a code change in Faktory.

robpingxu commented 2 years ago

We had a similar issue today in our dev environment. Looks like when we updated golang versions recently, we went up to github.com/contribsys/faktory v1.5.3. Same types of i/o timeout logs constantly.

We reverted back to github.com/contribsys/faktory v1.5.1 and the issue went away.

Server is v1.5.3.
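
In case it helps anyone else, the revert is just pinning the module back in go.mod (excerpt only; versions are from our setup):

// go.mod excerpt: pin the client below the release that introduced the deadline
require github.com/contribsys/faktory v1.5.1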

aaroalan commented 2 years ago

I got the same issue in development using a docker network. The line that causes the issue is:

// client.go:220
...
// Sets an absolute deadline on the connection: any read/write attempted more
// than 5 seconds after this point fails with "i/o timeout".
_ = x.SetDeadline(time.Now().Add(5 * time.Second))
...

I don't think it is a "real" timeout, but some kind of conflict between the timeout detection and the way that k8s and Docker networks work: even if you change the duration to 1 minute, the log starts showing up with an idle worker (after one minute), and if you remove the timeout the worker processes the jobs correctly and there is no log.
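
Here is a minimal, self-contained sketch (not Faktory's actual code; the toy server, port and timings are made up) that reproduces the same "i/o timeout" with a one-shot absolute deadline on an otherwise healthy, idle connection:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Toy listener standing in for the Faktory server (the real one is on :7419).
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		c, _ := ln.Accept()
		defer c.Close()
		time.Sleep(3 * time.Second) // stay quiet, like a server with no jobs to hand out
		c.Write([]byte("+OK\r\n"))
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// One-shot absolute deadline, as in client.go:220: every read/write after
	// this instant fails, no matter how healthy the network is.
	_ = conn.SetDeadline(time.Now().Add(1 * time.Second))

	buf := make([]byte, 64)
	_, err = conn.Read(buf) // blocks past the deadline waiting for the reply
	fmt.Println(err)        // read tcp 127.0.0.1:...->127.0.0.1:...: i/o timeout
}

Remove the SetDeadline line (or keep pushing the deadline forward before each read) and the same program simply waits for the reply and succeeds.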

I tried different network configs (simple changes) in k8s but am still getting the log. We also downgraded the client to fix it temporarily, but haven't had time to investigate in more detail.

Maybe a "simple" solution would be to make the deadline optional based on an ENV var if we can't find a k8s or docker config that is compatible with it.
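
Something like this, where FAKTORY_SKIP_DEADLINE is a made-up name just to illustrate the idea (sketch only, not the real client code):

package client // sketch only; not the real package layout

import (
	"net"
	"os"
	"time"
)

// maybeSetDeadline applies the 5s deadline unless the (hypothetical)
// FAKTORY_SKIP_DEADLINE env var is set, e.g. in k8s/Docker deployments
// where the deadline keeps tripping on idle connections.
func maybeSetDeadline(conn net.Conn) {
	if os.Getenv("FAKTORY_SKIP_DEADLINE") != "" {
		return // opt out: leave the connection without a deadline
	}
	_ = conn.SetDeadline(time.Now().Add(5 * time.Second))
}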

mperham commented 2 years ago

This was a painful bug. The deadline effectively made any I/O on the socket fail after 5 seconds, so the Faktory client was constantly erroring out and setting up new connections. It has been removed on the main branch, and I will release 1.5.4 this week so people aren't stuck on an old version.
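
(For the record, the usual alternative to removing it entirely is to refresh the deadline right before each read instead of setting it once for the connection's lifetime; rough sketch below, assuming a plain net.Conn. 1.5.4 simply drops the one-shot deadline.)

package client // sketch only

import (
	"net"
	"time"
)

// readWithDeadline pushes the read deadline forward before every read, so an
// idle-but-healthy connection never trips a stale deadline set long ago.
func readWithDeadline(conn net.Conn, buf []byte, d time.Duration) (int, error) {
	if err := conn.SetReadDeadline(time.Now().Add(d)); err != nil {
		return 0, err
	}
	return conn.Read(buf)
}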

mperham commented 2 years ago

1.5.4 is now out. Let me know if it behaves better for you.

aaroalan commented 2 years ago

I tested with the repository that I created to replicate the error (https://github.com/aaroalan/faktory_timeout) and can confirm that the error log is no longer present with client 1.5.4. I will update the production apps to the latest version.

Thanks @mperham 👍