contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

dial tcp: i/o timeout v1.5.5 #388

Closed AvivM96 closed 1 year ago

AvivM96 commented 2 years ago

Hey, lately I've been receiving many dial tcp i/o timeouts while trying to open connections to the Faktory server in order to submit jobs. I tried upgrading both the server and the client to the latest version, 1.5.5, which reduced the number of timeouts but did not eliminate them.

I checked the Faktory server's resources and it is not short on CPU or memory. I also tried using a managed connection pool instead of calling faktory.Open() for each submission, but that did not bring much improvement either (the pooled path looks roughly like the sketch below). Our usage of the Faktory server is very light: we process around 5k jobs daily, without any significant peaks.
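For reference, a minimal sketch of that pooled submission path with the Go client; the pool capacity and the "example" jobtype are placeholders:

package main

import (
	"log"

	faktory "github.com/contribsys/faktory/client"
)

func main() {
	// NewPool manages a set of client connections on the caller's behalf;
	// the capacity of 5 is arbitrary for this sketch.
	pool, err := faktory.NewPool(5)
	if err != nil {
		log.Fatal(err)
	}

	// With checks a client out of the pool, runs the callback,
	// and returns the connection to the pool afterwards.
	err = pool.With(func(cl *faktory.Client) error {
		job := faktory.NewJob("example", 1, 2, 3) // placeholder jobtype and args
		return cl.Push(job)
	})
	if err != nil {
		log.Println("push failed:", err)
	}
}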

This is the error I'm receiving:

dial tcp: i/o timeout
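As a general mitigation for transient dial timeouts, retrying the open with a short backoff can help. A minimal sketch, where openWithRetry, the attempt count, and the backoff step are all hypothetical; it also assumes the dial error surfaces as a net.Error (i.e. is not wrapped opaquely by the client):

package jobs

import (
	"errors"
	"net"
	"time"

	faktory "github.com/contribsys/faktory/client"
)

// openWithRetry is a hypothetical helper, not part of the client API:
// it retries faktory.Open() with a growing delay when the dial times out.
func openWithRetry(attempts int) (*faktory.Client, error) {
	var err error
	for i := 0; i < attempts; i++ {
		cl, cerr := faktory.Open()
		if cerr == nil {
			return cl, nil
		}
		err = cerr
		var nerr net.Error
		if !errors.As(cerr, &nerr) || !nerr.Timeout() {
			break // only retry timeouts; fail fast on anything else
		}
		time.Sleep(time.Duration(i+1) * 200 * time.Millisecond)
	}
	return nil, err
}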

In the Faktory server logs I'm seeing many failed connections:

E 2021-12-01T09:35:24.214Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:38094: read: connection reset by peer
E 2021-12-01T09:38:14.214Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:58026: read: connection reset by peer
E 2021-12-01T09:38:34.218Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:60252: read: connection reset by peer
E 2021-12-01T09:40:44.215Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:44264: read: connection reset by peer
E 2021-12-01T09:43:24.215Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:59242: read: connection reset by peer
E 2021-12-01T09:44:24.215Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:36832: read: connection reset by peer
E 2021-12-01T09:47:20.412Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:53710: read: connection reset by peer
E 2021-12-01T09:50:24.214Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:42998: read: connection reset by peer
E 2021-12-01T09:50:44.215Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:44850: read: connection reset by peer
E 2021-12-01T09:56:34.224Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:49478: read: connection reset by peer
E 2021-12-01T10:00:54.217Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:46060: read: connection reset by peer
E 2021-12-01T10:05:40.412Z Bad connection: read tcp 10.117.21.13:7419->10.117.12.57:48802: read: connection reset by peer

Here is a paste from /debug

[screenshot of the /debug page, 2021-12-01 13:35]

Redis Info
# Server
redis_version:6.0.15
redis_git_sha1:cdf11233
redis_git_dirty:0
redis_build_id:74d5fb33a6eb465f
redis_mode:standalone
os:Linux 5.4.129+ x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:10.2.1
process_id:45
run_id:af8674f338b3755230219f3b01faf6357e2faf71
tcp_port:0
uptime_in_seconds:1593255
uptime_in_days:18
hz:10
configured_hz:10
lru_clock:10968565
executable:/usr/bin/redis-server
config_file:/tmp/redis.conf
io_threads_active:0

# Clients
connected_clients:61
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:60
tracking_clients:0
clients_in_timeout_table:60

# Memory
used_memory:3631023
used_memory_human:3.46M
used_memory_rss:5812224
used_memory_rss_human:5.54M
used_memory_peak:8710672
used_memory_peak_human:8.31M
used_memory_peak_perc:41.68%
used_memory_overhead:1842858
used_memory_startup:779592
used_memory_dataset:1788165
used_memory_dataset_perc:62.71%
allocator_allocated:3596841
allocator_active:5774336
allocator_resident:5774336
total_system_memory:12563914752
total_system_memory_human:11.70G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.61
allocator_frag_bytes:2177495
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.01
rss_overhead_bytes:37888
mem_fragmentation_ratio:1.62
mem_fragmentation_bytes:2215383
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:1039578
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:10
rdb_bgsave_in_progress:0
rdb_last_save_time:1638358506
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:868352
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:558
total_commands_processed:48110913
instantaneous_ops_per_sec:35
total_net_input_bytes:4128207918
total_net_output_bytes:1079332567
instantaneous_input_kbps:2.11
instantaneous_output_kbps:0.17
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:40305
evicted_keys:0
keyspace_hits:572112
keyspace_misses:523877
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:295
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_reads_processed:48125916
total_writes_processed:48111313
io_threaded_reads_processed:0
io_threaded_writes_processed:0

# Replication
role:master
connected_slaves:0
master_replid:08e70632acd02bddb3f64edc6cb8307c794878d6
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:3640.645827
used_cpu_user:2479.368255
used_cpu_sys_children:151.504400
used_cpu_user_children:319.291035

# Modules

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=489,expires=0,avg_ttl=0

Disk Usage
> df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  94.3G     13.5G     80.8G  14% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                     5.8G         0      5.8G   0% /sys/fs/cgroup
/dev/sda1                94.3G     13.5G     80.8G  14% /dev/termination-log
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/sda1                94.3G     13.5G     80.8G  14% /etc/resolv.conf
/dev/sda1                94.3G     13.5G     80.8G  14% /etc/hostname
/dev/sda1                94.3G     13.5G     80.8G  14% /etc/hosts
/dev/sda1                94.3G     13.5G     80.8G  14% /etc/faktory/conf.d
/dev/sdb                  7.8G     37.2M      7.8G   0% /var/lib/faktory
tmpfs                     5.8G     12.0K      5.8G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                     5.8G         0      5.8G   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                     5.8G         0      5.8G   0% /proc/scsi
tmpfs                     5.8G         0      5.8G   0% /sys/firmware

I appreciate the effort, thank you!

mperham commented 2 years ago

You'll need to give me an app which reproduces the error so I can debug and fix the issue.
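For reference, a minimal sketch of the kind of standalone reproduction being asked for, mimicking the reported pattern of opening a fresh connection per push. It assumes FAKTORY_URL points at the server; the "example" jobtype and iteration count are arbitrary:

package main

import (
	"log"

	faktory "github.com/contribsys/faktory/client"
)

func main() {
	for i := 0; i < 1000; i++ {
		// Open a new connection for every push, as in the report.
		// Open honors the FAKTORY_URL / FAKTORY_PROVIDER env vars.
		cl, err := faktory.Open()
		if err != nil {
			// the reported "dial tcp: i/o timeout" would surface here
			log.Printf("open %d failed: %v", i, err)
			continue
		}
		if err := cl.Push(faktory.NewJob("example", i)); err != nil {
			log.Printf("push %d failed: %v", i, err)
		}
		cl.Close()
	}
}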