contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

dial unix /var/lib/faktory/db/redis.sock: connect: resource temporarily unavailable #371

Closed. OhadArzouan closed this issue 1 year ago.

OhadArzouan commented 3 years ago
Since upgrading to 1.5.2 we have been seeing an increase in this kind of error:
`dial unix /var/lib/faktory/db/redis.sock: connect: resource temporarily unavailable`
What could cause this new behaviour, or at least such a large increase in these errors?

Would love to share any info that might help figure this out.
Thanks,
Ohad

Are you using an old version? No
Have you checked the changelogs to see if your issue has been fixed in a later version?

https://github.com/contribsys/faktory/blob/master/Changes.md
https://github.com/contribsys/faktory/blob/master/Pro-Changes.md
https://github.com/contribsys/faktory/blob/master/Ent-Changes.md
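
For reference, here is a minimal standalone sketch (not Faktory's internal client code) that dials the socket path from the error message and checks whether the failure is the EAGAIN / "resource temporarily unavailable" case, retrying briefly if so:

```go
// Minimal standalone probe, assuming the socket path from the error above.
// This is NOT Faktory's internal client code, just an illustration of how
// the EAGAIN ("resource temporarily unavailable") case can be detected.
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

func main() {
	const sock = "/var/lib/faktory/db/redis.sock"
	for attempt := 1; attempt <= 3; attempt++ {
		conn, err := net.DialTimeout("unix", sock, 2*time.Second)
		if err == nil {
			fmt.Println("connected to", conn.RemoteAddr())
			conn.Close()
			return
		}
		if errors.Is(err, syscall.EAGAIN) {
			// connect() returned EAGAIN; back off briefly and try again.
			fmt.Printf("attempt %d: %v\n", attempt, err)
			time.Sleep(time.Duration(attempt) * 100 * time.Millisecond)
			continue
		}
		fmt.Println("dial failed:", err)
		return
	}
}
```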

mperham commented 3 years ago

I'm not aware of this issue or what might cause it. I'll need more specifics but I'm not even sure what to ask for.

mperham commented 3 years ago

I just reviewed the v1.5.1..v1.5.2 diff and don't see anything which would cause this. You don't say what you upgraded from.

mperham commented 3 years ago

Paste the contents of the /debug page, for a start.
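
For example, a hypothetical helper for capturing it (assuming the Web UI is reachable locally on its default port 7420 with no password configured; adjust as needed):

```go
// Hypothetical helper: dump the Faktory Web UI /debug page so it can be
// pasted into the issue. Assumes the UI listens on localhost:7420 without
// a password configured.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://localhost:7420/debug")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body) // raw page contents
}
```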

OhadArzouan commented 3 years ago

Sorry for the late response; we upgraded from 1.4.2. We are running Faktory on an ECS instance with Fargate. As for the /debug info:

Redis Info

```
# Server
redis_version:6.0.14
redis_git_sha1:ecf4164e
redis_git_dirty:0
redis_build_id:bd2e1423b53357c
redis_mode:standalone
os:Linux 4.14.238-182.422.amzn2.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:10.2.1
process_id:27
run_id:fd849b01dea162356b305e752b51fbf0b93f52c1
tcp_port:0
uptime_in_seconds:99011
uptime_in_days:1
hz:10
configured_hz:10
lru_clock:2316464
executable:/usr/bin/redis-server
config_file:/tmp/redis.conf
io_threads_active:0

# Clients
connected_clients:251
client_recent_max_input_buffer:32774
client_recent_max_output_buffer:16424
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

# Memory
used_memory:423373026
used_memory_human:403.76M
used_memory_rss:578756608
used_memory_rss_human:551.95M
used_memory_peak:962746814
used_memory_peak_human:918.15M
used_memory_peak_perc:43.98%
used_memory_overhead:37320386
used_memory_startup:779576
used_memory_dataset:386052640
used_memory_dataset_perc:91.35%
allocator_allocated:423621248
allocator_active:578690048
allocator_resident:578690048
total_system_memory:32143994880
total_system_memory_human:29.94G
used_memory_lua:79872
used_memory_lua_human:78.00K
used_memory_scripts:2844
used_memory_scripts_human:2.78K
number_of_cached_scripts:4
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.37
allocator_frag_bytes:155068800
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:66560
mem_fragmentation_ratio:1.37
mem_fragmentation_bytes:155135360
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:4359374
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:161052
rdb_bgsave_in_progress:1
rdb_last_save_time:1629706377
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:19
rdb_current_bgsave_time_sec:8
rdb_last_cow_size:143691776
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:13854
total_commands_processed:3094324895
instantaneous_ops_per_sec:41481
total_net_input_bytes:1623262490429
total_net_output_bytes:109466234743
instantaneous_input_kbps:21848.88
instantaneous_output_kbps:1141.99
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:8305702
expired_stale_perc:12.43
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:49757
evicted_keys:0
keyspace_hits:102473087
keyspace_misses:1250248966
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:9353
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_reads_processed:1463976446
total_writes_processed:1462719079
io_threaded_reads_processed:0
io_threaded_writes_processed:0

# Replication
role:master
connected_slaves:0
master_replid:759c135676e96aa4afebea6686f758efa1e71f9f
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:22453.085994
used_cpu_user:48718.990022
used_cpu_sys_children:1357.486476
used_cpu_user_children:9929.667411

# Modules

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=240894,expires=240234,avg_ttl=126454244
```

Disk Usage
```
> df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  29.4G     10.3G     17.5G  37% /
tmpfs                    64.0M         0     64.0M   0% /dev
shm                      15.0G         0     15.0G   0% /dev/shm
tmpfs                    15.0G         0     15.0G   0% /sys/fs/cgroup
127.0.0.1:/               8.0E    183.0M      8.0E   0% /var/lib/faktory
/dev/xvdcz               29.4G     10.3G     17.5G  37% /etc/hosts
/dev/xvdcz               29.4G     10.3G     17.5G  37% /etc/resolv.conf
/dev/xvdcz               29.4G     10.3G     17.5G  37% /etc/hostname
/dev/xvda1                4.9G      1.7G      3.1G  35% /managed-agents/execute-command
tmpfs                    15.0G         0     15.0G   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/latency_stats
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                    64.0M         0     64.0M   0% /proc/sched_debug
tmpfs                    15.0G         0     15.0G   0% /sys/firmware
tmpfs                    15.0G         0     15.0G   0% /proc/scsi
```

mperham commented 3 years ago

The only unusual metric I see is this:

instantaneous_input_kbps:21848.88
instantaneous_output_kbps:1141.99

That's a lot of input (roughly 21 MB/sec) and not a lot of output (roughly 1 MB/sec). I have to wonder why the network is so busy.
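
(For scale: per the Redis docs, instantaneous_input_kbps is reported in KB/sec, so 21848.88 KB/s is roughly 21.3 MB/s inbound versus about 1.1 MB/s outbound, an input-to-output ratio of nearly 19:1.)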

0xO0O0 commented 1 year ago

I have seen this bug too, using the latest version of faktory-ent. It causes the container to shut down and spawn a new one, over and over. The setup is in Fargate, with Faktory running as a sidecar container in a task whose main container is the API; both the API and Faktory containers share the EFS path /var/lib/faktory. Any insight into the cause of this?