contribsys / faktory_worker_go

Faktory workers for Go
Mozilla Public License 2.0

heartbeat error: ERR Unknown worker #57

Closed · robpingxu closed this issue 3 years ago

robpingxu commented 3 years ago

We have faktory setup in GCP.

There is a compute engine VM running docker image contribsys/faktory:1.5.1 and a cloud run worker service using github.com/contribsys/faktory_worker_go v1.4.2

A few days ago, for several hours we saw heartbeat error: ERR Unknown worker 4pf5cn20s4elh every 10 seconds.

It looks like there was a network hiccup, so it makes sense that Faktory wasn't able to get a heartbeat from a worker for a while, assumed the worker was dead, and no longer recognized it.

What doesn't make sense is that during the time of this error, the worker microservice was apparently still able to process some events. The error went on for hours, and we see that quite a few events still got processed throughout that time.

Is this expected? I'm trying to understand Faktory better so I can debug and utilize it more effectively.

mperham commented 3 years ago

That indicates a worker process did not send its heartbeat within 60 seconds; Faktory thought it was dead and reaped its record (so it will disappear from the list on the Busy page).

This is an unrecoverable error. Once you see this error, you need to restart the worker process and determine why your process did not heartbeat at least once per minute (I recommend sending a heartbeat at least every 15 seconds).
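For illustration only, that cadence amounts to a simple ticker loop like the sketch below; sendBeat and heartbeatLoop are placeholders for what FWG already does internally, not part of its public API.

package heartbeat

import (
    "log"
    "time"
)

// sendBeat stands in for whatever actually issues the BEAT command to
// Faktory; FWG does this for you, so this loop is purely illustrative.
func sendBeat() error { return nil }

func heartbeatLoop(quit <-chan struct{}) {
    // Beat well inside Faktory's 60-second deadline; a 15-second interval
    // leaves several retries before the server reaps the worker record.
    tick := time.NewTicker(15 * time.Second)
    defer tick.Stop()
    for {
        select {
        case <-tick.C:
            if err := sendBeat(); err != nil {
                log.Println("heartbeat error:", err)
            }
        case <-quit:
            return
        }
    }
}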

robpingxu commented 3 years ago

It is odd that some events were still being processed. Regardless, when this error occurs, the best thing would be for the service to terminate so that a new instance gets started by Cloud Run/Kubernetes.

Is there a way to detect this error and cause it to terminate from outside of the package?

Or, ideally, would it be better for mgr.Terminate(true) to be called from within fwg when this error occurs?
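One coarse workaround from outside the package, sketched below on the assumption that fwg v1.4.2 does not surface the heartbeat error to callers: have the handler wrapper record when it last finished a job and expose that as a liveness endpoint, so Cloud Run/Kubernetes can recycle a stale instance. markAlive, LivenessHandler, and the idle threshold are hypothetical names, and an idle worker is not necessarily a dead one, so this is only a rough proxy.

package faktory

import (
    "fmt"
    "net/http"
    "sync/atomic"
    "time"
)

// lastJob holds the Unix time of the most recently completed job.
// The handler wrapper passed to mgr.Register would call markAlive()
// after each job it finishes.
var lastJob int64

func init() { markAlive() } // treat process start as "alive"

func markAlive() { atomic.StoreInt64(&lastJob, time.Now().Unix()) }

// LivenessHandler reports failure once nothing has completed for maxIdle,
// so a Cloud Run / Kubernetes probe can recycle the instance instead of
// someone restarting it by hand.
func LivenessHandler(maxIdle time.Duration) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        idle := time.Since(time.Unix(atomic.LoadInt64(&lastJob), 0))
        if idle > maxIdle {
            http.Error(w, fmt.Sprintf("stale: no jobs for %s", idle), http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}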

mperham commented 3 years ago

Can you tell me how many threads your fwg process is running concurrently? And how many connections you have to Faktory? Seeing the /debug page output would be helpful.

robpingxu commented 3 years ago

Sure. Each fwg instance runs 5 jobs concurrently. It should scale up if CPU utilization gets too high.

We have about 8 microservices that can publish to Faktory, but we usually don't see requests from most of them.

Each one has a Faktory connection pool max size of 100 (probably excessive), though in the logs the connection pool is usually size 1, with a max of 6 over the last week.
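For context, the publishing side looks roughly like the sketch below, assuming the contribsys/faktory Go client's NewPool/With API; the job type and arguments are made up.

package main

import (
    "log"

    faktory "github.com/contribsys/faktory/client"
)

func main() {
    // 100 is only a ceiling; the pool creates connections lazily, which is
    // why the logs show a size of 1 with a peak of 6.
    pool, err := faktory.NewPool(100)
    if err != nil {
        log.Fatal(err)
    }

    err = pool.With(func(cl *faktory.Client) error {
        job := faktory.NewJob("SomeJobType", 42) // hypothetical job type and args
        job.Queue = "Default"                    // assumed queue name
        return cl.Push(job)
    })
    if err != nil {
        log.Fatal(err)
    }
}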

/debug

(screenshot: Screen Shot 2021-08-31 at 10 58 20 PM)

Redis Info
# Server
redis_version:6.0.11
redis_git_sha1:1522534f
redis_git_dirty:0
redis_build_id:becf7110ea30a8a3
redis_mode:standalone
os:Linux 5.4.104+ x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:10.2.1
process_id:10
run_id:12098c42c5f85be9cd3e4e468f4a9d830f3ff49e
tcp_port:0
uptime_in_seconds:7078213
uptime_in_days:81
hz:10
configured_hz:10
lru_clock:3075023
executable:/usr/bin/redis-server
config_file:/tmp/redis.conf
io_threads_active:0

# Clients
connected_clients:6
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:5
tracking_clients:0
clients_in_timeout_table:5

# Memory
used_memory:982735
used_memory_human:959.70K
used_memory_rss:1966080
used_memory_rss_human:1.88M
used_memory_peak:1463495
used_memory_peak_human:1.40M
used_memory_peak_perc:67.15%
used_memory_overhead:897052
used_memory_startup:779576
used_memory_dataset:85683
used_memory_dataset_perc:42.18%
allocator_allocated:948754
allocator_active:1928192
allocator_resident:1928192
total_system_memory:1032220672
total_system_memory_human:984.40M
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:2.03
allocator_frag_bytes:979438
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.02
rss_overhead_bytes:37888
mem_fragmentation_ratio:2.07
mem_fragmentation_bytes:1017326
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:102180
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:1
rdb_bgsave_in_progress:0
rdb_last_save_time:1630464970
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:184320
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0

# Stats
total_connections_received:2252
total_commands_processed:21054809
instantaneous_ops_per_sec:3
total_net_input_bytes:1259097871
total_net_output_bytes:127521931
instantaneous_input_kbps:0.15
instantaneous_output_kbps:0.01
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:128434
evicted_keys:0
keyspace_hits:114613
keyspace_misses:3335427
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:19154
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_reads_processed:21051805
total_writes_processed:21049530
io_threaded_reads_processed:0
io_threaded_writes_processed:0

# Replication
role:master
connected_slaves:0
master_replid:89c4440da111e382d3d21f7487e441dd01f344c8
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:8245.970729
used_cpu_user:5569.880499
used_cpu_sys_children:9.124954
used_cpu_user_children:5.232808

# Modules

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=280,expires=0,avg_ttl=0

Disk Usage
> df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                   5.7G    778.5M      4.9G  13% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                   492.2M         0    492.2M   0% /sys/fs/cgroup
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/sda1                 5.7G    778.5M      4.9G  13% /etc/resolv.conf
/dev/sda1                 5.7G    778.5M      4.9G  13% /etc/hostname
/dev/sda1                 5.7G    778.5M      4.9G  13% /etc/hosts
/dev/sdb                 97.9G     60.0M     92.9G   0% /var/lib/faktory
tmpfs                   492.2M         0    492.2M   0% /proc/acpi
tmpfs                    64.0M         0     64.0M   0% /proc/kcore
tmpfs                    64.0M         0     64.0M   0% /proc/keys
tmpfs                    64.0M         0     64.0M   0% /proc/timer_list
tmpfs                   492.2M         0    492.2M   0% /proc/scsi
tmpfs                   492.2M         0    492.2M   0% /sys/firmware

Is there anything in particular that I need to keep an eye out for?

mperham commented 3 years ago

It sounds like Faktory is pretty lightly loaded. The issue is more likely in your worker process. Can you show me the code which sets up FWG?

robpingxu commented 3 years ago

Sure. Basically we have this struct:

package faktory

import (
    "context"

    worker "github.com/contribsys/faktory_worker_go"
    // ... plus the echo and events packages used below
)

type Manager struct {
    mgr *worker.Manager
}

func (m *Manager) Init() error {
    m.mgr = worker.NewManager()

    m.mgr.Concurrency = 5
    // only expecting these two queues
    m.mgr.ProcessWeightedPriorityQueues(
        map[string]int{
            "Default":   1,
            "Important": 2,
        },
    )

    return nil
}

func (m *Manager) Register(e *echo.Echo, workers map[string]events.WorkerFunc) error {
    for jobNameIter := range workers {
        jobName := jobNameIter // capture the loop variable for the closure
        m.mgr.Register(jobName, func(ctx context.Context, args ...interface{}) error {
            // ...
            // set up the job ID in the logger, do some logging,
            // then call the custom worker func for jobName
            // ...
        })
    }
    return nil
}

func (m *Manager) Run() {
    // blocks, fetching and processing jobs until the process is signaled to shut down
    m.mgr.Run()
}

I don't think there is much here.

mperham commented 3 years ago

Yeah, looks pretty straightforward. Perhaps the worker process should auto-shutdown if the heartbeat returns a well-known error message like that…
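Conceptually, that could look something like the sketch below; this is only the idea, not FWG's actual heartbeat code, and handleBeatError is a hypothetical hook around the Terminate(true) call mentioned earlier.

package autoshutdown

import (
    "log"
    "strings"

    worker "github.com/contribsys/faktory_worker_go"
)

// handleBeatError sketches the proposed behavior: an "Unknown worker" reply
// means the server has already reaped this worker's record, so finish any
// in-flight jobs and exit, letting Cloud Run/Kubernetes start a fresh
// instance that re-registers itself with Faktory.
func handleBeatError(mgr *worker.Manager, err error) {
    if err == nil {
        return
    }
    if strings.Contains(err.Error(), "Unknown worker") {
        mgr.Terminate(true)
        return
    }
    log.Println("heartbeat error:", err)
}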

robpingxu commented 3 years ago

Since the error is unrecoverable, fwg finishing what it is currently doing and then terminating would be pretty ideal for Cloud Run/Kubernetes. Otherwise, we would have to set up alerts and have someone deal with it manually.

aaroalan commented 3 years ago

I am getting the same error in a Kubernetes network, worker version 1.4.0. I can't consistently reproduce it, but I noticed it happened multiple times when workers were under load and I scaled workers down. Maybe the heartbeat fails while the Service is updating the iptables rules, but that's just a theory.

The auto-shutdown sounds great; in k8s you can monitor the process, and the pod will restart automatically if it gets killed.

mperham commented 3 years ago

I've added this auto-shutdown behavior to FWG as of the v1.5.0 tag.