Open Tylerian opened 2 months ago
Tell me about your Redis setup. Are you using a remote Redis? Can you show me the info output?
We're using a remote Redis provided by Redislabs, hosted in the same AZ and Datacenter as our Faktory Server.
Redis RTT?: 497.387 µs
# Server
redis_version:7.2.4
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:0000000000000000000000000000000000000000
redis_mode:standalone
os:Linux 5.15.0-1066-aws x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:9.4.0
process_id:11875333
run_id:f9e2baa4fe01d8d4b313c2b07d111ed08b336d75
tcp_port:14005
server_time_usec:1725532045000000
uptime_in_seconds:31505419
uptime_in_days:364
hz:10
lru_clock:0
config_file:
# Clients
connected_clients:48
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:40
maxclients:0
cluster_connections:0
# Memory
used_memory:47926000
used_memory_human:45.70M
used_memory_rss:47926000
used_memory_peak:124602336
used_memory_peak_human:118.83M
used_memory_lua:77824
maxmemory_policy:noeviction
mem_fragmentation_ratio:1
mem_allocator:jemalloc-5.3.0
# Persistence
loading:0
rdb_changes_since_last_save:57472881
rdb_bgsave_in_progress:0
rdb_last_save_time:1723111940
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:0
aof_base_size:0
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:972171
total_commands_processed:175702809
instantaneous_ops_per_sec:1343
total_net_input_bytes:20867838112
total_net_output_bytes:51774462140
instantaneous_input_kbps:197.51
instantaneous_output_kbps:225.87
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:0
expired_keys:13523
evicted_keys:0
keyspace_hits:83029617
keyspace_misses:176121128
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:0
migrate_cached_sockets:0
total_forks:0
total_error_replies:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
total_active_defrag_time:0
current_active_defrag_time:0
# Replication
role:master
connected_slaves:1
slave0:ip=0.0.0.0,port=0,state=online,offset=0,lag=0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:0.00
used_cpu_user:0.00
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
used_cpu_sys_main_thread:0.00
used_cpu_user_main_thread:0.00
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=2565,expires=12,avg_ttl=2006424046
Your Redis looks perfect.
How often does this problem happen? Every time you run this workflow? The other tricky thing to look at is deployments. Do you see problems popping up around the times when you start/stop your worker processes?
How often does this problem happen? Every time you run this workflow?
It happens intermittently. It's not correlated to the workflow itself, but the workflow runs long enough for the issue to occur: once a batch finishes, another one is instantly created and enqueued, ad infinitum.
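For reference, that batch chaining is shaped roughly like the sketch below, written against the Faktory Go client's batch API; the job names, arguments and batch size are made up for illustration and are not the real workflow.

package main

import (
    faktory "github.com/contribsys/faktory/client"
)

// enqueueBatch pushes one batch of work; the batch's success callback
// ("StartNextBatch", a made-up job type) is what creates and enqueues
// the next batch, so batches chain indefinitely.
func enqueueBatch(cl *faktory.Client) error {
    b := faktory.NewBatch(cl)
    b.Description = "batch N of the long-running workflow"
    b.Success = faktory.NewJob("StartNextBatch")

    return b.Jobs(func() error {
        for i := 0; i < 1000; i++ { // batch size is a placeholder
            if err := b.Push(faktory.NewJob("WorkUnit", i)); err != nil {
                return err
            }
        }
        return nil
    })
}

func main() {
    cl, err := faktory.Open()
    if err != nil {
        panic(err)
    }
    defer cl.Close()

    if err := enqueueBatch(cl); err != nil {
        panic(err)
    }
}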
The other tricky thing to look at is deployments. Do you see problems popping up around the times when you start/stop your worker processes?
It might be the case, as our k8s cluster uses a mix of regular and spot instances as k8s nodes, so both the Faktory server & the worker processes can be rescheduled to other nodes regularly.
Another little thing we found in the logs is that the worker processes emit many Unable to report JID result to Faktory
errors, but only the ones of the i/o timeout kind coincide with the Pending JIDs keeping the Batch from succeeding.
The i/o timeout errors have been logged for both the Faktory Worker -> Faktory Server and the Faktory Server -> Redis connections, as summarised below:
- Unable to report JID <jid> result to Faktory: ERR dial tcp <redis_server_address>:6379: i/o timeout
- Unable to report JID <jid> result to Faktory: read tcp <faktory_worker_address>:46470 -> <faktory_server_address>:7419: i/o timeout
Is your k8s cluster shutting down Faktory? You should not shut down Faktory until all worker processes are gone.
You should not shut down Faktory until all worker processes are gone.
Hmm, by the nature of our k8s cluster we don't have any way to control that. Does Faktory provide any way to perform "graceful shutdowns" to counter node rescheduling?
That's just it: you need to treat Faktory like a database. You can't arbitrarily take it down unless all workers/clients are down too. I think this is called a StatefulSet? Your worker pods should declare a dependency on the Faktory service so any worker pods are shut down before Faktory.
Hmm okay, we'll try to configure our workers as a dependency of the Faktory server and will keep you posted.
You can't arbitrarily take it down unless all workers/clients are down too.
I have a follow-up question about this, though. If Faktory's state is stored in Redis, why does the shutdown order matter? Wouldn't that scenario be similar to a transient networking error, for example? The logged errors that originate from this issue are network errors, after all.
I've been reviewing another Batch that failed today for the same reason, and this time the Server pod hasn't been rescheduled: both the BATCH NEW ...
and the cannot retry reservation: Job not found
logs have been emitted by the same pod, which is still alive as I type this.
This time the Batch is stuck with 6 jobs pending, all of which appear in consecutive error logs printed by the Faktory server.
2024-09-11 01:21:45.439 | W 2024-09-11T01:21:45.439Z Unable to process timed job: cannot retry reservation: Job not found 3ciuHVnFiTwAtXnq
2024-09-11 01:21:45.438 | W 2024-09-11T01:21:45.438Z Unable to process timed job: cannot retry reservation: Job not found iLPAY-kngjMN6Kcr
2024-09-11 01:21:45.437 | W 2024-09-11T01:21:45.437Z Unable to process timed job: cannot retry reservation: Job not found Spfq2ntC5X11lWxt
2024-09-11 01:21:45.435 | W 2024-09-11T01:21:45.435Z Unable to process timed job: cannot retry reservation: Job not found IkPIQsd-QG_qfgIa
2024-09-11 01:21:45.434 | W 2024-09-11T01:21:45.433Z Unable to process timed job: cannot retry reservation: Job not found A8cIGjuWfc1p3uoK
@Tylerian I would make sure you are shutting down your Faktory worker processes cleanly. If they are being killed, it's possible there's a bug in Faktory leaving job data in a half-baked state.
All our worker processes are configured to shut down gracefully. We've set a terminationGracePeriodSeconds
greater than faktory_worker_go's default shutdown period of 30 seconds for all our Worker pods. Furthermore, we're confident our jobs take no longer than 30 seconds to process.
Bear in mind that neither the Faktory Server process nor the Application Worker processes were killed by k8s the last time this issue popped up, which hints that it could be triggered by transient network errors such as connection timeouts.
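For completeness, the worker setup described above boils down to something like this minimal faktory_worker_go sketch (the job type and concurrency value are placeholders): Run() handles SIGTERM and drains in-flight jobs before exiting, and terminationGracePeriodSeconds is set higher than that shutdown window.

package main

import (
    "context"

    worker "github.com/contribsys/faktory_worker_go"
)

// workUnit is a placeholder job handler; per the discussion above, the
// real jobs finish in well under 30 seconds.
func workUnit(ctx context.Context, args ...interface{}) error {
    // ... actual work ...
    return nil
}

func main() {
    mgr := worker.NewManager()
    mgr.Concurrency = 10 // placeholder value
    mgr.Register("WorkUnit", workUnit)

    // Run() traps SIGTERM and waits for in-flight jobs before exiting,
    // so the pod's terminationGracePeriodSeconds is set higher than that
    // shutdown window (30 seconds by default, as noted above).
    mgr.Run()
}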
Do you think the issue is transient network errors between Client <-> Faktory or Faktory <-> Redis?
I would think the latter; I will review the batch code and make sure we're using MULTI where necessary to ensure transactional changes.
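For reference, "using MULTI to ensure transactional changes" means grouping the related Redis updates so they either all apply or none do. Below is a minimal sketch of the idea using the go-redis client, with made-up key names; it is not Faktory's actual code or schema.

package main

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// completeBatchJob wraps two related updates (removing the JID from a
// pending set, bumping a done counter) in MULTI/EXEC so a crash or
// dropped connection cannot leave only one of them applied.
// The keys ("batch-<bid>-pending", "batch-<bid>") are illustrative only.
func completeBatchJob(ctx context.Context, rdb *redis.Client, bid, jid string) error {
    pipe := rdb.TxPipeline() // queues the commands inside MULTI ... EXEC
    pipe.SRem(ctx, "batch-"+bid+"-pending", jid)
    pipe.HIncrBy(ctx, "batch-"+bid, "done", 1)
    _, err := pipe.Exec(ctx)
    return err
}

func main() {
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    defer rdb.Close()

    if err := completeBatchJob(context.Background(), rdb, "b-12345", "jid-abc"); err != nil {
        panic(err)
    }
}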
I don't know, to be honest; both the Faktory Server and Redis IP addresses have been reported in Unable to report JID <jid> result to Faktory: [...] i/o timeout
logs.
What I keep observing is that Faktory fails to acknowledge the result of jobs within the Batch when a transient error occurs, be it a k8s deployment or a network error, even though the jobs are successfully processed later and removed from Faktory's queue. This might hint at some kind of bug in the batching internals.
Sorry for the lack of specificity, but it's all I can do as a mere observer from the outside.
I've not been able to track this down; I'll need more data if we want to solve this.
Checklist
Which Faktory package and version? Faktory v1.9.0
Which Faktory worker package and version? faktory_worker_go v1.9.0
Please include any relevant worker configuration
Please include any relevant error messages or stacktraces
Server
Unable to process timed job: cannot retry reservation: Job not found Qtjl6r_Ifxh32kUQ
No such job to acknowledge Qtjl6r_Ifxh32kUQ