Open Tylerian opened 2 months ago
Tell me about your Redis setup. Are you using a remote Redis? Can you show me the info output?
We're using a remote Redis provided by Redislabs, hosted in the same AZ and Datacenter as our Faktory Server.
Redis RTT?: 497.387 µs
# Server
redis_version:7.2.4
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:0000000000000000000000000000000000000000
redis_mode:standalone
os:Linux 5.15.0-1066-aws x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:9.4.0
process_id:11875333
run_id:f9e2baa4fe01d8d4b313c2b07d111ed08b336d75
tcp_port:14005
server_time_usec:1725532045000000
uptime_in_seconds:31505419
uptime_in_days:364
hz:10
lru_clock:0
config_file:
# Clients
connected_clients:48
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:40
maxclients:0
cluster_connections:0
# Memory
used_memory:47926000
used_memory_human:45.70M
used_memory_rss:47926000
used_memory_peak:124602336
used_memory_peak_human:118.83M
used_memory_lua:77824
maxmemory_policy:noeviction
mem_fragmentation_ratio:1
mem_allocator:jemalloc-5.3.0
# Persistence
loading:0
rdb_changes_since_last_save:57472881
rdb_bgsave_in_progress:0
rdb_last_save_time:1723111940
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:0
aof_base_size:0
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:972171
total_commands_processed:175702809
instantaneous_ops_per_sec:1343
total_net_input_bytes:20867838112
total_net_output_bytes:51774462140
instantaneous_input_kbps:197.51
instantaneous_output_kbps:225.87
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:0
expired_keys:13523
evicted_keys:0
keyspace_hits:83029617
keyspace_misses:176121128
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:0
migrate_cached_sockets:0
total_forks:0
total_error_replies:0
total_eviction_exceeded_time:0
current_eviction_exceeded_time:0
total_active_defrag_time:0
current_active_defrag_time:0
# Replication
role:master
connected_slaves:1
slave0:ip=0.0.0.0,port=0,state=online,offset=0,lag=0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:0.00
used_cpu_user:0.00
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
used_cpu_sys_main_thread:0.00
used_cpu_user_main_thread:0.00
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=2565,expires=12,avg_ttl=2006424046
Your Redis looks perfect.
How often does this problem happen? Every time you run this workflow? The other tricky thing to look at is deployments. Do you see problems popping up around the times when you start/stop your worker processes?
How often does this problem happen? Every time you run this workflow?
It happens intermittently. It's not correlated to the workflow itself, but the workflow runs long enough for the issue to occur: once a batch finishes, another one is instantly created and enqueued, ad infinitum.
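For reference, that batch chaining is shaped roughly like the sketch below, written against the Faktory Go client's batch API; the job names, arguments and batch size are made up for illustration and are not the real workflow.

package main

import (
    faktory "github.com/contribsys/faktory/client"
)

// enqueueBatch pushes one batch of work; the batch's success callback
// ("StartNextBatch", a made-up job type) is what creates and enqueues
// the next batch, so batches chain indefinitely.
func enqueueBatch(cl *faktory.Client) error {
    b := faktory.NewBatch(cl)
    b.Description = "batch N of the long-running workflow"
    b.Success = faktory.NewJob("StartNextBatch")

    return b.Jobs(func() error {
        for i := 0; i < 1000; i++ { // batch size is a placeholder
            if err := b.Push(faktory.NewJob("WorkUnit", i)); err != nil {
                return err
            }
        }
        return nil
    })
}

func main() {
    cl, err := faktory.Open()
    if err != nil {
        panic(err)
    }
    defer cl.Close()

    if err := enqueueBatch(cl); err != nil {
        panic(err)
    }
}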
The other tricky thing to look at is deployments. Do you see problems popping up around the times when you start/stop your worker processes?
It might be the case, as our k8s cluster uses a mix of regular and spot instances as k8s nodes, so both the Faktory server & the worker processes can be rescheduled to other nodes regularly.
Another little thing we found in the logs is that the worker processes emit many Unable to report JID result to Faktory
errors, but only the ones of the i/o timeout kind coincide with the Pending JIDs keeping the Batch from succeeding.
The i/o timeout errors have been logged for both the Faktory Worker -> Faktory Server and the Faktory Server -> Redis connections, as summarised below:
- Unable to report JID <jid> result to Faktory: ERR dial tcp <redis_server_address>:6379: i/o timeout
- Unable to report JID <jid> result to Faktory: read tcp <faktory_worker_address>:46470 -> <faktory_server_address>:7419: i/o timeout
Is your k8s cluster shutting down Faktory? You should not shut down Faktory until all worker processes are gone.
You should not shut down Faktory until all worker processes are gone.
Hmm, by the nature of our k8s cluster we don't have any way to control that. Does Faktory provide any way to perform "graceful shutdowns" to counter node rescheduling?
That's just it: you need to treat Faktory like a database. You can't arbitrarily take it down unless all workers/clients are down too. I think this is called a StatefulSet? Your worker pods should declare a dependency on the Faktory service so any worker pods are shut down before Faktory.
Hmm okay, we'll try to configure our workers as a dependency of the Faktory server and will keep you posted.
You can't arbitrarily take it down unless all workers/clients are down too.
I have a follow-up question about this, though. If Faktory's state is stored in Redis, why does the shutdown order matter? Wouldn't that scenario be similar to a transient networking error, for example? The logged errors that originate from this issue are network errors, after all.
I've been reviewing another Batch that failed today for the same reason, and this time the Server pod hasn't been rescheduled: both the BATCH NEW ...
and the cannot retry reservation: Job not found
logs have been emitted by the same pod, which is still alive as I type this.
This time the Batch is stuck with 6 jobs pending, all of which appear in consecutive error logs printed by the Faktory server.
2024-09-11 01:21:45.439 | W 2024-09-11T01:21:45.439Z Unable to process timed job: cannot retry reservation: Job not found 3ciuHVnFiTwAtXnq
2024-09-11 01:21:45.438 | W 2024-09-11T01:21:45.438Z Unable to process timed job: cannot retry reservation: Job not found iLPAY-kngjMN6Kcr
2024-09-11 01:21:45.437 | W 2024-09-11T01:21:45.437Z Unable to process timed job: cannot retry reservation: Job not found Spfq2ntC5X11lWxt
2024-09-11 01:21:45.435 | W 2024-09-11T01:21:45.435Z Unable to process timed job: cannot retry reservation: Job not found IkPIQsd-QG_qfgIa
2024-09-11 01:21:45.434 | W 2024-09-11T01:21:45.433Z Unable to process timed job: cannot retry reservation: Job not found A8cIGjuWfc1p3uoK
@Tylerian I would make sure you are shutting down your Faktory worker processes cleanly. If they are being killed, it's possible there's a bug in Faktory leaving job data in a half-baked state.
All our worker processes are configured to shut down gracefully. We've set a terminationGracePeriodSeconds
greater than faktory_worker_go's default shutdown period of 30 seconds for all our Worker pods. Furthermore, we're confident our jobs take no longer than 30 seconds to process.
Bear in mind that neither the Faktory Server process nor the Application Worker processes were killed by k8s the last time this issue popped up, which hints that it could be triggered by transient network errors such as connection timeouts.
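For completeness, the worker setup described above boils down to something like this minimal faktory_worker_go sketch (the job type and concurrency value are placeholders): Run() handles SIGTERM and drains in-flight jobs before exiting, and terminationGracePeriodSeconds is set higher than that shutdown window.

package main

import (
    "context"

    worker "github.com/contribsys/faktory_worker_go"
)

// workUnit is a placeholder job handler; per the discussion above, the
// real jobs finish in well under 30 seconds.
func workUnit(ctx context.Context, args ...interface{}) error {
    // ... actual work ...
    return nil
}

func main() {
    mgr := worker.NewManager()
    mgr.Concurrency = 10 // placeholder value
    mgr.Register("WorkUnit", workUnit)

    // Run() traps SIGTERM and waits for in-flight jobs before exiting,
    // so the pod's terminationGracePeriodSeconds is set higher than that
    // shutdown window (30 seconds by default, as noted above).
    mgr.Run()
}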
Do you think the issue is transient network errors between Client <-> Faktory or Faktory <-> Redis?
I would think the latter; I will review the batch code and make sure we're using MULTI where necessary to ensure transactional changes.
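For reference, "using MULTI to ensure transactional changes" means grouping the related Redis updates so they either all apply or none do. Below is a minimal sketch of the idea using the go-redis client, with made-up key names; it is not Faktory's actual code or schema.

package main

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// completeBatchJob wraps two related updates (removing the JID from a
// pending set, bumping a done counter) in MULTI/EXEC so a crash or
// dropped connection cannot leave only one of them applied.
// The keys ("batch-<bid>-pending", "batch-<bid>") are illustrative only.
func completeBatchJob(ctx context.Context, rdb *redis.Client, bid, jid string) error {
    pipe := rdb.TxPipeline() // queues the commands inside MULTI ... EXEC
    pipe.SRem(ctx, "batch-"+bid+"-pending", jid)
    pipe.HIncrBy(ctx, "batch-"+bid, "done", 1)
    _, err := pipe.Exec(ctx)
    return err
}

func main() {
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    defer rdb.Close()

    if err := completeBatchJob(context.Background(), rdb, "b-12345", "jid-abc"); err != nil {
        panic(err)
    }
}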
I don't know, to be honest; both the Faktory Server and Redis IP addresses have been reported in Unable to report JID <jid> result to Faktory: [...] i/o timeout
logs.
What I keep observing is that Faktory fails to acknowledge the result of jobs within the Batch when a transient error occurs, be it a k8s deployment or a network error, even though the jobs are successfully processed later and removed from Faktory's queue. This might hint at some kind of bug in the batching internals.
Sorry for the lack of specificity, but it's all I can do as a mere observer from the outside.
I've not been able to track this down; I'll need more data if we want to solve this.
Checklist
Which Faktory package and version? Faktory v1.9.0
Which Faktory worker package and version? faktory_worker_go v1.9.0
Please include any relevant worker configuration
Please include any relevant error messages or stacktraces
Server
Unable to process timed job: cannot retry reservation: Job not found Qtjl6r_Ifxh32kUQ
No such job to acknowledge Qtjl6r_Ifxh32kUQ