contribsys / faktory

Language-agnostic persistent background job server
https://contribsys.com/faktory/

redis: transaction failed #305

Closed: scottrobertson closed this issue 4 years ago

scottrobertson commented 4 years ago

Seeing this quite a bit lately. Any ideas on how to debug/fix?

Faktory::CommandError: Faktory::CommandError: redis: transaction failed
mperham commented 4 years ago

Anything in Faktory's logs? Redis out of RAM or disk space?

mperham commented 4 years ago

One debugging tip: you should have a persistent directory mounted for the Redis datafile; in production this will be /var/lib/faktory/db. When Faktory is running, you will find a redis.sock file there, and you can monitor the raw Redis commands by running redis-cli -s /var/lib/faktory/db/redis.sock monitor. Note that you will get a LOT of output, so you'll want to trim it down somehow (e.g. by grepping for a specific job or batch ID) to make it manageable to comb through.

mperham commented 4 years ago

It would also be helpful to see the output from redis-cli -s /var/lib/faktory/db/redis.sock info.

mperham commented 4 years ago

I've created a Debugging wiki page which collects a few tips for production issues. Note also the /debug page in the Web UI.

scottrobertson commented 4 years ago

# Server
redis_version:5.0.7
redis_git_sha1:bed89672
redis_git_dirty:0
redis_build_id:575dc7b6da705497
redis_mode:standalone
os:Linux 4.15.0-99-generic x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:9.2.0
process_id:34
run_id:a8e92074b0aca0403c82c9b50b50176217474831
tcp_port:0
uptime_in_seconds:1920227
uptime_in_days:22
hz:10
configured_hz:10
lru_clock:14573375
executable:/usr/bin/redis-server
config_file:/tmp/redis.conf

# Clients
connected_clients:76
client_recent_max_input_buffer:32774
client_recent_max_output_buffer:16424
blocked_clients:0

# Memory
used_memory:630621908
used_memory_human:601.41M
used_memory_rss:1322192896
used_memory_rss_human:1.23G
used_memory_peak:1993604075
used_memory_peak_human:1.86G
used_memory_peak_perc:31.63%
used_memory_overhead:29744767
used_memory_startup:860027
used_memory_dataset:600877141
used_memory_dataset_perc:95.41%
allocator_allocated:630597059
allocator_active:1322122240
allocator_resident:1322122240
total_system_memory:33675313152
total_system_memory_human:31.36G
used_memory_lua:54272
used_memory_lua_human:53.00K
used_memory_scripts:2936
used_memory_scripts_human:2.87K
number_of_cached_scripts:4
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:2.10
allocator_frag_bytes:691525181
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:70656
mem_fragmentation_ratio:2.10
mem_fragmentation_bytes:691595837
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:1843196
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0

# Persistence
loading:0
rdb_changes_since_last_save:128229
rdb_bgsave_in_progress:0
rdb_last_save_time:1591631668
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:12
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:198565888
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0

# Stats
total_connections_received:51330
total_commands_processed:29856263357
instantaneous_ops_per_sec:30059
total_net_input_bytes:15812274153561
total_net_output_bytes:1678701402586
instantaneous_input_kbps:16166.65
instantaneous_output_kbps:1164.67
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:137065217
expired_stale_perc:20.45
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:890404465
keyspace_misses:11590684883
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:18174
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0

# Replication
role:master
connected_slaves:0
master_replid:af62e7e80d29d24c8fb432b865867b4a12b4f3b1
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

# CPU
used_cpu_sys:125759.620303
used_cpu_user:317390.176837
used_cpu_sys_children:18584.895236
used_cpu_user_children:119112.351288

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=291455,expires=291325,avg_ttl=175027558

scottrobertson commented 4 years ago

@mperham for what it's worth, we seem to be seeing this on jobs that are super fast, with very little code in them at all (sometimes only a single line runs, with no DB queries etc.).

mperham commented 4 years ago

Do those jobs use any special features or are they plain old jobs?

scottrobertson commented 4 years ago

From what I can tell, they are always within a batch.

mperham commented 4 years ago

Ah, that is a very useful thing to know. I will review the batch code.

mperham commented 4 years ago

I see where the error is raised in the Redis client code, and I see the places in the batch code where I use that API, but I haven't been able to reproduce the error in a test yet.

I'm creating 1000 fast jobs and having 10% fail, like this, so we get a mix of ACK and FAIL responses:

// fastFunc simulates a fast job handler that fails roughly 10% of the time.
func fastFunc(ctx context.Context, args ...interface{}) error {
    if rand.Int31()%10 == 1 {
        return errors.New("oops")
    }
    return nil
}
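
(For context, a handler like that gets wired up through faktory_worker_go along these lines; this is only an illustrative sketch rather than the actual test harness, and the "FastJob" type name is made up:)

import (
    worker "github.com/contribsys/faktory_worker_go"
)

func main() {
    mgr := worker.NewManager()
    // Jobs pushed with the "FastJob" jobtype will invoke fastFunc above.
    mgr.Register("FastJob", fastFunc)
    // Fetch and execute jobs until the process is told to shut down.
    mgr.Run()
}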

But everything executes as expected. Can you give me any more specifics about this batch and its configuration that is triggering the error? Can you show me the code that constructs the batch?

scottrobertson commented 4 years ago

There is nothing special about the batch at all. Sharing the code would not really be any different to what is in the docs.
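
(Roughly, it's the same shape as the documented batch example, i.e. something like the sketch below; shown with the Go client for illustration since ours is the Ruby equivalent, and the description, callback, and job types are placeholders:)

import (
    faktory "github.com/contribsys/faktory/client"
)

func enqueueBatch() error {
    cl, err := faktory.Open()
    if err != nil {
        return err
    }
    defer cl.Close()

    b := faktory.NewBatch(cl)
    b.Description = "Nightly import"                        // placeholder description
    b.Success = faktory.NewJob("ImportFinished", "some-id") // runs once every job succeeds

    // Jobs() opens the batch, pushes the jobs in the block, then commits it.
    return b.Jobs(func() error {
        if err := b.Push(faktory.NewJob("FastJob", 1)); err != nil {
            return err
        }
        return b.Push(faktory.NewJob("FastJob", 2))
    })
}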

This isn't something that happens all the time. We create probably ~1m batches per day and are seeing this happen about 50 times per day, so I think it's going to be super hard to track down.

Here is the full stack trace of the error:

[Screenshot: full stack trace showing Faktory::CommandError: redis: transaction failed, raised while ACKing the job]

mperham commented 4 years ago

Ugh, ok. It's good to know that it happens during an ACK. The fix does not look trivial, and unless I can reproduce it, I can't know whether a change actually fixes the issue (or introduces new bugs!). I'll keep digging into the specific Redis usage in the batch impl for now.

mperham commented 4 years ago

Virtually 100% sure I now know the issue here. This Watch example shows the code literally handling the "redis: transaction failed" error by retrying, which the current batch code does not do.

https://godoc.org/github.com/go-redis/redis#Client.Watch
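
The retry pattern from that godoc example looks roughly like this (a sketch using the pre-v8, context-free go-redis API; the key name and the increment operation come from the example, not from Faktory's batch code):

import (
    "errors"

    "github.com/go-redis/redis"
)

// watchIncr increments key inside a WATCH/MULTI/EXEC transaction,
// retrying whenever the optimistic lock is lost, which is exactly the
// redis.TxFailedErr ("redis: transaction failed") case seen above.
func watchIncr(client *redis.Client, key string) error {
    txf := func(tx *redis.Tx) error {
        n, err := tx.Get(key).Int64()
        if err != nil && err != redis.Nil {
            return err
        }
        n++
        // The queued commands only run if the watched key is unchanged.
        _, err = tx.TxPipelined(func(pipe redis.Pipeliner) error {
            pipe.Set(key, n, 0)
            return nil
        })
        return err
    }

    const maxRetries = 100
    for i := 0; i < maxRetries; i++ {
        err := client.Watch(txf, key)
        if err != redis.TxFailedErr {
            return err // nil on success, or a real error
        }
        // Optimistic lock lost: another client touched the key, so retry.
    }
    return errors.New("watchIncr: reached maximum number of retries")
}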

I'm glad to hear that it happens very rarely; this is the only code in Faktory that uses Watch. I'll add retry support. Thank you so much for your patience and for the clues necessary to track down the problem.

scottrobertson commented 4 years ago

Awesome, glad you found the issue! Thank you.

mperham commented 4 years ago

The fix will be in the 1.4.1 release in the next week.

scottrobertson commented 4 years ago

Thank you.