Open CrazyTennisFan opened 1 year ago
@JohnSully @msotheeswaran Hi John and Malavan, could you please help on this crash issue. Thank you.
Just to clarify, this happens with just 1 call to restore? You mentioned this happens when the server is very busy, what is the context of this, high QPS from clients(i.e. memtier_benchmark) or during BGSAVE or replication? Do you instructions to reproduce this issue?
Going to close this as there has been no update for 2 months, please re-open if you still need help.
Here is attached stack trace below and re-opening the PR.
1:20:M 18 Dec 2023 03:51:33.187 * Starting BGSAVE for SYNC with target: disk
1:20:M 18 Dec 2023 03:51:33.189 * Background saving started
1:22:M 18 Dec 2023 03:51:50.361 * DB saved on disk
1:22:M 18 Dec 2023 03:51:50.409 * RDB: 4381 MB of memory used by copy-on-write
1:19:M 18 Dec 2023 03:51:50.483 * Background saving terminated with success
=== KEYDB BUG REPORT START: Cut & paste starting from here ===
1:21:M 18 Dec 2023 03:51:51.220 # === ASSERTION FAILED OBJECT CONTEXT ===
1:21:M 18 Dec 2023 03:51:51.220 # Object type: 0
1:21:M 18 Dec 2023 03:51:51.220 # Object encoding: 8
1:21:M 18 Dec 2023 03:51:51.220 # Object refcount: 1
1:21:M 18 Dec 2023 03:51:51.220 # === ASSERTION FAILED ===
1:21:M 18 Dec 2023 03:51:51.220 # ==> db.cpp:316 'fInserted' is not true
------ STACK TRACE ------
Backtrace:
/opt/redis/redis-server 0.0.0.0:6379(dbAdd(redisDb*, redisObject*, redisObject*)+0x46) [0x640196]
/opt/redis/redis-server 0.0.0.0:6379(restoreCommand(client*)+0x43f) [0x66451f]
/opt/redis/redis-server 0.0.0.0:6379(call(client*, int)+0xa1) [0x67c3b1]
/opt/redis/redis-server 0.0.0.0:6379(processCommand(client*, int)+0x820) [0x67d3e0]
/opt/redis/redis-server 0.0.0.0:6379(processCommandAndResetClient(client*, int)+0x65) [0x5e7685]
/opt/redis/redis-server 0.0.0.0:6379(processInputBuffer(client*, bool, int)+0x174) [0x5ea254]
/opt/redis/redis-server 0.0.0.0:6379(processClients()+0xc9) [0x5eb749]
/opt/redis/redis-server 0.0.0.0:6379() [0x6a2692]
/opt/redis/redis-server 0.0.0.0:6379(beforeSleep(aeEventLoop*)+0x16e) [0x59bc4e]
/opt/redis/redis-server 0.0.0.0:6379(aeProcessEvents+0xe8) [0x597858]
/opt/redis/redis-server 0.0.0.0:6379(aeMain+0x37) [0x59b947]
/opt/redis/redis-server 0.0.0.0:6379(workerThreadMain(void*)+0x64) [0x6a2724]
/usr/lib64/libpthread.so.0(+0x8f4b) [0x7f7354eb7f4b]
/usr/lib64/libc.so.6(clone+0x3f) [0x7f7354def7ef]
------ INFO OUTPUT ------
# Server
redis_version:255.255.255
patch_version:6.2.3
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:ace1ce722e0c9adf
redis_mode:standalone
os:Linux
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:0.0.0
process_id:1
process_supervised:no
run_id:6a6
tcp_port:6379
server_time_usec:1702871511222520
uptime_in_seconds:37
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:8372695
executable:redis-server
config_file:redis.conf
instance_id:285
# Clients
connected_clients:188
cluster_connections:0
maxclients:10010
client_recent_max_input_buffer:277
client_recent_max_output_buffer:0
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
current_client_thread:2
rx_controlled_clients:0
total_rx_controlled:0
proxy_header_error:0
thread_0_clients:109
thread_1_clients:40
thread_2_clients:40
# Memory
used_memory:5034402456
used_memory_human:4.69G
used_memory_rss:4608233472
used_memory_rss_human:4.29G
used_memory_peak:5041236360
used_memory_peak_human:4.70G
used_memory_peak_perc:99.86%
used_memory_overhead:1116224335
used_memory_startup:4415752
used_memory_dataset:3918178121
used_memory_dataset_perc:77.90%
allocator_allocated:5035115232
allocator_active:5124317184
allocator_resident:5196357632
total_system_memory:16131584000
total_system_memory_human:15.02G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:8589934592
maxmemory_human:8.00G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.02
allocator_frag_bytes:89201952
allocator_rss_ratio:1.01
allocator_rss_bytes:72040448
rss_overhead_ratio:0.89
rss_overhead_bytes:-588124160
mem_fragmentation_ratio:0.92
mem_fragmentation_bytes:-426153640
mem_not_counted_for_evict:536874490
mem_replication_backlog:536870912
mem_clients_slaves:20512
mem_clients_normal:3630271
mem_aof_buffer:3584
mem_allocator:jemalloc-5.2.1
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0
storage_provider:none
# Persistence
loading:0
current_cow_size:4588204032
current_cow_size_age:1
current_fork_perc:0.00
current_save_keys_processed:13224961
current_save_keys_total:0
rdb_changes_since_last_save:19665
rdb_bgsave_in_progress:0
rdb_last_save_time:1702871510
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:17
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:4594831360
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:3667308016
aof_base_size:3662362982
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:225
total_commands_processed:114248
instantaneous_ops_per_sec:23415
total_net_input_bytes:9839098
total_net_output_bytes:381757005
instantaneous_input_kbps:5551.42
instantaneous_output_kbps:214416.84
rejected_connections:0
sync_full:1
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:0
evicted_keys:0
keyspace_hits:701
keyspace_misses:703
pubsub_channels:1
pubsub_patterns:0
latest_fork_usec:0
total_forks:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:20203
dump_payload_sanitizations:0
total_reads_processed:42561
total_writes_processed:42672
instantaneous_lock_contention:3
avg_lock_contention:1.156250
storage_provider_read_hits:0
storage_provider_read_misses:0
# Replication
role:master
connected_slaves:1
slave0:ip=172.31.14.145,port=6379,state=send_bulk,offset=0,lag=0
master_failover_state:no-failover
master_replid:<id>
master_replid2:<id>
master_repl_offset:4949602
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:536870912
repl_backlog_first_byte_offset:1
repl_backlog_histlen:4949602
# CPU
used_cpu_sys:6.210537
used_cpu_user:35.800167
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
server_threads:3
long_lock_waits:0
used_cpu_sys_main_thread:0.205846
used_cpu_user_main_thread:0.354261
# Modules
module:name=traffic_control,ver=1,api=1,filters=0,usedby=[],using=[],options=[]
module:name=auth,ver=1,api=1,filters=0,usedby=[],using=[],options=[]
# Commandstats
cmdstat_info:calls=29,usec=2224,usec_per_call=76.69,rejected_calls=0,failed_calls=0
cmdstat_hexists:calls=11,usec=11,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_client:calls=6,usec=4,usec_per_call=0.67,rejected_calls=0,failed_calls=0
cmdstat_auth:calls=225,usec=1727,usec_per_call=7.68,rejected_calls=0,failed_calls=0
cmdstat_ping:calls=113,usec=87,usec_per_call=0.77,rejected_calls=0,failed_calls=0
cmdstat_publish:calls=54,usec=275,usec_per_call=5.09,rejected_calls=0,failed_calls=0
cmdstat_select:calls=1645,usec=930,usec_per_call=0.57,rejected_calls=0,failed_calls=0
cmdstat_hget:calls=172,usec=611,usec_per_call=3.55,rejected_calls=32,failed_calls=0
cmdstat_subscribe:calls=3,usec=3,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_replconf:calls=57,usec=151,usec_per_call=2.65,rejected_calls=0,failed_calls=0
cmdstat_hmset:calls=0,usec=0,usec_per_call=0.00,rejected_calls=129,failed_calls=0
cmdstat_psync:calls=1,usec=1704,usec_per_call=1704.00,rejected_calls=18,failed_calls=0
cmdstat_exists:calls=254,usec=519,usec_per_call=2.04,rejected_calls=251,failed_calls=0
cmdstat_hgetall:calls=974,usec=5443,usec_per_call=5.59,rejected_calls=73,failed_calls=0
cmdstat_restore:calls=39346,usec=134616,usec_per_call=3.42,rejected_calls=0,failed_calls=19681
cmdstat_replping:calls=0,usec=0,usec_per_call=0.00,rejected_calls=19,failed_calls=0
# Errorstats
errorstat_BUSYKEY:count=19681
errorstat_LOADING:count=503
errorstat_NOAUTH:count=19
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=5722553,expires=0,avg_ttl=0,cached_keys=5722553
db1:keys=5676570,expires=0,avg_ttl=0,cached_keys=5676570
db2:keys=445808,expires=0,avg_ttl=0,cached_keys=445808
db3:keys=852320,expires=0,avg_ttl=0,cached_keys=852320
db4:keys=2231,expires=0,avg_ttl=0,cached_keys=2231
db5:keys=71360,expires=0,avg_ttl=0,cached_keys=71360
db6:keys=30616,expires=0,avg_ttl=0,cached_keys=30616
db7:keys=13,expires=1,avg_ttl=1416595,cached_keys=13
db11:keys=1950,expires=0,avg_ttl=0,cached_keys=1950
db12:keys=259777,expires=0,avg_ttl=0,cached_keys=259777
db16:keys=2,expires=0,avg_ttl=0,cached_keys=2
db17:keys=3,expires=0,avg_ttl=0,cached_keys=3
db20:keys=604530,expires=0,avg_ttl=0,cached_keys=604530
db21:keys=2279,expires=0,avg_ttl=0,cached_keys=2279
# KeyDB
variant:enterprise
license_status:OK
mvcc_depth:1
------ CLIENT LIST OUTPUT -----
---
---
------ CURRENT CLIENT INFO ------
id=200 addr=<ip> laddr=<ip> fd=186 name= age=2 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=40954 argv-mem=219 obl=0 oll=0 omem=0 tot-mem=61699 events=r cmd=restore user=default redir=-1 peer=<ip> network=vpc
argv[0]: 'restore'
argv[1]: 'key_name'
argv[2]: '0'
argv[3]: '
@®®'
argv[4]: 'REPLACE'
1:21:M 18 Dec 2023 03:51:51.223 # key 'key_name' found in DB containing the following object:
1:21:M 18 Dec 2023 03:51:51.223 # Object type: 4
1:21:M 18 Dec 2023 03:51:51.223 # Object encoding: 5
1:21:M 18 Dec 2023 03:51:51.223 # Object refcount: 1
------ MODULES INFO OUTPUT ------
------ FAST MEMORY TEST ------
1:21:M 18 Dec 2023 03:51:51.223 # main thread terminated
1:21:M 18 Dec 2023 03:51:51.223 # Bio thread for job type #0 terminated
1:21:M 18 Dec 2023 03:51:51.224 # Bio thread for job type #1 terminated
1:21:M 18 Dec 2023 03:51:51.224 # Bio thread for job type #2 terminated
Fast memory test PASSED, however your memory can still be broken. Please run a memory test for several hours if possible.
=== KEYDB BUG REPORT END. Make sure to include from START to END. ===
Please report the crash by opening an issue on github:
https://github.com/JohnSully/KeyDB/issues
Suspect RAM error? Use keydb-server --test-memory to verify it.
As mentioned in top comment, issue is not reproducible locally. We are still trying different scenarios to reproduce locally.
@MalavanEQAlpha please reopen this issue, Thanks
I've been told this reproduces still even with your other fix @MalavanEQAlpha
In my side, I found an issue, which is hard to reproduce but it could crash the keydb application once happen. When keydb is very busy and lazyfree-lazy-server-del is set to yes, if one client just run the restoreCommand with replace option, then our codes will run as below:
In the dbDelete function, we can indicate the key will be deleted later, but if in this case, a same name key want to insert, then dbAdd() function will crash as below due to the serverAssertWithInfo()
I am not sure why if it really happen then serverAssertWithInfo will fail. Do we have any idea to avoid the keydb crash? Thanks