Lachim / redis

Automatically exported from code.google.com/p/redis

segfault after 2 days uptime with redis 2.1.10 as queue server #434

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
--- General reflections
It is a queue server, which means we only use lists.
It runs on our development server, we don't have it in production
yet.

The only commands used here are: blpop, rpush, lpush.
We have about 10 workers block-popping on a dozen keys with a 1 second timeout.
The php-fcgi web frontend processes are 5 in total and push their
jobs to the queue server.

The workers do this endlessly. We use Redis as a Gearman substitute here.
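To illustrate the pattern (not our actual code; host, queue names, and payload handling are placeholders), a worker boils down to roughly this hiredis loop, with the php-fcgi frontends enqueueing jobs via RPUSH on the same keys:

#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void) {
    /* Placeholder host, port, and queue names -- not the real setup. */
    redisContext *c = redisConnect("10.10.0.1", 6379);
    if (c == NULL || c->err) return 1;

    for (;;) {
        /* Block for up to 1 second waiting for a job on either queue. */
        redisReply *reply = redisCommand(c, "BLPOP queue:mail queue:resize 1");
        if (reply == NULL) break;                       /* connection error */
        if (reply->type == REDIS_REPLY_ARRAY && reply->elements == 2) {
            /* element[0] is the key the job arrived on, element[1] the payload */
            printf("job from %s: %s\n",
                   reply->element[0]->str, reply->element[1]->str);
        }
        /* A NIL reply just means the 1 second timeout expired; loop again. */
        freeReplyObject(reply);
    }
    redisFree(c);
    return 0;
}

The trailing 1 in the BLPOP call is the timeout in seconds, matching the 1 second block mentioned above.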

-------------------------

I don't really have anything more I can tell, except that I was killing the
worker processes, which means they closed their connections to Redis, and it
seems that Redis segfaulted at that moment.

Hope this helps.

-----------------------------
The dump says:

[5761] 17 Jan 09:19:09 # ======= Ooops! Redis 2.1.10 got signal: -11- =======
[5761] 17 Jan 09:19:09 # redis_version:2.1.10
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:5761
uptime_in_seconds:207613
uptime_in_days:2
lru_clock:1598962
used_cpu_sys:29.31
used_cpu_user:20.95
used_cpu_sys_childrens:0.00
used_cpu_user_childrens:0.00
connected_clients:0
connected_slaves:0
blocked_clients:0
used_memory:784168
used_memory_human:765.79K
used_memory_rss:1740800
mem_fragmentation_ratio:2.22
use_tcmalloc:0
loading:0
aof_enabled:0
changes_since_last_save:3896
bgsave_in_progress:0
last_save_time:1295044736
bgrewriteaof_in_progress:0
total_connections_received:658
total_commands_processed:700327
expired_keys:0
evicted_keys:0
keyspace_hits:4201
keyspace_misses:698074
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master

[5761] 17 Jan 09:19:09 # /blade/exec/10.10.0.1/redis/bin/redis-server(beforeSleep+0x43) [0x40e993]
[5761] 17 Jan 09:19:09 # /blade/exec/10.10.0.1/redis/bin/redis-server(beforeSleep+0x43) [0x40e993]
[5761] 17 Jan 09:19:09 # /blade/exec/10.10.0.1/redis/bin/redis-server(aeMain+0x21) [0x40adf1]
[5761] 17 Jan 09:19:09 # /blade/exec/10.10.0.1/redis/bin/redis-server(main+0xfd) [0x40fd6d]
[5761] 17 Jan 09:19:09 # /lib/libc.so.6(__libc_start_main+0xfe) [0x7f0874056d8e]
[5761] 17 Jan 09:19:09 # /blade/exec/10.10.0.1/redis/bin/redis-server() [0x40a129]

-----------------------
The config is:

daemonize yes
pidfile /blade/run/pids/redis-10.10.0.1.pid

port 6379
bind 10.10.0.1

timeout 0

loglevel notice
logfile /blade/logs/redis-10.10.0.1.log

dir /blade-dbs/10.10.0.1/redis/
dbfilename dump.rdb
databases 1

maxclients 2048
maxmemory 128mb

appendonly no
vm-enabled no

glueoutputbuf yes
hash-max-zipmap-entries 64
hash-max-zipmap-value 512
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
activerehashing yes

Original issue reported on code.google.com by p...@p-dw.com on 17 Jan 2011 at 8:36

GoogleCodeExporter commented 8 years ago
Hello,

thank you for reporting; this definitely seems to be a bug, and it is probably
the first 2.2 crash reported.
I think it is caused by the interaction between blocking pop and clients closing
their connection. I'll fix it today, as otherwise it will block the 2.2 release.

Thank you for reporting!

For now, not stopping the workers should avoid the bug.

Salvatore

Original comment by anti...@gmail.com on 17 Jan 2011 at 8:53

GoogleCodeExporter commented 8 years ago
The issue was fixed by Pieter Noordhuis, thank you for reporting. You should
not experience any further problems when killing your clients.

Post mortem: the issue was caused by the fact that we moved blocking pop
operations to a new abstraction we have in the event loop. Basically, when
clients are ready to be unblocked, we no longer do this synchronously when the
PUSH happens against the list, since at that point we are in the context of
another client. Instead we put the client to unblock into a list of clients that
must be unblocked. These clients are unblocked in a function called beforeSleep()
that is called every time we are about to re-enter the event loop.

However, when we moved blocking pop to the new system, we forgot to perform
the appropriate cleanup of this list of clients to resume when a client was
freed. This had the effect of resuming a client that was no longer valid or
no longer existed, which caused the crash.
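For illustration, here is a minimal sketch of that mechanism and of the missing cleanup, with hypothetical names and simplified types (this is not the actual Redis source):

#include <stdio.h>
#include <stdlib.h>

typedef struct client {
    int fd;
    struct client *next_unblocked;      /* link in the pending-unblock list */
} client;

/* Clients that became ready while we were serving the pushing client;
 * they are resumed just before re-entering the event loop. */
static client *unblocked_clients = NULL;

/* Called from the PUSH code path: do not reply to the blocked client
 * synchronously (we are in the context of another client), just
 * remember that it must be unblocked. */
static void queueClientForUnblock(client *c) {
    c->next_unblocked = unblocked_clients;
    unblocked_clients = c;
}

/* Called every time we are about to re-enter the event loop. */
static void beforeSleep(void) {
    while (unblocked_clients) {
        client *c = unblocked_clients;
        unblocked_clients = c->next_unblocked;
        printf("unblocking client fd=%d\n", c->fd);  /* serve the BLPOP reply */
    }
}

/* The bug: the client was freed without being removed from
 * unblocked_clients, so beforeSleep() could touch freed memory.
 * The purge loop below is the missing cleanup. */
static void freeClient(client *c) {
    client **p = &unblocked_clients;
    while (*p) {
        if (*p == c) *p = c->next_unblocked;
        else p = &(*p)->next_unblocked;
    }
    free(c);
}

int main(void) {
    client *c = calloc(1, sizeof(*c));
    c->fd = 42;
    queueClientForUnblock(c);  /* a PUSH arrives for the key c is blocked on */
    freeClient(c);             /* ...but the client disconnects first */
    beforeSleep();             /* with the purge in place, nothing dangles */
    return 0;
}

With the purge in freeClient(), a client that disconnects while queued for unblocking is simply dropped from the list instead of being resumed after it has been freed.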

Original comment by anti...@gmail.com on 17 Jan 2011 at 9:20