Redis crashed on replicating with RDB error

GoogleCodeExporter commented 8 years ago

What version of Redis are you using, and on what kind of operating system?

2.2.8 with jemalloc-static branch

What is the problem you are experiencing?

One of our redis primaries lost its network connection to the rest of our 
machines for a minute. When the primary regained connectivity, the slave tried 
to re-sync, but when loading the DB in memory, crashed with the attached 
traceback.

What steps will reproduce the problem?

When we noticed the issue a few hours later, we re-synced the secondary and 
this time it went through ok.

Do you have an INFO output? Please paste it here.

Also included as part of crash.

If it is a crash, can you please paste the stack trace that you can find in
the log file or on standard output? This is really useful for us!

Please provide any additional information below.

[4943] 09 Aug 14:56:48 * 10000 changes in 30 seconds. Saving...
[4943] 09 Aug 14:56:49 * Background saving started by pid 32619
[4943] 09 Aug 14:57:15 # MASTER time out: no data nor PING received...
[4943] 09 Aug 14:57:15 * Connecting to MASTER...
[4943] 09 Aug 14:57:15 * MASTER <-> SLAVE sync started: SYNC sent
[32619] 09 Aug 14:59:28 * DB saved on disk
[4943] 09 Aug 14:59:29 * Background saving terminated with success
[4943] 09 Aug 15:00:50 * MASTER <-> SLAVE sync: receiving 6476466903 bytes from master
[4943] 09 Aug 15:04:48 * MASTER <-> SLAVE sync: Loading DB in memory
[4943] 09 Aug 15:05:26 # !!! Software Failure. Press left mouse button to continue
[4943] 09 Aug 15:05:26 # Guru Meditation: "Unknown RDB encoding type" #rdb.c:653
[4943] 09 Aug 15:05:26 # (forcing SIGSEGV in order to print the stack trace)
[4943] 09 Aug 15:05:26 # ======= Ooops! Redis 2.2.8 got signal: -11- =======
[4943] 09 Aug 15:05:26 # redis_version:2.2.8
redis_git_sha1:336dd92f
redis_git_dirty:1
arch_bits:64
multiplexing_api:epoll
process_id:4943
uptime_in_seconds:3777171
uptime_in_days:43
lru_clock:1266804
used_cpu_sys:487543.75
used_cpu_user:408085.88
used_cpu_sys_childrens:1381692.62
used_cpu_user_childrens:163462.16
connected_clients:0
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:3270445944
used_memory_human:3.05G
used_memory_rss:3272544256
mem_fragmentation_ratio:1.00
mem_allocator:jemalloc-2.2.1
loading:1
aof_enabled:0
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1312901969
bgrewriteaof_in_progress:0
total_connections_received:1
total_commands_processed:5429983271
expired_keys:0
evicted_keys:0
keyspace_hits:5299109390
keyspace_misses:4575106
hash_max_zipmap_entries:512
hash_max_zipmap_value:64
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:slave
master_host:10.120.37.111
master_port:6490
master_link_status:d
[4943] 09 Aug 15:05:26 # ./redis-server(_redisPanic+0x58) [0x42ded8]
[4943] 09 Aug 15:05:26 # ./redis-server(_redisPanic+0x58) [0x42ded8]
[4943] 09 Aug 15:05:26 # ./redis-server(rdbGenericLoadStringObject+0x7a) [0x41d85a]
[4943] 09 Aug 15:05:26 # ./redis-server(rdbLoadObject+0x331) [0x41dc11]
[4943] 09 Aug 15:05:26 # ./redis-server(rdbLoad+0x155) [0x41df55]
[4943] 09 Aug 15:05:26 # ./redis-server(readSyncBulkPayload+0xf0) [0x41c570]
[4943] 09 Aug 15:05:26 # ./redis-server(aeProcessEvents+0x153) [0x40c9a3]
[4943] 09 Aug 15:05:26 # ./redis-server(aeMain+0x2e) [0x40cbee]
[4943] 09 Aug 15:05:26 # ./redis-server(main+0xf7) [0x411c77]
[4943] 09 Aug 15:05:26 # /lib/libc.so.6(__libc_start_main+0xfd) [0x7f85f4c13c4d]
[4943] 09 Aug 15:05:26 # ./redis-server() [0x40bf19]

Original issue reported on code.google.com by mik...@instagram.com on 9 Aug 2011 at 8:28

GoogleCodeExporter commented 8 years ago

What is the Redis version of the master? Even though a RDB incompatibility 
should be detected when reading the first couple of bytes, this could be 
useful. Do you still have the temporary RDB file that this slave was reading 
when it died? If so, could you run the redis-check-dump utility against it to 
see if it is corrupt to begin with?
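
For readers following along, here is a minimal Python sketch of the kind of header check described above. It assumes the usual RDB layout, where a dump starts with the five bytes `REDIS` followed by a four-digit ASCII format version. The function names and the `max_supported` parameter are hypothetical and only serve this illustration; this is not the code Redis itself runs.

```python
import io

RDB_MAGIC = b"REDIS"  # first five bytes of an RDB dump


def read_rdb_version(stream):
    """Return the format version from the 9-byte RDB header, or raise ValueError."""
    header = stream.read(9)
    if len(header) != 9 or header[:5] != RDB_MAGIC:
        raise ValueError("not an RDB file: bad magic")
    digits = header[5:]
    if not digits.isdigit():
        raise ValueError("not an RDB file: bad version field")
    return int(digits.decode("ascii"))


def check_rdb_version(stream, max_supported):
    """Reject a dump newer than max_supported before loading any keys.

    max_supported is a hypothetical parameter for this sketch.
    """
    version = read_rdb_version(stream)
    if version > max_supported:
        raise ValueError(
            f"RDB version {version} is newer than supported version {max_supported}"
        )
    return version
```

A check like this only catches a version mismatch or a damaged header. Corruption further into the file would pass it and fail later during loading.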

Thanks,
Pieter

Original comment by pcnoordh...@gmail.com on 11 Aug 2011 at 12:16

GoogleCodeExporter commented 8 years ago

Crosspost from ML (by Salvatore):

That's the problem you are experiencing IMHO:

What's new in Redis 2.2.11
==========================

* Solved a never reported but possibly critical bug in the AOF and RDB
persistence, introduced with the new version of the iterator: In very rare
circumstances the AOF (after rewrite) or the rdb file may contain the same
key more than one time.

Original comment by pcnoordh...@gmail.com on 11 Aug 2011 at 1:21

GoogleCodeExporter commented 8 years ago

I don't think this issue is related to the iterator bugs: the "Unknown RDB 
encoding type" error is more likely to point in the direction of a byte-level 
corruption. A single key being present more than once would result in an error 
in the main loop of the code responsible for loading an RDB.
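
To illustrate why a single bad byte is enough to reach this panic, here is a rough Python sketch of how a string's length prefix is decoded, assuming the RDB layout of this era as I understand it: the top two bits of the first byte select a 6-bit, 14-bit, or 32-bit length, or a "special encoding" whose low six bits must be one of four known values. The constant and function names are my own for this sketch and do not come from the Redis source.

```python
import io
import struct

# Length-prefix kinds, taken from the top two bits of the first byte.
RDB_6BITLEN, RDB_14BITLEN, RDB_32BITLEN, RDB_ENCVAL = 0, 1, 2, 3

# Special encodings a string may use when the prefix kind is RDB_ENCVAL.
KNOWN_ENCODINGS = {0: "int8", 1: "int16", 2: "int32", 3: "lzf"}


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("unexpected end of RDB data")
    return data


def load_len(stream):
    """Decode a length prefix and return (value, is_encoded)."""
    first = read_exact(stream, 1)[0]
    kind = (first & 0xC0) >> 6
    if kind == RDB_6BITLEN:
        return first & 0x3F, False
    if kind == RDB_14BITLEN:
        return ((first & 0x3F) << 8) | read_exact(stream, 1)[0], False
    if kind == RDB_32BITLEN:
        # The 32-bit length is stored in network byte order.
        return struct.unpack(">I", read_exact(stream, 4))[0], False
    return first & 0x3F, True  # special encoding, value is the encoding type


def string_encoding(stream):
    """Describe how the next string is stored, or raise on an unknown encoding."""
    value, is_encoded = load_len(stream)
    if not is_encoded:
        return f"raw string of {value} bytes"
    if value not in KNOWN_ENCODINGS:
        raise ValueError(f"Unknown RDB encoding type {value}")
    return KNOWN_ENCODINGS[value]
```

With this layout, any prefix byte from `0xC4` to `0xFF` names an encoding that does not exist, so a flipped bit or a loader that has lost its place in the stream can land here without any key being duplicated.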

Original comment by pcnoordh...@gmail.com on 11 Aug 2011 at 1:34

GoogleCodeExporter commented 8 years ago

Pieter: I would close this unless it is also reported against a more recent 
version of Redis. Do you agree?

Salvatore

Original comment by anti...@gmail.com on 14 Sep 2011 at 3:41

GoogleCodeExporter commented 8 years ago

We haven't seen this again, so I think closing makes sense. We will 
re-report or re-open if we ever see it again.

Original comment by mik...@instagram.com on 14 Sep 2011 at 9:49

GoogleCodeExporter commented 8 years ago

Thanks. Now that I think about it, I'm sure that attaching a 2.2.x slave to 
a >= 2.4 master will generate such an error in most cases. Maybe this was the 
cause, or maybe not, but in general incompatible versions should never cause 
a crash.

So I'll assume (even if this may not be the case) that this was the reason and 
will try to fix this bug.

However, at some point we changed the 2.4 RDB file version, so in newer versions 
of Redis 2.4 the error you'll see with a 2.2 slave should be clearer than a 
crash, but it is worth investigating. These are exactly the kind of 
issues I want to solve before continuing with the scripting/cluster development.

Marking this bug as Accepted for the above reasons. Thanks for the help!

Cheers,
Salvatore

Original comment by anti...@gmail.com on 14 Sep 2011 at 10:22

Changed state: Accepted

Redis crashed on replicating with RDB error #629