Lachim / redis

Automatically exported from code.google.com/p/redis

Timeout receiving bulk data from MASTER... #435

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Get a large 64-bit Redis installation with just a master.
2. The master holds 5.37G of data.
3. Start up a slave and let it SYNC.

What is the expected output? What do you see instead?
I'd expect the SYNC to work, no matter how long it takes to transfer the
dump file from the master.

Instead I get "Timeout receiving bulk data from MASTER..." and the slave
retries the SYNC, starting over from scratch.

What version of the product are you using? On what operating system?
redis_version:2.1.10
arch_bits:64

CentOS 5.4, 64-bit

Please provide any additional information below.
Is this configurable somewhere?

Original issue reported on code.google.com by jam...@gmail.com on 18 Jan 2011 at 12:50

GoogleCodeExporter commented 8 years ago
From replication.c

#define REDIS_REPL_TRANSFER_TIMEOUT 60

void replicationCron(void) {
    /* Bulk transfer I/O timeout? */
    if (server.masterhost && server.replstate == REDIS_REPL_TRANSFER &&
        (time(NULL)-server.repl_transfer_lastio) > REDIS_REPL_TRANSFER_TIMEOUT)
    {
        redisLog(REDIS_WARNING,"Timeout receiving bulk data from MASTER...");
        replicationAbortSyncTransfer();
    }
    /* ... remainder of replicationCron() not quoted here ... */
}

Original comment by jam...@gmail.com on 18 Jan 2011 at 1:54

GoogleCodeExporter commented 8 years ago
Can REDIS_REPL_TRANSFER_TIMEOUT be made configurable?

Original comment by jam...@gmail.com on 18 Jan 2011 at 1:54

GoogleCodeExporter commented 8 years ago
Hello, this can be made configurable, but the idea was that the timeout does
not apply to the total bulk transfer time, only to an interval in which we
receive no data at all. I guess something is not working as expected. For now
the obvious workaround is to change the "60" to something higher, but I want to
understand why it is not doing what I expected, that is, aborting only
if the data transfer stalls completely for 60 seconds or more.

Original comment by anti...@gmail.com on 18 Jan 2011 at 3:30
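
To illustrate the intended semantics described in the comment above (an inactivity timeout rather than a cap on total transfer time), here is a minimal sketch. It is not the Redis source: `transfer_state`, `transfer_note_io` and `transfer_timed_out` are hypothetical names, and only the value 60 comes from the quoted `REDIS_REPL_TRANSFER_TIMEOUT`.

```c
#include <assert.h>
#include <time.h>

/* Seconds of silence tolerated; mirrors the 60 quoted from replication.c. */
#define TRANSFER_TIMEOUT 60

/* Hypothetical, simplified stand-in for the slave-side transfer state. */
typedef struct {
    time_t last_io; /* time of the most recent successful read */
} transfer_state;

/* Call whenever at least one byte arrives from the master. */
void transfer_note_io(transfer_state *t, time_t now) {
    assert(t != NULL);
    t->last_io = now;
}

/* Returns 1 only if nothing was received for more than TRANSFER_TIMEOUT
 * seconds. The total duration of the transfer is never considered. */
int transfer_timed_out(const transfer_state *t, time_t now) {
    assert(t != NULL);
    return (now - t->last_io) > TRANSFER_TIMEOUT;
}
```

With this shape, a transfer that takes hours still succeeds as long as some data arrives at least once a minute. The check fires only if `last_io` is never refreshed, which is the situation to look for when the timeout triggers unexpectedly.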

GoogleCodeExporter commented 8 years ago
I think the issue is that the master takes more than 60 seconds to create the
rdb file for the sync, so the transfer has not even started before the
slave stops listening for it. It may need an initial setup timeout that is
separate from the transfer timeout, or else some sort of heartbeat while the
master prepares the file. For really large DBs, the time to create the dump on
the master could be substantially greater than 60 seconds.

Original comment by Neill.Br...@gmail.com on 18 Jan 2011 at 9:06
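
The two ideas in the comment above (a separate setup timeout and a heartbeat while the master prepares the dump) could be combined roughly as follows. This is only a sketch under stated assumptions and not what the eventual fix does: the phase names, the 600-second setup limit, and the choice of a lone newline as the keep-alive byte are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define SETUP_TIMEOUT    600 /* hypothetical: silence allowed while waiting for the dump */
#define TRANSFER_TIMEOUT 60  /* silence allowed once bulk data is flowing */

typedef enum { PHASE_WAIT_DUMP, PHASE_TRANSFER } sync_phase;

/* Hypothetical slave-side view of an in-progress SYNC. */
typedef struct {
    sync_phase phase;
    time_t last_io;
} sync_state;

/* Every received byte counts as activity. A leading '\n' is assumed to be
 * a keep-alive sent while the master is still writing the dump, so it
 * refreshes the timer without starting the transfer phase. */
void sync_on_data(sync_state *s, const char *buf, size_t len, time_t now) {
    assert(s != NULL && buf != NULL);
    if (len == 0) return;
    s->last_io = now;
    if (s->phase == PHASE_WAIT_DUMP && buf[0] != '\n')
        s->phase = PHASE_TRANSFER;
}

/* The limit depends on the phase, so a slow dump does not abort the SYNC. */
int sync_timed_out(const sync_state *s, time_t now) {
    assert(s != NULL);
    time_t limit = (s->phase == PHASE_WAIT_DUMP) ? SETUP_TIMEOUT
                                                 : TRANSFER_TIMEOUT;
    return (now - s->last_io) > limit;
}
```

Either mechanism alone would cover the reported case. The heartbeat has the advantage that the setup limit does not need to be tuned to the size of the data set.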

GoogleCodeExporter commented 8 years ago
Any chance this can be looked into before 2.2.0 final?

Original comment by jam...@gmail.com on 20 Jan 2011 at 8:54

GoogleCodeExporter commented 8 years ago
This fix will definitely make it into 2.2 stable :)

Original comment by anti...@gmail.com on 20 Jan 2011 at 9:00

GoogleCodeExporter commented 8 years ago
Fixed on master -> 
https://github.com/antirez/redis/commit/89a1433e69db5f7c996484672437616a16a6fe0a

This also introduced an explicit PING on the master-slave link, so a slave
is now much better at detecting a broken master link. I want to test it a bit
more before backporting to 2.2.

If you have a chance to test it in the unstable branch, please let me know.

Cheers,
Salvatore

Original comment by anti...@gmail.com on 20 Jan 2011 at 12:21
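
As a rough illustration of how a periodic PING helps a slave detect a dead link, here is a hedged sketch. It does not reproduce the linked commit: the struct names, the 10-second ping period, and the 60-second link timeout are assumptions chosen for the example.

```c
#include <assert.h>
#include <time.h>

#define PING_PERIOD  10 /* hypothetical: master pings at most this often */
#define LINK_TIMEOUT 60 /* hypothetical: silence after which the link is dead */

typedef struct { time_t last_ping_sent; } master_side;
typedef struct { time_t last_interaction; } slave_side;

/* Master: returns 1 (and records the time) when a PING is due. */
int master_should_ping(master_side *m, time_t now) {
    assert(m != NULL);
    if (now - m->last_ping_sent >= PING_PERIOD) {
        m->last_ping_sent = now;
        return 1;
    }
    return 0;
}

/* Slave: any traffic from the master, PING included, proves the link is up. */
void slave_on_traffic(slave_side *s, time_t now) {
    assert(s != NULL);
    s->last_interaction = now;
}

/* Slave: with regular PINGs guaranteed, prolonged silence means a broken link. */
int slave_link_broken(const slave_side *s, time_t now) {
    assert(s != NULL);
    return (now - s->last_interaction) > LINK_TIMEOUT;
}
```

The point of the PING is that an idle master no longer looks the same as an unreachable one: without it, a slave attached to a master that receives no writes cannot tell silence from failure.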

GoogleCodeExporter commented 8 years ago
Just tested with a master and slave setup that took 62 seconds on the master
to write the rdb file for SYNC to disk.

SYNC went fine. I attach log files.

Original comment by jam...@gmail.com on 24 Jan 2011 at 5:25

Attachments: