card323 / redis

Automatically exported from code.google.com/p/redis

Redis 2.2.11 hangs while dumping #602

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Redis 2.2.11

While inserting lots of datasets (1k/sec) and then performing a SAVE or BGSAVE, 
Redis hangs while the save runs :( It is then not possible to GET or SET any data.

Is there a workaround for that?

Thanks,
Robert

Original issue reported on code.google.com by r...@robhost.de on 5 Jul 2011 at 2:45

GoogleCodeExporter commented 8 years ago
Gaurav, Pieter, thanks for spotting/tracing the AOF rewrite bug. It is fixed on 
2.4 and unstable. Releasing RC6 shortly with the fix. Cheers.

Original comment by anti...@gmail.com on 9 Aug 2011 at 9:47

GoogleCodeExporter commented 8 years ago
Pieter,

I am experiencing the same iowait issue on close (due to delete of the old 
file) on a machine with no RAID disk as with the RAIDed disk. So the problem, 
as you rightly pointed out, is with the delete of the old file.  

One thing I completely overlooked was the delete performance during rename. 
Indeed, the rename now takes longer if moved to after the close.

[8002] 08 Aug 22:08:18.482896 * Background append only file rewriting started by pid 8340
[8340] 08 Aug 22:11:37.948196 * SYNC append only file rewrite performed
[8002] 08 Aug 22:11:38.291346 * Background append only file rewriting terminated with success
[8002] 08 Aug 22:11:38.291439 * Parent diff flushed into the new append log file with success (0 bytes)
[8002] 08 Aug 22:11:38.291458 * Append only file successfully rewritten.
[8002] 08 Aug 22:11:38.291477 * Before closing appendonly file.
[8002] 08 Aug 22:11:38.291497 * Before aof_fsync.
[8002] 08 Aug 22:11:38.291530 * The new append only file was selected for future appends.
[8002] 08 Aug 22:11:38.291550 * After aofUpdateCurrentSize.
[8002] 08 Aug 22:12:01.032106 * After rename.

The right approach will be to rename to a temporary file and let a utility 
thread delete the old file. Don't know what the priority for fixing that would 
be for you guys.

In lieu of the utility thread approach I plan to deploy a dual rename approach. 
I tested a simple approach of renaming the old appendonly file to 
appendonly.timestamp and letting a cron job take care of the delete:

        /* Step 1: move the old AOF aside under a timestamped name (a cron job deletes it later). */
        sprintf(appendonlyBackupfilename, "%s.%ld", server.appendfilename, (long)time(NULL));

        if (rename(server.appendfilename, appendonlyBackupfilename) == -1) {
            redisLog(REDIS_WARNING,"Can't rename the old append only file to backup file: %s", strerror(errno));
            if (server.appendfd != -1) close(server.appendfd);
            goto cleanup;
        }
        redisLog(REDIS_NOTICE,"After rename appendfilename to appendonlyBackupfilename.");

        /* Step 2: move the freshly rewritten AOF into place. */
        if (rename(tmpfile,server.appendfilename) == -1) {
            redisLog(REDIS_WARNING,"Can't rename the temp append only file into the stable one: %s", strerror(errno));
            if (server.appendfd != -1) close(server.appendfd);
            goto cleanup;
        }
        redisLog(REDIS_NOTICE,"After rename tmpfile to appendfilename.");

This runs as expected and does not block the main thread:

[16138] 09 Aug 12:23:42.856713 * Background append only file rewriting started by pid 16166
[16166] 09 Aug 12:27:03.462604 * SYNC append only file rewrite performed
[16138] 09 Aug 12:27:03.777022 * Background append only file rewriting terminated with success
[16138] 09 Aug 12:27:03.777089 * Parent diff flushed into the new append log file with success (0 bytes)
[16138] 09 Aug 12:27:03.777132 * Append only file successfully rewritten.
[16138] 09 Aug 12:27:03.777151 * Before closing appendonly file.
[16138] 09 Aug 12:27:03.777171 * Before aof_fsync.
[16138] 09 Aug 12:27:03.777199 * The new append only file was selected for future appends.
[16138] 09 Aug 12:27:03.777219 * After aofUpdateCurrentSize.
[16138] 09 Aug 12:27:03.777267 * After rename appendfilename to appendonlyBackupfilename.
[16138] 09 Aug 12:27:03.777296 * After rename tmpfile to appendfilename.

Let me know what your thoughts are for this approach.

Thanks for all your help. I really appreciate your wisdom.

Thanks,
Gaurav.

Original comment by gauravk...@gmail.com on 9 Aug 2011 at 9:01

GoogleCodeExporter commented 8 years ago
Gaurav,

The main problem with the 2-step rename is that there is no "appendonly.aof" if 
the process crashes in between the renames. I have created a series of patches 
that use libeio to defer closing the old file descriptor to the background. 
Initial tests show that it works as expected. The patchset also includes some 
speedups (~20% throughput improvement when AOF is enabled). The code is located 
here: https://github.com/pietern/redis/tree/2.4-eio. Can you check if this 
resolves the issue for you?
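
For readers wondering why the close is the step worth deferring, here is a small sketch (plain POSIX calls, not the libeio patchset; both function names are hypothetical). A single rename() over the live file is atomic, but the old file's data stays allocated for as long as a descriptor is open on it, so the delete cost lands on whoever performs the last close().

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Number of directory entries still naming the file behind `fd`,
 * or -1 on error. */
long links_behind_fd(int fd) {
    struct stat st;
    if (fstat(fd, &st) == -1) return -1;
    return (long)st.st_nlink;
}

/* Rename `newfile` over `target` while `oldfd` (open on the old target)
 * stays open. Returns the old file's remaining link count, or -1.
 * On typical local filesystems the count is expected to drop to 0: the old
 * file has no name left and is kept alive only by `oldfd`, so its blocks
 * are released when that descriptor is finally closed. */
long replace_keeping_old_open(const char *newfile, const char *target,
                              int oldfd) {
    if (rename(newfile, target) == -1) return -1;
    return links_behind_fd(oldfd);
}
```

Because the target name always points at either the old or the new file, this ordering has no window without an "appendonly.aof"; only the final close() of the old descriptor needs to move off the main thread.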

Thanks,
Pieter

Original comment by pcnoordh...@gmail.com on 18 Aug 2011 at 2:31

GoogleCodeExporter commented 8 years ago
Pieter,

This looks good. I will give this a shot.

re: "The main problem with the 2-step rename is that there is no 
"appendonly.aof" if the process crashes in between the renames."
In a non-transactional system this is always the case. This is something I can 
live with for now.

BTW, for others who are facing this problem, another low-tech alternate 
approach I had to use on a production system which I cannot take out of 
rotation to upgrade, is as follows:

redis> bgrewriteaof
sh> mv appendonly.aof appendonly.aof.<ts>

The redis process continues to write to appendonly.aof.<ts>. When the 
background process completes, redis is able to rename to appendonly.aof without 
incurring the cost of the delete. I then just delete appendonly.aof.<ts> at 
a later time, once a [ -e appendonly.aof ] test succeeds. Again, low-tech, but 
it works for me for now.
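
The reason this works is that an open descriptor refers to the file itself, not to its name, so writes keep landing in the renamed file. A minimal sketch of that behaviour (the helper name append_and_size is hypothetical, and nothing here is Redis code):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Append `msg` through an already-open descriptor, then return the size of
 * the file currently found at `path`, or -1 on error. If `fd` was opened
 * before the file was renamed to `path`, the size still grows, because the
 * descriptor follows the file across the rename. */
long append_and_size(int fd, const char *msg, const char *path) {
    struct stat st;
    size_t len = strlen(msg);
    if (write(fd, msg, len) != (ssize_t)len) return -1;
    if (stat(path, &st) == -1) return -1;
    return (long)st.st_size;
}
```

To try it: open a file with O_WRONLY | O_CREAT | O_APPEND, rename() it to a timestamped name, and call append_and_size() with the new name. The old name is then free for the rewritten file to take.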

Thanks again for the patch. I will get back to you regarding the performance of 
the eio_close.

Gaurav.

Original comment by gauravk...@gmail.com on 18 Aug 2011 at 11:27

GoogleCodeExporter commented 8 years ago
Hello, we have a new branch with the fix made by Pieter, a bit modified in 
order to avoid using libeio (we use a background job system that for now only 
handles closes, but in the future may do much more, like write/fsync against 
the AOF file itself and so forth).
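
To illustrate the general shape of such a background job system, here is a rough sketch assuming POSIX threads: one worker thread drains a queue of close jobs. All names (close_queue and friends) are made up for this example; this is not the code from the branch.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical job queue; names are illustrative only. */
struct close_job {
    int fd;
    struct close_job *next;
};

struct close_queue {
    pthread_mutex_t lock;
    pthread_cond_t ready;
    struct close_job *head, *tail;
    int stopping;               /* set once no more jobs will arrive */
    pthread_t worker;
};

static void *close_queue_worker(void *arg) {
    struct close_queue *q = arg;
    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->head == NULL && !q->stopping)
            pthread_cond_wait(&q->ready, &q->lock);
        if (q->head == NULL) break;         /* stopping and fully drained */
        struct close_job *job = q->head;
        q->head = job->next;
        if (q->head == NULL) q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        close(job->fd);                     /* the slow part, off the main thread */
        free(job);
        pthread_mutex_lock(&q->lock);
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Initialise the queue and start its worker. Returns 0 or -1. */
int close_queue_start(struct close_queue *q) {
    if (pthread_mutex_init(&q->lock, NULL) != 0) return -1;
    if (pthread_cond_init(&q->ready, NULL) != 0) {
        pthread_mutex_destroy(&q->lock);
        return -1;
    }
    q->head = q->tail = NULL;
    q->stopping = 0;
    if (pthread_create(&q->worker, NULL, close_queue_worker, q) != 0) {
        pthread_cond_destroy(&q->ready);
        pthread_mutex_destroy(&q->lock);
        return -1;
    }
    return 0;
}

/* Hand `fd` to the worker; the caller returns immediately. */
int close_queue_submit(struct close_queue *q, int fd) {
    struct close_job *job = malloc(sizeof(*job));
    if (job == NULL) return -1;
    job->fd = fd;
    job->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL) q->tail->next = job; else q->head = job;
    q->tail = job;
    pthread_cond_signal(&q->ready);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Let the worker finish the remaining jobs, then stop it and clean up. */
void close_queue_stop(struct close_queue *q) {
    pthread_mutex_lock(&q->lock);
    q->stopping = 1;
    pthread_cond_signal(&q->ready);
    pthread_mutex_unlock(&q->lock);
    pthread_join(q->worker, NULL);
    pthread_cond_destroy(&q->ready);
    pthread_mutex_destroy(&q->lock);
}
```

A job struct with only an fd keeps the sketch small; supporting other operations later (such as fsync) would presumably mean adding a job type field and switching on it in the worker.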

For now the branch is based on the unstable branch code, but will be ported to 
2.4 tomorrow. I'll post the 2.4-bio branch here when available tomorrow.

Cheers,
Salvatore

Original comment by anti...@gmail.com on 13 Sep 2011 at 4:43

GoogleCodeExporter commented 8 years ago
The async close() on AOF rewrite is now merged in both the unstable and 2.4 
branches on github.
I'm going to test it a bit more by hand, but everything seems to be working 
well.

Please test the new code if you can and report back here. Keeping the issue 
open for now. Thanks for your help in solving this issue.

Salvatore

Original comment by anti...@gmail.com on 14 Sep 2011 at 8:56

GoogleCodeExporter commented 8 years ago

Original comment by anti...@gmail.com on 14 Sep 2011 at 8:57

GoogleCodeExporter commented 8 years ago
Salvatore,

Thanks very much for this. I really appreciate the work you, Pieter and the 
other redis contributors are doing. Redis is proving to be a very important 
piece of our architecture.

I will test this out on our test servers today and will report back to you. As 
a general rule I prefer running stable releases on production. Is there an 
estimate on when we can get 2.4 branch to stable?

Thanks,
Gaurav.

Original comment by gauravk...@gmail.com on 14 Sep 2011 at 2:41

GoogleCodeExporter commented 8 years ago
What's the state of this? I am being afflicted by something pretty close to it. 
I have a 20+ GB (in memory) dataset, and Redis hangs for minutes when doing a 
BGREWRITEAOF. We updated from 2.2 to 2.4.5 but the problem persists.

Original comment by jose.junior on 16 Jan 2012 at 1:28

GoogleCodeExporter commented 8 years ago
I have the same problem on Ubuntu with software RAID, WITHOUT AOF, just plain 
bgsaving: saving in the background somehow blocks the main thread. I patched 
redis to log more, and calling "fsync" leads to the main thread hanging.

Original comment by bogu...@gmail.com on 13 Mar 2012 at 12:29