Netflix / Priam

Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra.
Apache License 2.0

Avoid polluting the page cache when backing up files #316

Closed jasobrown closed 9 years ago

jasobrown commented 10 years ago

The current implementation naively reads the files to be backed up off disk, which puts those blocks into the OS page cache. If you have a lot of memory on the machine, the cost is negligible (assuming you can even see it). However, on machines with less memory, pulling all the cold SSTables through the page cache will quickly evict the hot blocks, and you can run into performance degradation from competing IO and page cache thrashing.

It should be possible to avoid reading the backup files through the page cache; I think it's just an fadvise flag. However, I don't believe those options are exposed in the JDK, so we might need to use JNA to set it. Cassandra itself already uses a lot of these kinds of techniques, so it should be reasonably straightforward to add this to Priam.

danchia commented 10 years ago

@jasobrown I'm happy to take a stab at this - do you have a quick pointer into where in the Cassandra code they do this? If not I'll go code diving

jasobrown commented 10 years ago

Take a look at the CLibrary.trySkipCache() method: https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/utils/CLibrary.java#L214. We'll need to pull in JNA, of course, but I don't see that as a problem.

Reading the man page for posix_fadvise, however, I see this:

POSIX_FADV_DONTNEED attempts to free cached pages associated with the specified region.
This is useful, for example, while streaming large files.
A program may periodically request the kernel to free cached data that has already been used,
so that more useful cached pages are not discarded instead.

That's the flag I was thinking about for this. We'll have to dig in further to understand the implications - as you don't want to evict good page cache entries (used by c*) due to priam's setting that flag (on same files) during backups. However, given memory constraints on some machines, this may be a better trade off than the current implementation. Only testing will tell.
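For illustration only, here is a minimal sketch of the "stream, then drop" pattern the man page describes. It is written in Python because `os.posix_fadvise` is available there without extra libraries; Priam is Java and would need JNA for the same call, as noted above. The function name `read_for_backup`, the `sink` callback, and the 1 MiB chunk size are assumptions made up for this sketch, not anything from Priam or Cassandra.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary size chosen for this sketch


def read_for_backup(path, sink, chunk_size=CHUNK):
    """Stream `path` into `sink` (a callable taking bytes) and advise the
    kernel to drop each chunk's cached pages once it has been consumed.
    Returns the total number of bytes read. Hypothetical helper."""
    fd = os.open(path, os.O_RDONLY)
    total = 0
    try:
        while True:
            data = os.read(fd, chunk_size)
            if not data:
                break
            sink(data)
            if hasattr(os, "posix_fadvise"):
                try:
                    # Advise on the byte range that was just consumed.
                    os.posix_fadvise(fd, total, len(data),
                                     os.POSIX_FADV_DONTNEED)
                except OSError:
                    pass  # the call is only a hint, so a failure is ignored
            total += len(data)
    finally:
        os.close(fd)
    return total
```

Note that the hint applies to the file's cached pages no matter which process brought them in, which is exactly the concern above: if c* has hot pages for the same file, this call may drop them too.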

danchia commented 10 years ago

Interesting.

I agree that POSIX_FADV_DONTNEED might not be the best idea, since it might force flush cached pages that we actually want.

I was thinking of using O_DIRECT instead, which uses direct disk access. There are apparently some kernel level restrictions on IO alignment when using O_DIRECT, but I'm sure those can be worked out. Using O_DIRECT should let us bypass the buffer cache and basically achieve the effect we intend.

(Someone talks about his experience with Lucene near the end of this post: http://blog.mikemccandless.com/2010/06/lucene-and-fadvisemadvise.html)
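To make the alignment restriction concrete, here is a rough sketch of an O_DIRECT read loop, in Python for brevity (Priam itself is Java). The helper name `read_direct`, the `sink` callback, and the 4096-byte block size are assumptions for illustration; the real alignment requirement depends on the filesystem and device. An anonymous `mmap` is used as the buffer because it is page-aligned, and the sketch falls back to a normal buffered open where the filesystem rejects O_DIRECT.

```python
import mmap
import os


def read_direct(path, sink, blocks_per_read=256, block_size=4096):
    """Read `path` with O_DIRECT where available and pass each chunk to
    `sink` (a callable taking bytes). Returns total bytes read.
    Hypothetical helper; block_size=4096 is an assumed alignment."""
    flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
    try:
        fd = os.open(path, flags)
    except OSError:
        # Some filesystems reject O_DIRECT; fall back to a buffered read.
        fd = os.open(path, os.O_RDONLY)
    # Anonymous mmap memory is page-aligned, which satisfies the buffer
    # alignment rule; its length is a multiple of block_size.
    buf = mmap.mmap(-1, blocks_per_read * block_size)
    total = 0
    try:
        with os.fdopen(fd, "rb", buffering=0) as f:
            while True:
                n = f.readinto(buf)
                if not n:
                    break
                sink(buf[:n])
                total += n
                if n < len(buf):
                    # A short read on a regular file means end of file; stop
                    # here rather than read again from an unaligned offset.
                    break
    finally:
        buf.close()
    return total
```

Unlike the fadvise approach, this never touches pages that are already cached for the file, but every read goes to disk even when the data happens to be in memory.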

jasobrown commented 10 years ago

From the open() man page:

"The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances."--Linus

:)

Doing some googling on O_DIRECT, it seems like there are, at best, mixed feelings about it. However, that doesn't mean you shouldn't try out some experiments to see what happens. Looking forward to seeing how this goes, thanks for taking it on!

timiblossom commented 10 years ago

If anybody is facing this problem, please upgrade your instance type to one with sufficient memory (non-JVM memory that the OS can use for the file cache). Since I haven't heard about this for a while, I am going to close it now, as it is not a problem for us.

timiblossom commented 9 years ago

@danchia since you are still using Priam, let's keep this open if you have the patch :).

danchia commented 9 years ago

It turns out we ended up moving to bigger machines for other reasons, so this didn't affect us anymore.