archiecobbs / s3backer

FUSE/NBD single file backing store via Amazon S3
Other
535 stars 77 forks source link

Reduce redundant page cache entries when using disk-based block cache. #176

Closed Nikratio closed 2 years ago

Nikratio commented 2 years ago

Currently, there will often be two copies of the same data in the page cache: one for the data in the FUSE filesystem, and one for the (same) data in the block cache file.

This patch adds explicit posix_fadvise() hints to the kernel that the page cache entries for the block cache file are not needed.

It is not clear if this will work when huge pages are in use, but will do no harm in this case.

It would be preferable if there was a way to flag the entire file as not needing to be cached, but unfortunately this isn't possible (O_DIRECT comes close, but depends on support from the underlying filesystem and requires writes and reads to be page-aligned).

archiecobbs commented 2 years ago

It would be preferable if there was a way to flag the entire file as not needing to be cached

Why wouldn't something like this work?

diff --git a/configure.ac b/configure.ac
index f8bbe2a..c2ab573 100644
--- a/configure.ac
+++ b/configure.ac
@@ -104,6 +104,7 @@ esac

 # Check for some O/S specific functions
 AC_CHECK_DECLS(fdatasync)
+AC_CHECK_DECLS([posix_fadvise], [], [], [[#include <fcntl.h>]])

 # Check for required header files
 AC_CHECK_HEADERS(assert.h ctype.h curl/curl.h err.h errno.h expat.h pthread.h stdarg.h stddef.h stdint.h stdio.h stdlib.h string.h syslog.h time.h unistd.h sys/queue.h sys/statvfs.h openssl/bio.h openssl/buffer.h openssl/evp.h openssl/hmac.h openssl/md5.h zlib.h, [],
diff --git a/dcache.c b/dcache.c
index a7ee732..b5877a4 100644
--- a/dcache.c
+++ b/dcache.c
@@ -321,6 +321,14 @@ retry:
         }
     }

+#endif
+
+#if HAVE_DECL_POSIX_FADVISE
+
+    /* Advise the kernel to not cache the cache file (redundant if already caching the FUSE file) */
+    if ((r = posix_fadvise(priv->fd, priv->data, 0, POSIX_FADV_DONTNEED)) != 0)
+        (*priv->log)(LOG_ERR, "error posix_fadvise()'ing cache file `%s': %s", priv->filename, strerror(r));
+
 #endif

     /* Done */
archiecobbs commented 2 years ago

It would be preferable if there was a way to flag the entire file as not needing to be cached

Why wouldn't something like this work?

diff --git a/configure.ac b/configure.ac
index f8bbe2a..c2ab573 100644
--- a/configure.ac
+++ b/configure.ac
@@ -104,6 +104,7 @@ esac

 # Check for some O/S specific functions
 AC_CHECK_DECLS(fdatasync)
+AC_CHECK_DECLS([posix_fadvise], [], [], [[#include <fcntl.h>]])

 # Check for required header files
 AC_CHECK_HEADERS(assert.h ctype.h curl/curl.h err.h errno.h expat.h pthread.h stdarg.h stddef.h stdint.h stdio.h stdlib.h string.h syslog.h time.h unistd.h sys/queue.h sys/statvfs.h openssl/bio.h openssl/buffer.h openssl/evp.h openssl/hmac.h openssl/md5.h zlib.h, [],
diff --git a/dcache.c b/dcache.c
index a7ee732..b5877a4 100644
--- a/dcache.c
+++ b/dcache.c
@@ -321,6 +321,14 @@ retry:
         }
     }

+#endif
+
+#if HAVE_DECL_POSIX_FADVISE
+
+    /* Advise the kernel to not cache the cache file (redundant if already caching the FUSE file) */
+    if ((r = posix_fadvise(priv->fd, priv->data, 0, POSIX_FADV_DONTNEED)) != 0)
+        (*priv->log)(LOG_ERR, "error posix_fadvise()'ing cache file `%s': %s", priv->filename, strerror(r));
+
 #endif

     /* Done */
Nikratio commented 2 years ago

This doesn't work because posix_fadvise() only takes effect once, it applies to data that is in the page cache at that point. It does not affect any data that is loaded afterwards.

archiecobbs commented 2 years ago

See if 018964d accomplishes what you want.

From the man page for posix_fadvise(2) I don't see any need to worry about page alignment or page size; it's up to the kernel to sort out those issues.

Nikratio commented 2 years ago

From posix_fadvise(2):

Requests to discard partial pages are ignored. It is preferable to preserve needed data than discard unneeded data. If the application requires that data be considered for discarding, then offset and len must be page-aligned.

Are we talking about the same manpage?

Nikratio commented 2 years ago

https://github.com/archiecobbs/s3backer/commit/018964d2a70e387a781f10acc4204a9a1ea33ccd looks correct, yes.

I'm surprised that you introduced a new flag for this. Do you have a case in mind where this would not be desirable?

Nikratio commented 2 years ago

Btw, I would have preferred for you to merge my pull request instead of creating your own commit based on it. That way, original authorship would have been preserved.

archiecobbs commented 2 years ago

Are we talking about the same manpage?

No - but I interpreted that paragraph differently, i.e., to mean that any partial pages at the beginning or end of the range would be ignored but the page-aligned middle portion would not... but re-reading it it's not clear and your interpretation is a safer one.

In any case, in my patch all requests will be page-aligned, thanks to this bit:

    if (header.data_align != getpagesize()) {
        (*priv->log)(LOG_ERR, "invalid cache file `%s': created with alignment %u != %u",
          priv->filename, header.data_align, getpagesize());
        goto fail3;
    }

I'm surprised that you introduced a new flag for this. Do you have a case in mind where this would not be desirable?

No but I want people to be able to decide for themselves. Translation: if this change causes unexpected problems, it's their fault :)

Btw, I would have preferred for you to merge my pull request instead of creating your own commit based on it. That way, original authorship would have been preserved.

Sorry about that. My patch is somewhat different from yours so it wasn't practical.

Nikratio commented 2 years ago

Ack. But I think this still does not solve the potential issues when using transparent huge pages. So maybe copy that comment over from the original patch?

archiecobbs commented 2 years ago

But I think this still does not solve the potential issues when using transparent huge pages. So maybe copy that comment over from the original patch?

Can you explain the issue? I didn't understand your comment (because I don't know what a "transparent huge page" is).

If there is some incompatibility with them, why don't they mention that in the man page?

Nikratio commented 2 years ago

A huge page is a page that is larger than getpagesize(), i.e. several megabytes instead of the standard 4k.

The manpage is probably out of date and someone who knows how this works in the kernel should be updating it.

More background (also partially outdated, but still a good introduction): https://lwn.net/Articles/359158/

Nikratio commented 2 years ago

More recent and relevant: https://lwn.net/Articles/686690/

archiecobbs commented 2 years ago

Thanks for the links.

However, I was not able to infer any particular problems with THP's and posix_fadvise(2).

In any case, I added a vague comment in 697857c.