cloudius-systems / osv

OSv, a new operating system for the cloud.
osv.io

Very inefficient block cache #1318

Open wkozaczuk opened 5 months ago

wkozaczuk commented 5 months ago

As Jan Braunwarth eloquently explains in his bachelor thesis, the OSv block cache is very inefficient:

"OSv also has a cache that should increase I/O performance, but it is very inefficient and, as you can see in Figure 4.8, does not lead to an increase but rather a dramatic drop. If you look at how the block cache works, it quickly becomes clear why this is. Each I/O is initially divided by the cache into 512 byte blocks. Then, when a read request is made, each block is checked to see whether it is already in the cache and, if so, copied directly from there to the target address. Since the RAM can answer the request much faster, this administrative effort is worth it. The problem is what happens when the block is not yet in the cache.

For example, if an application wants to read a 1 MiB file that is not yet in the cache, the request is divided into 2048 I/Os of 512 bytes each. These 2048 requests are then all processed sequentially and also copied from the block cache to the target address. The measured IOPS are therefore significantly lower than the number of SQEs processed by the NVMe device."
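To make the quoted description concrete, here is a deliberately simplified, self-contained sketch of the access pattern it describes. The names (cached_read, read_block_from_device) and data structures are illustrative only and are not the actual OSv block cache code; the point is that a single large read becomes one synchronous 512-byte device round trip per block:

#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <unordered_map>

static constexpr size_t BLOCK_SIZE = 512;   // cache granularity described above

// Hypothetical stand-ins for the cache and the device, for illustration only
static std::unordered_map<uint64_t, std::array<char, BLOCK_SIZE>> cache;
static size_t device_reads = 0;

static void read_block_from_device(uint64_t blkno, char* dst)
{
    (void)blkno;                      // the real code would transfer this block from the disk
    ++device_reads;                   // one synchronous device round trip per 512-byte block
    std::memset(dst, 0, BLOCK_SIZE);  // placeholder for the real data
}

// Read 'len' bytes at 'offset': the request is split into 512-byte blocks, each block
// is looked up in the cache and, on a miss, fetched sequentially before being copied out
static void cached_read(uint64_t offset, char* dst, size_t len)
{
    for (size_t done = 0; done < len; done += BLOCK_SIZE) {
        uint64_t blkno = (offset + done) / BLOCK_SIZE;
        auto it = cache.find(blkno);
        if (it == cache.end()) {
            std::array<char, BLOCK_SIZE> block;
            read_block_from_device(blkno, block.data());
            it = cache.emplace(blkno, block).first;
        }
        std::memcpy(dst + done, it->second.data(), std::min(len - done, BLOCK_SIZE));
    }
}

int main()
{
    static char buf[1024 * 1024];
    cached_read(0, buf, sizeof(buf));                          // a single 1 MiB read...
    std::printf("device reads issued: %zu\n", device_reads);   // ...turns into 2048 sequential reads
}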

It must be noted, however, that most applications will not be affected by this, as they go through the VFS layer: the filesystem drivers in OSv (ZFS, RoFS, and recently EXT4) bypass the block cache and call devops->strategy() directly.

To reproduce this problem, one can use the fio app set up to read from the disk device directly (which bypasses the file system):

/fio --name=fiotest --filename=/dev/nvme1n1 --size 10Mb --rw=read ....
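For contrast, running the same workload against a regular file on the mounted filesystem goes through the VFS/filesystem path and therefore avoids the block cache. The file path below and the exact set of fio flags available are assumptions about the particular fio build and image layout, not taken from the original report:

/fio --name=rawdev --filename=/dev/nvme1n1 --size=10M --rw=read
/fio --name=viafs --filename=/data/testfile --size=10M --rw=read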

There are at least two options to fix this moderately important issue. One possible approach is sketched in the draft patch below:

diff --git a/drivers/virtio-blk.cc b/drivers/virtio-blk.cc
index 48750a01..4f7676e9 100644
--- a/drivers/virtio-blk.cc
+++ b/drivers/virtio-blk.cc
@@ -49,6 +49,9 @@ TRACEPOINT(trace_virtio_blk_req_err, "bio=%p, sector=%lu, len=%lu, type=%x", str
 using namespace memory;

+int
+bdev_direct_read_write(struct device *dev, struct uio *uio, int ioflags);
+
 namespace virtio {

 int blk::_instance = 0;
@@ -71,7 +74,8 @@ blk_strategy(struct bio *bio)
 static int
 blk_read(struct device *dev, struct uio *uio, int ioflags)
 {
-    return bdev_read(dev, uio, ioflags);
+    return bdev_direct_read_write(dev, uio, ioflags);
 }

 static int
@@ -82,6 +86,7 @@ blk_write(struct device *dev, struct uio *uio, int ioflags)
     if (prv->drv->is_readonly()) return EROFS;

-    return bdev_write(dev, uio, ioflags);
+    return bdev_direct_read_write(dev, uio, ioflags);
 }

 static struct devops blk_devops {
diff --git a/fs/vfs/kern_physio.cc b/fs/vfs/kern_physio.cc
index c7c99c72..80c22ccc 100644
--- a/fs/vfs/kern_physio.cc
+++ b/fs/vfs/kern_physio.cc
@@ -138,3 +138,53 @@ void multiplex_strategy(struct bio *bio)
        len -= req_size;
    }
 }
+
+int
+bdev_direct_read_write(struct device *dev, struct uio *uio, int ioflags)
+{
+    u8 opcode;
+    switch (uio->uio_rw) {
+    case UIO_READ:
+        opcode = BIO_READ;
+        break;
+    case UIO_WRITE:
+        opcode = BIO_WRITE;
+        break;
+    default:
+        return EINVAL;
+    }
+
+    // Parent bio that only collects completions of the per-iovec bios;
+    // allocated after the opcode check so an invalid request does not leak it
+    bio* complete_io = alloc_bio();
+    refcount_init(&complete_io->bio_refcnt, uio->uio_iovcnt);
+
+    // Issue one bio per iovec directly to the driver, bypassing the block cache
+    while (uio->uio_iovcnt > 0) {
+        bio* bio = alloc_bio();
+        bio->bio_cmd = opcode;
+        bio->bio_dev = dev;
+
+        bio->bio_bcount = uio->uio_iov->iov_len;
+        bio->bio_data = uio->uio_iov->iov_base;
+        bio->bio_offset = uio->uio_offset;
+
+        bio->bio_caller1 = complete_io;
+        bio->bio_private = complete_io->bio_private;
+        bio->bio_done = multiplex_bio_done;
+
+        dev->driver->devops->strategy(bio);
+
+        uio->uio_offset += uio->uio_iov->iov_len;
+        uio->uio_resid -= uio->uio_iov->iov_len;
+        uio->uio_iov++;
+        uio->uio_iovcnt--;
+    }
+    assert(uio->uio_resid == 0);
+
+    // Wait until all per-iovec bios have completed, then release the parent
+    int ret = bio_wait(complete_io);
+    destroy_bio(complete_io);
+
+    return ret;
+}
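For reference, the draft above follows the same pattern as the existing multiplex_strategy()/multiplex_bio_done() code in kern_physio.cc: each iovec of the uio becomes one bio handed straight to the driver's strategy routine, the parent complete_io bio only counts completions through its refcount, and bio_wait() should return once the last sub-bio completes (assuming multiplex_bio_done() calls biodone() on the parent when the refcount drops to zero, as its use in multiplex_strategy() suggests). The 512-byte block-cache path taken by bdev_read()/bdev_write() is skipped entirely, so a 1 MiB read is issued as one request per iovec instead of 2048 sequential 512-byte ones.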
nyh commented 5 months ago

As Jan Braunwarth eloquently explains in his bachelor thesis, the OSv block cache is very inefficient:

Interesting, can you please post here a link to this bachelor thesis?

wkozaczuk commented 4 months ago

Let me send it to you! It is in German, but you can easily Google-translate it.