Champ-Goblem commented 2 years ago

Nydus support for bypassing guest cache with DAX

tldr: We would like to get nydus with DAX working to bypass the guest page cache and therefore deduplicate the memory that is being used by a machine, overall allowing us to spawn more workloads and use memory more efficiently.

I have been debugging an issue over the last week which has been plaguing our nodes for a few months. Workloads run with Kata report memory usage within the VM of the running workload as much less than compared to what the host machine reports for the workloads from outside of the VM.

From my debugging and investigation I think the main cause of the issue comes from the page cache of the guest operating system. The guest OS’s page cache differs from the host's because the guest’s can only be reclaimed by workloads running within the context of the guest VM, whereas the host’s can be reclaimed at any point when needed. This looks to be because the host machine is unable to differentiate what a guest VM is assigning the memory to, it has no understanding of how the guest OS is allocating the memory pages it requests from the host and thus is unable to free them back to the host when in a low memory condition.

I can prove this by running a write heavy workload in Kata and watching how the memory is allocated in both the guest and the host, monitoring what these allocations correspond to. In the particular case below I was running dd if=/dev/zero of=/file.dat bs=512 count=12582912 to write a 6GB file to the root filesystem of the container, backed by nydus (virtiofs). The container was also running inside a Cloud Hypervisor VM provisioned by Kata containers.

image(3)

The above graph shows the current shared memory usage of host machine as reported in /proc/meminfo, this relates to the memory allocated to cloud hypervisor for the VM’s RAM. Over the course of the write, the memory can be seen slowly increasing, which corresponds to the growth of page cache used within the guest:

image(7)

image(4)

This graph also illustrates the above memory usage of the guest VM in terms of process RSS gathered from /proc/<pid>/status

image(5)

image(6)

The two graphs above show how the host’s page cache is affected by the memory now in use by the VM. In the first graph we can see that the total page cache of the host machine grows in accordance to the workload within the VM writing to a file shared by virtiofs. When compared to the previous graphs detailing the guest’s memory usage, the host cache stops growing at around the 18:33 mark. The reason for this is illustrated in the bottom graph, the total memory usage of the node is essentially “full”, after 18:33 the host kernel starts to evict some of the host page cache to allow for the growing allocations required by the guest VM, in accordance to the guest’s growing page cache.

From the above setup we can see that the filesystem is being cached in two places, first in the host page cache and then again in the guest’s page cache. Due to the differences in how these allocations are perceived by the host, the guest’s page cache is essentially wasting memory because it can no longer be reclaimed by the host when needed.

This soon starts to add up when running 100s of Kata based pods, which is often amplified when running read/write intensive workloads like databases. This wastes memory that could otherwise be used for starting new workloads, and in some cases may also lead to a node becoming unresponsive due to a low memory condition. We often see quite a large difference in the memory usage of the host node when compared to the sum of the memory usage of the workloads within the VMs.

In order to combat this problem I was looking at the various caching options available to virtiofs/nydus. First we tried the CachePolicy=Never option by overriding settings passed to the passthrough FS setup in the nydus code. This doesnt have any affect on the guest cache, which we discovered after running some tests and reading more into what the flag controls. Since then I have been having a look at DAX, which is supposedly the solution to the problem of double caching.

According to this reply in an issue on the Kata Container’s github (https://github.com/kata-containers/kata-containers/pull/4648#issuecomment-1183861849) nydus should support dax. From my testing this doesnt seem to be the case, first off running Cloud Hypervisor on version 22 (later versions removed the dax option), the VM fails to boot and crashes nydus. I could see from the trace logs from nydus that a setup mapping request is received, but would be replied to with EINVAL causing the mapping request to fail. I managed to fix this by adjusting the protocol features set by nydus when initiating the vhost_user connection to the VMM. I added

VhostUserProtocolFeatures::SLAVE_SEND_FD

to https://github.com/dragonflyoss/image-service/blob/master/src/bin/nydusd/virtiofs.rs#L159

This matches the required features for Cloud Hypervisor to start the vhost_user manager instance to deal with setup mapping requests.

(https://github.com/cloud-hypervisor/cloud-hypervisor/blob/v22.0/virtio-devices/src/vhost_user/fs.rs#L400 where the protocol features are checked, controlling the creation of the master request handler)

(https://github.com/cloud-hypervisor/cloud-hypervisor/blob/v22.0/virtio-devices/src/vhost_user/fs.rs#L516 where the setup mapping handler is created later on)

(https://github.com/cloud-hypervisor/cloud-hypervisor/blob/v22.0/virtio-devices/src/vhost_user/fs.rs#L74 the vhost master handler for the mapping/unmapping requests)

Even though this change allows the workload to start running, and the virtiofs filesystem is mounted with the dax option. Most workloads I tested with this enabled, ended up failing at some point during its run time, for some reason or another. I have not had a chance yet to debug what might be causing the failures.

We would appreciate some thoughts on the DAX situation and whether or not nydus can support it and if there are any upsides or downsides to running virtiofs in DAX mode.

So far we have been impressed with the performance boost nydus has enabled and if we can get DAX support working as expected I think we will be very happy overall.

bergwolf commented 2 years ago

@Champ-Goblem Thanks for raising it! There are two host access modes provided by nydusd to Kata: rafs and passthroughfs. rafs is a container image acceleration file system while passthroughfs is an equivalance to virtiofsd. May I ask which part you want to enable DAX with? The passthroughfs should be easy to enable DAX support, while rafs needs more changes but it's do-able as well.

hsiangkao commented 2 years ago

Actually, we (Alibaba Cloud) have a new pending academic paper to evaluate virtiofs dax vs Nydus blob passthough(blobfs) + erofs dax. The performance number shows that the workload with Tar 70000+ small files (linux-5.10.87), raw virtiofs dax (30 sec) is much much slower than Nydus blob passthough + erofs dax (6 sec). Also Nydus blob pass though + erofs dax is already in production at Alibaba Cloud and it has great performance and memory saving. However, it still needs some step to upstream it to the mainline kernel. Another way to achieve this is to use on-demand Virtio-pmem + erofs dax, it has almost native performance as well.

Champ-Goblem commented 2 years ago

Thanks for the above answers!

@bergwolf Looks like we are currently using PassthroughFs as seen when checking the daemon info of a running nydusd instance:

"backend_collection":{"/containers":{"backend_type":"PassthroughFs","mountpoint":"/containers","mounted_time":"2022-10-23 00:22:09.405991151 +01:00","config":null}}}

@hsiangkao these options sound interesting, I was wondering if the Nydus blob pass though + erofs dax has a limit on the container image type? Are the performance enhancements this would bring limited to rafs formatted images, or will it also work with the standard OCI spec images as well without conversion?

To add to the above options, we would ideally like to have a solution that covers mounting persistent volumes (as well as container rootfs), as these mounts can contribute to cache usage for stateful workloads. Plus any performance improvements we could bring to persistent volumes would also be beneficial.

hsiangkao commented 2 years ago

@hsiangkao these options sound interesting, I was wondering if the Nydus blob pass though + erofs dax has a limit on the container image type? Are the performance enhancements this would bring limited to rafs formatted images, or will it also work with the standard OCI spec images as well without conversion?

Well, OCI images (targz) also need to be extracted to some localfs (like ext4), likewise, we could also make a light conversion to erofs type for guest or runC to use, that is what we're doing. In principle, such conversion is more lightweight to extract a targz image to the localfs.

bergwolf commented 2 years ago

@Champ-Goblem If image conversion is a non-starter, the passthrough mode is the way to go. For DAX support with the passthrough mode, we have it working with the builtin virtiofs with dragonball. So in theroy it is just about enabling the same feature with nydusd where the vhost-user protocol is the only addition. I would suggest that we test it with QEMU first to make sure that there aren't virtio spec intepretion ceavats and then switch the vmm to cloud-hypervisor.

As for the downside of running DAX, as @hsiangkao mentioned, small files may have a performance penaty due to the extra setupmapping calls. It is possible to mitigate it a bit by enabling the per-file DAX feature so that only large files are shared via DAX.

Champ-Goblem commented 2 years ago

@hsiangkao converting OCI spec images to work with EROFS sounds interesting. Regarding the comment that is what we're doing in your previous answer, do you mean nydus already has this capability?

@bergwolf I think passthroughFS would be a good starting point for now as this would also support persistent volumes as well without having to alter lots in Kata and would also work with regular OCI images.

By the way I tested Kata 3.1 this morning with dragonball + DAX, which showed the guest page cache bypass was working as expected. Based on this, what is required to get DAX in nydus for passthroughFS implemented?

As a side note which is probably out of scope for this issue (happy to make a ticket in the relevant place if applicable):

We have a set of filesystem performance tests that we often use to test different settings and versions of VMMs and Kata, which is comprised of 3 fio commands and 3 different container image builds. When we ran the benchmarks in dragonball this morning, we noticed that build performance/stability seems to be degraded when compared to CLH/Qemu, 2/3 of the builds failed due to socket timeouts which I dont think was related to the backend server, the only build that succeeded took over 18 minutes vs 1.5 mins for CLH or Qemu. (if we are able to sort the performance with dragonball that would also be a valid option in the short term)

hsiangkao commented 2 years ago

@hsiangkao converting OCI spec images to work with EROFS sounds interesting. Regarding the comment that is what we're doing in your previous answer, do you mean nydus already has this capability?

I haven't followed Nydus userspace code anymore for months, hopefully @zyfjeff @liubogithub @jiangliu could give more hints about this.

bergwolf commented 2 years ago

@Champ-Goblem

By the way I tested Kata 3.1 this morning with dragonball + DAX, which showed the guest page cache bypass was working as expected. Based on this, what is required to get DAX in nydus for passthroughFS implemented?

The only addition is the vhost-user-fs protocol. I would suggest upgrading the vhost crate in nydus first to include the vhost-user-backend there (instead of the old stand-alone one) to have the latest working code and see what we can get there.

We have a set of filesystem performance tests that we often use to test different settings and versions of VMMs and Kata, which is comprised of 3 fio commands and 3 different container image builds. When we ran the benchmarks in dragonball this morning, we noticed that build performance/stability seems to be degraded when compared to CLH/Qemu, 2/3 of the builds failed due to socket timeouts which I dont think was related to the backend server, the only build that succeeded took over 18 minutes vs 1.5 mins for CLH or Qemu.

Any chance you can share the fio commands and/or the container images? It makes sense to create an issue in kata-containers repo to track it down there. We certainly don't expect to see such performance/stability issues with dragonball.

studychao commented 2 years ago

Hi @Champ-Goblem please help to share some simple instructions to recreate the test so that we could follow up with the issue:)

Champ-Goblem commented 2 years ago

The only addition is the vhost-user-fs protocol. I would suggest upgrading the vhost crate in nydus first to include the vhost-user-backend there (instead of the old stand-alone one) to have the latest working code and see what we can get there.

I have had a look at upgrading the vhost crate on nydus but even though vhost-user-backend was moved into the same repository as the vhost crate I still think its a separate crate for now (although I could be wrong there). I did also try to upgrade these crates in nydus but it seems to be limited by the versions that fuse-backend-rs is using, although there looks to be a PR in the works for upgrading the fuse-backend-rs crate dependencies. I have very limited knowledge on implementing dax and the changes that nydus would require so I would appreciate either someone who has more info on the dax/nydus setup to take a look or alternatively I am more than happy to hop on a call to discuss live.

Regarding dragonball I will create an issue in Kata with references to the tests we ran and how to reproduce. Ill send over a link to that when its created.

Champ-Goblem commented 2 years ago

I’ve been doing some more digging into DAX and nydus over the last day and I have a couple of questions regarding a potential problem I found that I hope you might be able to share some insight on.

So currently the status with DAX + Qemu is that the vhost crate doesnt have support for the newer VhostUserFSSlaveMsg schema, so I decided to continue with looking at CLH and DAX as CLH should be using the same crate version and thus the same schema.

When running nydus with DAX in CLH it is very close to working, currently it boots and allows the virtiofs filesystem to be mounted with DAX. Simple read and writes to the filesystem work as expected but it fails when doing more complicated tasks such as git clones.

I managed to pinpoint the problem to a virtio message that nydus receives:

[2022-10-26 16:11:38.236757 +01:00] INFO [/home/administrator/Work/fuse-backend-rs/src/transport/virtiofs/mod.rs:86] fuse: from_descriptor_chain len 36 addr 548086480896 flags 1
[2022-10-26 16:11:38.236805 +01:00] INFO [/home/administrator/Work/fuse-backend-rs/src/transport/virtiofs/mod.rs:88] fuse: from_descriptor_chain last addr 2147483647 num regions 1
[2022-10-26 12:52:17.023344 +01:00] ERROR [error/src/error.rs:21] Error:
    InvalidDescriptorChain(FindMemoryRegion)
    at src/bin/nydusd/daemon.rs:140
    note: enable `RUST_BACKTRACE=1` env to display a backtrace

from_descriptor_chain len 36 addr 548086480896 flags 1 this line in the above output details the current descriptor in the virtio descriptor chain being parsed. The address listed in the above message translates to hex 0x7f9c800000 which incidentally corresponds to the address of the region allocated for the dax cache:

"_virtio-pci-_fs1": {
      "id": "_virtio-pci-_fs1",
      "resources": [
        {
          "MmioAddressRange": {
            "base": 548085432320,
            "size": 524288
          }
        },
        {
          "MmioAddressRange": {
            "base": 548086480896, <--- cache
            "size": 1073741824
          }
        }
      ],
      "parent": null,
      "children": [
        "_fs1"
      ],
      "pci_bdf": "0000:00:04.0"
    }

It points more specifically at the only file that has been mapped into this region so far for reading/writing.

Is this an expected entry in the descriptor table? I thought that the descriptor table should only point to entries in the virtio ring and that the address referred to by the descriptor should point to some fuse/virtiofs message rather than a base address of a newly mmaped file?

My other question is, which part of the stack controls writing into the ring created by virtiofs, is it the virtiofs kernel module in the guest or the actual VMM that writes into the virtio queues?

For reference the error message is thrown here in the nydus code https://github.com/dragonflyoss/image-service/blob/v2.1.0-rc.3.1/src/bin/nydusd/virtiofs.rs#L78

Any help would be appreciated in getting to the bottom of this.

liubogithub commented 2 years ago

@Champ-Goblem By looking at the https://github.com/dragonflyoss/image-service/blob/v2.1.0-rc.3.1/src/bin/nydusd/virtiofs.rs#L78, I think the workload such as git clone is using mmap and dio, which causes the above panic.

fd1 = open("/mnt/virtiofs/foo");
fd2 = open("/mnt/virtiofs/bar");

addr = mmap(fd1...);
ret = write(fd2, addr...);

With dax enabled, the above addr refers to some address of virtiofs dax range, not a guest memory address.
For now nydus, as a vhost-user-fs backend, only shares the guest memory, and it doesn't know dax memory yet.

But it should be not difficult to fix as clh has gained that VHOST_USER_SLAVE_FS_IO capability.

I think dragonball and its inline-virtiofs(a virtio device mode instead of a vhost-user device mode) has fixed the above issue.

Champ-Goblem commented 2 years ago

@liubogithub Thank you for the detailed message! Last week I spent some time POCing VHOST_USER_SLAVE_FS_IO in the fuse-backend-rs/vhost crates and managed to get this somewhat working.

I would be interested to know if you had any ideas on how to debug the following problem. Since implementing the changes, I found that Cloud Hypervisor when mapping a file into the DAX region would sometimes pass and sometimes fail. When this fails it causes the KVM_RUN ioctl to fail with -EFAULT. The log for the failed KVM_RUN:

cloud-hypervisor: 13.739285612s: <_fs1> INFO:virtio-devices/src/vhost_user/fs.rs:82 -- fs_slave_map
cloud-hypervisor: 13.739303236s: <_fs1> INFO:virtio-devices/src/vhost_user/fs.rs:65 -- fs: is_req_valid
cloud-hypervisor: 13.739310276s: <_fs1> INFO:virtio-devices/src/vhost_user/fs.rs:99 -- fs_slave_map: addr 139754071408640 flags MAP_R len 2097152 fd_offset 0 cache_offset 0 mmap_cache_addr 139754071408640
cloud-hypervisor: 13.739333094s: <vcpu0> ERROR:vmm/src/cpu.rs:979 -- VCPU generated error: VcpuRun(Failed to run vcpu: VCPU error regs Ok(kvm_regs { rax: 140536991698302, rbx: 18446683600576527944, rcx: 3454, rdx: 3454, rsi: 18446613230156513280, rdi: 140536991694848, rsp: 18446683600576527304, rbp: 0, r8: 18446683600576527960, r9: 0, r10: 3454, r11: 1024, r12: 0, r13: 3454, r14: 18446683600576527944, r15: 0, rip: 18446744071586207022, rflags: 328194 }) debug_regs Ok(kvm_debugregs { db: [0, 0, 0, 0], dr6: 4294905840, dr7: 1024, flags: 0, reserved: [0, 0, 0, 0, 0, 0, 0, 0, 0] }) events Ok(kvm_vcpu_events { exception: kvm_vcpu_events__bindgen_ty_1 { injected: 0, nr: 13, has_error_code: 1, pending: 0, error_code: 0 }, interrupt: kvm_vcpu_events__bindgen_ty_2 { injected: 0, nr: 0, soft: 0, shadow: 0 }, nmi: kvm_vcpu_events__bindgen_ty_3 { injected: 0, pending: 0, masked: 0, pad: 0 }, sipi_vector: 0, flags: 13, smi: kvm_vcpu_events__bindgen_ty_4 { smm: 0, pending: 0, smm_inside_nmi: 0, latched_init: 0 }, reserved: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], exception_has_payload: 0, exception_payload: 0 }) err Error(14)

I spent some time tracing this with BPF to try and identify where the error was being generated from, and it seems that the EFAULT error is returned during direct_page_fault -> handle_abnormal_pfn -> kvm_handle_bad_page. The kvm_handle_bad_page is called when the pfn of the page fault is masked to an error, but the reason for it being masked I was unable to figure out.

I have attempted to compare the CLH code versus the Dragonball code but they both seem pretty similar when setting up the DAX window, so I'm not too sure what could be causing this error. I understand that this is not the easiest to explain in detail over a message, but I would appreciated any thoughts you may have on figuring out possible situations that could cause an error like this to be generated in the current context.

dragonflyoss / nydus

Nydus support for bypassing guest cache with DAX #810

Nydus support for bypassing guest cache with DAX