Open hayley-leblanc opened 3 years ago
Can you elaborate on the repro steps a little? I mounted NOVA with metadata_csum, data_csum, and data_parity and ran test2.cpp, and there was no error.
I'm working with NOVA on a QEMU/KVM virtual machine with 1 core running Ubuntu 20.04. I wonder if it's an artifact of the specific setup I'm using. I'll mess around with the setup and configurations a bit and get back to you.
Here are more details about my setup and the steps to reproduce:
Setup: QEMU/KVM virtual machine running Ubuntu 20.04 and Linux kernel 5.1.0+ with KASAN. I initially ran into the issue on a VM with 8GB RAM, 128MB emulated persistent memory, and 1 core. I've also tried it with 4GB of persistent memory and 4 cores and still run into the same issue. I'm building NOVA as a loadable module and loading it manually using insmod.
Steps to reproduce: I experimented a bit and found that both issues can be reproduced with only data_csum.
insmod path/to/nova/nova.ko data_csum=1
mount -t NOVA -o init /dev/pmem0 /mnt/pmem
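For readers who do not have test2.cpp, a test of the same general shape (write a buffer with pwrite(), then read it back with pread() and compare) can be sketched as follows. This is not the program attached to the issue: the helper name write_then_read, the fill pattern, and the file size are all assumptions made for illustration.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not taken from test2.cpp): write `size` bytes at
// offset 0 with pwrite(), read the whole range back with pread(), and compare.
// Returns true only if both calls transfer the full buffer and the data matches.
bool write_then_read(const std::string& path, size_t size) {
    int fd = open(path.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        std::perror("open");
        return false;
    }

    std::vector<char> out(size), in(size);
    for (size_t i = 0; i < size; ++i)
        out[i] = static_cast<char>('a' + i % 26);  // recognizable pattern

    bool ok = true;
    ssize_t written = pwrite(fd, out.data(), size, 0);
    if (written != static_cast<ssize_t>(size)) {
        std::fprintf(stderr, "pwrite returned %zd\n", written);
        ok = false;
    }

    if (ok) {
        // Reading the entire file means the read also covers its last stripe.
        ssize_t got = pread(fd, in.data(), size, 0);
        if (got != static_cast<ssize_t>(size)) {
            std::fprintf(stderr, "pread returned %zd\n", got);
            ok = false;
        }
    }

    if (ok && std::memcmp(out.data(), in.data(), size) != 0) {
        std::fprintf(stderr, "data mismatch\n");
        ok = false;
    }

    close(fd);
    return ok;
}
```

Calling write_then_read("/mnt/pmem/testfile", 8192) against the mount point above would exercise the same write-then-read pattern; the file name and the 8192-byte size are placeholders, not values from the original test.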
I dug around in the code a bit and the second issue (the one causing pread() to fail for me) seems to occur in nova_verify_data_csum() when it's called during the pread(). The problem only arises when the pread() is close to the size of the whole file; it seems to specifically be a problem with reading the very last stripe. I modified the for loop on line 758 of checksum.c to only iterate over strps-1 stripes and that allows pread() to complete successfully, but then it's not actually verifying everything. The KASAN bug appears to be hit on line 572 of checksum.c, in nova_update_stripe_csum(), but I don't know exactly why and I don't have a workaround for that one.
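To make the "last stripe" observation concrete, here is a small sketch of how a byte range maps onto fixed-size stripes, and why only reads that reach the end of the file touch the final one. It is not NOVA's code: the helper names are hypothetical, and the stripe size is passed in as a parameter because the value NOVA uses is not stated in this thread.

```cpp
#include <cstddef>

// First stripe index and number of stripes covered by a byte range.
struct StripeRange {
    std::size_t first;
    std::size_t count;
};

// Hypothetical helper: which stripes does [offset, offset + len) touch?
StripeRange stripes_touched(std::size_t offset, std::size_t len,
                            std::size_t stripe_size) {
    if (len == 0 || stripe_size == 0) return {0, 0};
    std::size_t first = offset / stripe_size;
    std::size_t last = (offset + len - 1) / stripe_size;  // stripe of last byte
    return {first, last - first + 1};
}

// True if the read reaches the final stripe of a file of `file_size` bytes.
bool touches_last_stripe(std::size_t offset, std::size_t len,
                         std::size_t file_size, std::size_t stripe_size) {
    StripeRange r = stripes_touched(offset, len, stripe_size);
    if (r.count == 0 || file_size == 0) return false;
    std::size_t last_file_stripe = (file_size - 1) / stripe_size;
    return r.first + r.count - 1 >= last_file_stripe;
}
```

With an assumed 512-byte stripe and a 4096-byte file, a read of the first 3584 bytes stops at stripe 6, while a read of 3585 bytes or more reaches stripe 7, the last one. That matches the observation that only reads close to the full file size fail.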
I have tried again and cannot reproduce this on a bare-metal machine. I suspect it is related to the VM; I have seen people report issues on VMs before. I think some memory-related kernel flags need to be disabled for NOVA to run on a VM, but I can't recall which ones. They may be mentioned in other issues. Can you try on a bare-metal machine to see if it is reproducible?
I tried it on a bare-metal machine and was not able to replicate the issue there. I compared the execution on bare-metal vs. the VM and found that the issue comes from line 174 of parity.c: https://github.com/NVSL/linux-nova/blob/b817ca322e6fc61f532174e7effc4b6c81528e3f/fs/nova/parity.c#L174
unroll_csum is set to 0 on my VM and 1 on my bare-metal machine. This led to the following if statement evaluating to true on the VM but not on bare metal:
https://github.com/NVSL/linux-nova/blob/b817ca322e6fc61f532174e7effc4b6c81528e3f/fs/nova/parity.c#L248-L250
The two problems described in this issue only occur when the body of that if statement runs, which is why they don't show up on bare metal.
The root of the problem on the VM seems to be that by default, the virtualized CPU does not have the X86_FEATURE_XMM4_2 feature flag that is used to set unroll_csum. I was able to fix this on my VM by adding -cpu host to the qemu command that starts the VM.
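One way to confirm from inside the guest whether the CPU exposes this feature is to look for sse4_2 on the flags line of /proc/cpuinfo, which is the name the kernel prints for X86_FEATURE_XMM4_2. The sketch below does that from user space; the function names are hypothetical and this is only a diagnostic aid, not part of NOVA.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole word on a "flags" line of
// /proc/cpuinfo-style text. Taking the text as a parameter keeps the
// parsing testable without depending on the machine it runs on.
bool cpuinfo_has_flag(const std::string& cpuinfo, const std::string& flag) {
    std::istringstream lines(cpuinfo);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, 5, "flags") != 0) continue;  // only "flags" lines
        std::string::size_type colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream words(line.substr(colon + 1));
        std::string word;
        while (words >> word) {
            if (word == flag) return true;
        }
    }
    return false;
}

// Reads the real /proc/cpuinfo (Linux only). Returns false if the file
// cannot be read or the flag is absent.
bool cpu_has_sse4_2() {
    std::ifstream file("/proc/cpuinfo");
    std::stringstream contents;
    contents << file.rdbuf();
    return cpuinfo_has_flag(contents.str(), "sse4_2");
}
```

If cpu_has_sse4_2() returns false inside the guest and true on the host, the guest is using a CPU model that hides the feature, which is consistent with unroll_csum being 0 on the VM.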
This still seems like a bug to me, since it does appear that NOVA is intended to support CPUs without this feature.
Hi Andiry,
I'm messing around with NOVA's metadata_csum, data_csum, and data parity and I'm running into two issues (in regular execution, not crash consistency). I'm mounting a fresh instance of NOVA at /mnt/pmem and running a program test2.zip that performs the following operations:
pwrite()
pread()
This program runs into two issues.
There is a KASAN bug that occurs during the pwrite() with the following output:
The pread() fails and prints out a big error: