NVSL / linux-nova

NOVA is a log-structured file system designed for byte-addressable non-volatile memories, developed at the University of California, San Diego.
http://nvsl.ucsd.edu/index.php?path=projects/nova

Regarding recovery of NOVA after crash #72

Open rohankadekodi opened 5 years ago

rohankadekodi commented 5 years ago

This issue has been raised to check whether NOVA recovers correctly when it is mounted after a crash. The detailed sequence of steps used to check recovery is as follows:

  1. Create an empty NOVA file system on pmem0 (mount -t NOVA -o init /dev/pmem0 /mnt/pmem0)
  2. Take a snapshot of pmem0 (which should include the mkfs and mount data)
  3. crash
  4. Restore the snapshot on pmem0 device
  5. mount NOVA (not init, just mount) (mount -t NOVA /dev/pmem0 /mnt/pmem0)

Here, NOVA fails to mount at step 5. This should ideally work, because the snapshot taken at step 2 contains all the data written during the initialization of NOVA. So in step 5, after the snapshot has been copied back to pmem0, NOVA should see its initialization data and mount the file system.

The error in dmesg is:

[ 1208.605348] nova: nova_get_nvmm_info: dev pmem1, phys_addr 0x48000000, virt_addr ffffc90008000000, size 134217728
[ 1208.615478] nova: measure timing 0, metadata checksum 0, inplace update 0, wprotect 0, data checksum 0, data parity 0, DRAM checksum 0
[ 1208.630455] nova: Start NOVA snapshot cleaner thread.
[ 1208.635824] nova: Running snapshot cleaner thread
[ 1208.643671] nova: NOVA: Failure recovery
[ 1208.649243] nova: Recovered 0 snapshots, latest epoch ID 0
[ 1208.660627] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 1208.666171] IP: nova_traverse_inode_log.isra.10+0x29/0x100

NOVA is on kernel version 4.13 and on a pmem device of size 128MB.

Andiry commented 5 years ago

Thanks for reporting the issue. I tried to reproduce it but could not.

First, we have updated the master branch to 4.18, so please try the latest master branch.

Second, the steps are not very clear. I am not sure how you emulate a crash. Is there a tool to do that? In NOVA development, I make nova_try_normal_recovery() return false to emulate the crash.

Here are my reproduction steps based on the description:

mount -t NOVA -o init /dev/pmem0 /mnt/ramdisk/
touch /mnt/ramdisk/test1
echo 1 > /proc/fs/NOVA/pmem0/create_snapshot
umount /mnt/ramdisk
mount -t NOVA -o snapshot=0 /dev/pmem0 /mnt/ramdisk/
umount /mnt/ramdisk
mount -t NOVA /dev/pmem0 /mnt/ramdisk/

I don't see a crash. Please specify the exact commands you use to reproduce the issue.

Info in dmesg:

[ 330.976369] nova: nova_get_nvmm_info: dev pmem0, phys_addr 0x100000000, virt_addr 0xffff9b6d40000000, size 3221225472
[ 330.976371] nova: measure timing 0, metadata checksum 0, wprotect 0, data checksum 0, data parity 0, DRAM checksum 0
[ 330.976528] nova: Start NOVA snapshot cleaner thread.
[ 330.976547] nova: NOVA: Failure recovery
[ 330.976552] nova: Running snapshot cleaner thread
[ 330.976688] nova: Restore snapshot epoch ID 0
[ 330.976697] nova: Recovered 1 snapshots, latest epoch ID 0
[ 330.989939] nova: Failure recovery total recovered 2
[ 330.990410] nova: Current epoch id: 0

Andiry commented 5 years ago

Also, I tried to compile CrashMonkey, but it seems it does not work with 4.18 yet. Is there a simple way to emulate the crash and reproduce the bug?

vijay03 commented 5 years ago

Hi Andiry,

It seems there is a misunderstanding. I'll try to clarify, but my students will be able to provide more detail.

We are not saying a sequence of commands causes a kernel crash. We are saying once NOVA has been mounted, if there is a power loss, it does not seem to recover correctly.

Our sequence of steps to reproduce this (roughly):

What this emulates is that there is a power-loss crash after NOVA was mounted, and hence it didn't have a chance to cleanly unmount. From this state, it seems like NOVA isn't able to recover correctly.

To reproduce the reported bug, you don't need CrashMonkey at all.

Hope this helps!

Andiry commented 5 years ago

Thanks, Vijay, for the clarification. I have never tried using dd to emulate power loss before; I will try to reproduce with your steps. I think I got confused because when Rohan mentioned "taking a snapshot" he meant copying the device with dd, whereas I was thinking of the snapshot support in NOVA.

vijay03 commented 5 years ago

Yes, I realized it was ambiguous when I saw your response! Apologies for the delay -- everyone is traveling for winter break, or someone in my group would have responded sooner.

Let us know if you run into any problems reproducing it! I see the last commit to master is in Oct; we used the master branch, so our experiments should be reproducible on master.

jayashreemohan29 commented 5 years ago

Hi, Vijay’s description should help you reproduce the issue. To add to it, we had some issues running kernel 4.18 (the master branch): the pmem devices were not recognized on reboot, so we switched back to the earlier 4.13 kernel. Is there anything else we need to enable in menuconfig during compilation, in addition to CONFIG_X86_PMEM_LEGACY, CONFIG_FS_NOVA and all subitems under Device Drivers > NVDIMM?

Andiry commented 5 years ago

I tried on 4.18 but failed to reproduce it:

mount -t NOVA -o init /dev/pmem0 /mnt
dd if=/dev/pmem0 of=pmem0ss bs=1M
umount /mnt
dd if=pmem0ss of=/dev/pmem0 bs=1M
mount -t NOVA /dev/pmem0 /mnt

My colleague Juno tried on 4.13 and failed to reproduce as well. Is it 100% reproducible? Do I need to perform some file operations before running dd?

@jayashreemohan29 I have attached my 4.18 config. Please remove the .txt suffix.

config-4.18.0+.txt

stevenjswanson commented 5 years ago

The pmem problem has plagued me on and off. I just keep rebooting until they show up.

-steve

-- Composed on (and maybe dictated to) my phone.


rohankadekodi commented 5 years ago

Hi Andiry,

The steps that we followed which led us to the recovery problem were:

  1. mount -t NOVA -o init /dev/pmem0 /mnt
  2. dd if=/dev/pmem0 of=pmem0ss bs=1M count=128 (Our pmem0 partition size is 128MB)
  3. umount /mnt
  4. dd if=/dev/zero of=/dev/pmem0 bs=1M count=128 (The pmem0 device file is completely cleared)
  5. dd if=pmem0ss of=/dev/pmem0 bs=1M count=128
  6. mount -t NOVA /dev/pmem0 /mnt

I think you missed step 4. For us, it is 100% reproducible with these steps, on the 4.13 kernel.
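
The six steps above can be sketched as a small script. This is only an illustration: the function names are hypothetical, the defaults (/dev/pmem0, /mnt, pmem0ss, 128 MiB) are taken from the steps in this comment, and reproduce must be run as root on a kernel with NOVA built in. The dd helpers work on any file, so they can be tried on a scratch file first.

```shell
# Hypothetical helpers; the dd invocations mirror steps 2, 4 and 5 above.
# The size argument is the device size in MiB (128 in this report).

save_image()    { dd if="$1" of="$2" bs=1M count="$3"; }        # step 2
wipe_device()   { dd if=/dev/zero of="$1" bs=1M count="$2"; }   # step 4
restore_image() { dd if="$1" of="$2" bs=1M count="$3"; }        # step 5

reproduce() {
    dev=${1:-/dev/pmem0}
    mnt=${2:-/mnt}
    img=${3:-pmem0ss}
    size_mb=${4:-128}

    mount -t NOVA -o init "$dev" "$mnt" &&     # step 1: fresh file system
    save_image "$dev" "$img" "$size_mb" &&     # step 2: image taken while still mounted
    umount "$mnt" &&                           # step 3
    wipe_device "$dev" "$size_mb" &&           # step 4: clear the whole device
    restore_image "$img" "$dev" "$size_mb" &&  # step 5: device is back in its never-unmounted state
    mount -t NOVA "$dev" "$mnt"                # step 6: the mount that fails in this report
}
```

Calling reproduce with no arguments uses the defaults; the exit status of the final mount tells you whether recovery succeeded.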

juno-kim commented 5 years ago

Could you specify the commit you tested and share your .config file?

The steps still don't reproduce the bug even after adding step 4.

Andiry commented 5 years ago

Thanks Rohan. I tried your steps, including clearing the pmem0 device, on both 4.18 and 4.13, but failed to reproduce the bug.

Are you testing on a VM or a bare-metal machine? We have seen some weird issues when running NOVA on a VM.

Anyway, can you apply the patch attached, reproduce and post the dmesg? Thanks. test.patch.txt

vijay03 commented 5 years ago

Andiry, how big of a pmem partition are you using? Perhaps the bug is only exposed with small partitions? I think we are using the same kernel version, and same NOVA version. So I'm trying to narrow down what else could be different.

I think the bug is reproducible for us on both bare metal and a virtual machine, but I'll let @rohankadekodi confirm.

Andiry commented 5 years ago

Typically I am using 4GB, but I will try the small partitions.
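
Since partition size is one of the suspected differences, it may help for both sides to report the device size in the same unit. A minimal sketch, assuming util-linux's blockdev is installed; size_mib is a hypothetical helper name, and regular files such as the pmem0ss image are measured with wc instead.

```shell
# Print the size of a block device or regular file in MiB (rounded down).
size_mib() {
    target="$1"
    if [ -b "$target" ]; then
        # Block device: ask the kernel for its size in bytes.
        bytes=$(blockdev --getsize64 "$target") || return 1
    else
        # Regular file (for example a dd image): count its bytes.
        bytes=$(wc -c < "$target") || return 1
    fi
    echo $(( bytes / 1024 / 1024 ))
}
```

For example, size_mib /dev/pmem0 would be expected to print 128 on the reporters' setup, and the same call on the pmem0ss image should print the same number.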

rohankadekodi commented 5 years ago

Hi Andiry,

I just tried the same steps on bare metal and found that the bug is not reproducible there. So this appears to be a problem with NOVA running in a virtual machine. Could you try running the 6 steps mentioned here in a virtual machine with a 128MB pmem0?

I will apply the patch and post the dmesg of NOVA when running in a virtual machine.

Thanks, Rohan

Andiry commented 5 years ago

Thanks for confirming Rohan. I will try on VM.

Andiry commented 5 years ago

I tried Ubuntu 18.04.1 on a VM and still could not reproduce it. I tried both 4.13 and 4.18.

jayashreemohan29 commented 5 years ago

Hi Andiry, We just figured out that we are using the lightweight Arch-Linux distribution on our VM, and the issue only shows up in this one so far. We will get back to you if we are able to reproduce it on Ubuntu. Thank you for investigating this issue with us, we really appreciate it. And sorry for not having figured out the distribution earlier.

Thanks, Jayashree Mohan


Andiry commented 5 years ago

That's OK and thank you for the help. We always welcome people to try NOVA and report issues.

williewillus commented 5 years ago

Hi all. I was able to reproduce this in a very straightforward manner from the latest Ubuntu 18.04 install.

  1. Install Ubuntu Server 18.04 into a VirtualBox VM (QEMU reproduces this problem too, but I wanted to try a different hypervisor)
  2. Compile linux-nova (current master branch) with this config: config.txt
  3. Transfer the bzImage into the VM and boot from it with the kernel command line memmap=128M!1G
  4. Run the six steps from rohankadekodi's comment above
  5. NOVA fails to remount, dmesg log: out.txt
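
For context on step 3: in the kernel's memmap=nn[KMG]!ss[KMG] syntax, 128M!1G marks 128 MiB of RAM starting at the 1 GiB physical address as protected memory, which is what shows up as the emulated pmem device. A minimal sketch for checking that a VM was actually booted with such a region; pmem_memmap is a hypothetical helper name, and it reports only the first matching memmap= entry.

```shell
# Print size and offset of the first memmap=<size>!<offset> entry
# found in a kernel command line passed as the first argument.
pmem_memmap() {
    for arg in $1; do
        case "$arg" in
            memmap=*!*)
                spec=${arg#memmap=}      # e.g. 128M!1G
                echo "size=${spec%%!*} offset=${spec#*!}"
                return 0
                ;;
        esac
    done
    return 1                             # no emulated pmem region requested
}
```

Inside the VM it can be called as pmem_memmap "$(cat /proc/cmdline)"; a non-zero exit status means no memmap=...!... region was requested at boot.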