Do "nova error" messages indicate a bug in NOVA?

hayley-leblanc commented 2 years ago

Hi,

I'll go into some details below, but my main question is: is it ever expected that NOVA will produce an error via nova_err during normal execution/after a crash and successful recovery, or is such an error indicative of a bug? In experimenting with crash-testing NOVA I've been seeing some of these errors`, but the operations producing those errors do seem to complete successfully.

Specifically, I'm running a program that performs the following operations in NOVA with metadata_csum turned on:

mkdir A
creat A/foo
write 4096 bytes to A/foo
truncate A/foo to 0 bytes
close A/foo

I'm injecting crashes in the truncate and looking at various possible crash states that can arise. truncate calls nova_notify_change, which performs some updates on the affected inode and then updates the inode's checksum. I've found that when I inject a crash after the non-checksum fields have begun to be updated but before the checksum is updated, and I mount the resulting crash state, deleting A/foo produces the following logs but otherwise succeeds.

[ 4679.211793] nova: nova_find_free_slot ERROR: 18846 - 18846 already in free list
[ 4679.211798] nova: nova_free_blocks: find free slot fail: -22
[ 4679.211806] nova error: 
[ 4679.211810] Inode 35: free 1 data block from 18846 to 18846 failed!
[ 4679.212245] nova: Pi 35: log head 0x499f000, tail 0x499f068
[ 4679.212255] nova: file write entry @ 0x499f000: epoch 0, trans 0, pgoff 0, pages 1, blocknr 18846, reassigned 0, updating 0, invalid count 0, size 4096, mtime 1626547093

I noticed this because on the version of Ubuntu I'm running NOVA with, the default log levels show the nova error message to the user on the terminal. Deleting A/foo DOES succeed (returns a success code and the file is actually deleted) but the error message being displayed to the user seems undesirable.

Thanks!

Andiry commented 2 years ago

That does sound like a bug. NOVA tries to free some blocks which are already marked as free. Tracking the allocation/free of the ranges should help to locate the bug.

hayley-leblanc commented 2 years ago

I just got around to taking a look at this issue. If my understanding of the free list management in NOVA's recovery is correct, I think the problem looks something like this at a high level:

The truncate operation removes some blocks from foo (I didn't look into how this is done in detail) and some fields of the primary inode are updated to reflect this, but the checksum is not
We crash
On remount, NOVA reconstructs the free list based on the current version of foo's inode, which reflects the deallocation/freeing of foo's blocks
Accessing foo later triggers verification of its checksum, which fails, so we repair it using the replica. The replica wasn't updated with the truncate operation prior to the crash, so this also restores the primary inode to a state prior to the truncate

At this point, the primary inode will claim that foo is using some blocks that are also in the free list.

I have a workaround, but I don't think it's a great fix - I think it's slow and does more work than necessary. In nova_failure_recovery(), right after the call to nova_failure_recovery_crawl(), I added a loop that goes over all of the inodes in the system and calls nova_iget() and iput() on each one, which will repair any inodes with checksum issues. Performing the inode repair at this point seems to fix the issue, since it ensures that all of the inodes have been repaired before we build the free list. Does this make sense as a cause/solution?

Andiry commented 2 years ago

Actually not need to explicitly call nova_iget and iput(). Instead, before going to recover the inode pages, we should first check the checksum and recover the inode, if metadata_csum is enabled. There is a FIXME at line 1331 of bbuild.c, which should be the place to add the check.

hayley-leblanc commented 2 years ago

That makes sense - my workaround felt pretty janky :) Unfortunately I'm having trouble getting that solution to work. It seems as though running the checksum verification and inode repair in failure_thread_func() repairs the inodes too late, so the free list is still wrong - is it possible that the recovery threads won't run that code until after the free list has been constructed? I messed around with trying to ensure that the recovery threads finish up earlier, but didn't have any luck. I'll work on putting together some steps to reproduce this bug outside of my testing infrastructure so you can take a look on your end.

NVSL / linux-nova

Do "nova error" messages indicate a bug in NOVA? #110