buildbarn / bb-storage

Storage daemon, capable of storing data for the Remote Execution protocol
Apache License 2.0
137 stars 91 forks source link

Fix crash when releasing all blocks #181

Closed moroten closed 10 months ago

moroten commented 10 months ago

If data corruption was found in the newest block, all blocks would be released and trigger a panic when no block was allocated for writing. See the stack trace below where lbm.newBlocks=0 and i=0 leads to lbm.allocationAttemptsRemaining = 1 << -1 which panics.

2023/11/08 14:33:20 rpc error: code = Internal desc = Releasing 1 blocks due to a data integrity error
panic: runtime error: negative shift amount
goroutine 15748 [running]:
github.com/buildbarn/bb-storage/pkg/blobstore/local.(*OldCurrentNewLocationBlobMap).startAllocatingFromBlock(...)
    pkg/blobstore/local/old_current_new_location_blob_map.go:259
github.com/buildbarn/bb-storage/pkg/blobstore/local.(*OldCurrentNewLocationBlobMap).findBlockWithSpace(0xc00025e140, 0x232)
    pkg/blobstore/local/old_current_new_location_blob_map.go:302 +0x66a
github.com/buildbarn/bb-storage/pkg/blobstore/local.(*OldCurrentNewLocationBlobMap).Put(0xc00025e140, 0x232)
    pkg/blobstore/local/old_current_new_location_blob_map.go:363 +0x27
EdSchouten commented 10 months ago

Thanks for the fix, @moroten. Really appreciated! The test case you provided is great.

With regards to the change to the implementation, I took a slightly different route. The problem with your change is that if the call to PushBack() on line 311 fails, we never end up calling startAllocatingFromBlock(). This leaves OldCurrentNewLocationBlobMap in an inconsistent state. We need to account for the case where OldCurrentNewLocationBlobMap needs to be left behind in a state where the underlying BlockList has zero blocks.

We already account for this in NewOldCurrentNewLocationBlobMap() by setting lbm.allocationBlockIndex = -1 and lbm.allocationAttemptsRemaining = 0. This effectively delays the call to startAllocatingFromBlock(0) to the very bottom of findBlockWithSpace(). I have factored this out into a new helper method resetAllocationBlockIndex(), which we now call in all places where we need to rewind.

moroten commented 10 months ago

Setting lbm.allocationAttemptsRemaining = 0 is a good trigger for the reset. Nice!