Handle corrupt buffer size during GCache recovery

TheGOro commented 2 years ago

4bb58377cf3ee02e4c69ce329c2e099b07c79368 did not solve the whole GCache recovery from a corrupted galera.cache file use-case, as the scan for discarding locked-in buffers part could experience the same issue, when the size is incorrect in the given buffer header.

ayurchen commented 2 years ago

@TheGOro Hello, corrupt buffers should be taken care of in RingBuffer::scan() and I don't see how they can leak to the recovery stage. Do you have any test that proves that this patch fixes some problem? Logical explanation would do ;)

TheGOro commented 2 years ago

@TheGOro Hello, corrupt buffers should be taken care of in RingBuffer::scan() and I don't see how they can leak to the recovery stage. Do you have any test that proves that this patch fixes some problem? Logical explanation would do ;)

The RingBuffer::scan() function encounters this very issue when it tries to insert an invalid memory address (resulted by the corrupt size field in the previous buffer header) as a pointer into the seqno2ptr dequeue, and it bails out gracefully due to the catch. RingBuffer::recover() continues and when scan() had managed to put at least one entry to seqno2ptr, then we could reach the "discard all the locked-in buffers" stage, where another iteration is done on the buffer headers starting from first_. Here we experience the segmentation fault because the size field is attempted to be accessed (in order to update the progress) while the buffer header pointer points to an invalid memory region.

Here is a debug session of one fault occurrence.

ayurchen commented 2 years ago

Thanks for reporting and bringing our attention to this issue. It was fixed in the most recent release, but in a different way, notably commits fbf9167c4a0e03df0284f278024a9d41d3633c4e, d91a37310de28979d3f9d95f3b58dbb5e7756bbe, d2baa5bce511bbd052b3f244b6754b48587396af

TheGOro commented 2 years ago

Thanks for reporting and bringing our attention to this issue. It was fixed in the most recent release, but in a different way, notably commits fbf9167, d91a373, d2baa5b

Thanks for the update! I already tested and verified these changes as part of the delivery that we got from our vendor.

codership / galera

Handle corrupt buffer size during GCache recovery #608