apache / bookkeeper

Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
https://bookkeeper.apache.org/
Apache License 2.0

Wrong ReadLastAddConfirmed logic that can lead to data loss in client applications #2612


RaulGracia commented 3 years ago

BUG REPORT

Describe the bug

We have found a bug when invoking ReadLastAddConfirmed that may lead the Bookkeeper client to return a wrong last entry, or no entries at all, when querying the Bookies. This behavior goes against what is stated in the Javadoc, as it does not return the maximum entry id for a ledger within an ensemble: https://github.com/apache/bookkeeper/blob/67a02db73d62188fc8a143bd9a37038ae770e90a/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/api/ReadHandle.java#L116

Obtains asynchronously the last confirmed write from a quorum of bookies. This call obtains the last add confirmed each bookie has received for this ledger and returns the maximum.

Applications that rely on this method to return the actual last valid entry in a ledger may wrongly believe that a ledger is empty and discard it (e.g., rollover), thus losing data.
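For illustration, this is roughly the pattern such an application follows (a minimal sketch, not our actual code; bk, ledgerId, and password stand in for the application's state):

import org.apache.bookkeeper.client.api.BookKeeper;
import org.apache.bookkeeper.client.api.ReadHandle;

class RolloverSketch {
    // Illustrative only: decides whether a ledger can be discarded based on
    // what readLastAddConfirmed() reports.
    static boolean ledgerLooksEmpty(BookKeeper bk, long ledgerId, byte[] password)
            throws Exception {
        try (ReadHandle reader = bk.newOpenLedgerOp()
                .withLedgerId(ledgerId)
                .withPassword(password)
                .withRecovery(true)
                .execute()
                .get()) {
            // With the bug described in this issue, a single wiped bookie can make
            // this return -1 even though the other bookies still hold entries,
            // and the application then discards a non-empty ledger.
            return reader.readLastAddConfirmed() == -1;
        }
    }
}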

This is a real reproduction of the problem. Let's consider an ensemble with 3 Bookies (pravega-bk-bookie-0, pravega-bk-bookie-1, and pravega-bk-bookie-2 in the logs) where ensembleSize=writeQuorum=ackQuorum=3. After storing some data in the system, we induce a destructive restart in pravega-bk-bookie-2, meaning that its data is wiped out (but its identity is still the same). Then, our application tries to recover its internal state by reading the data stored in Bookkeeper, and the problem manifests as follows:

  1. Our application tries to read from ledger 5 to recover its internal state after the failure induced in pravega-bk-bookie-2:

    2021-02-17 12:29:47,741 2251502 [Thread-10239] INFO  i.p.s.server.logs.DurableLog - DurableLog[1]: Starting.
  2. To this end, the first thing it does is to ask the Bookkeeper client for the last entry of ledger 5 to start its internal recovery process with the data stored in Bookkeeper. As you can see, the metadata is correctly fetched from Zookeeper for ledger 5:

    2021-02-17 12:29:47,750 2251511 [Thread-10234-EventThread] DEBUG o.a.b.client.MetadataUpdateLoop - UpdateLoop(ledgerId=5,loopId=6425f4c2) success
  3. We observe parallel reads issued to the 3 Bookies in the ensemble:

    2021-02-17 12:29:47,752 2251513 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG o.a.b.proto.PerChannelBookieClient - Got Read response from bookie:pravega-bk-bookie-2.pravega-bk-bookie-headless.default.svc.cluster.local:3181 rc:ENOENTRY, ledger:5:entry:-1:entryLength:0
    2021-02-17 12:29:47,754 2251515 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG o.a.b.proto.PerChannelBookieClient - Got Read response from bookie:pravega-bk-bookie-1.pravega-bk-bookie-headless.default.svc.cluster.local:3181 rc:EOK, ledger:5:entry:-1:entryLength:139
    2021-02-17 12:29:47,754 2251515 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG o.a.b.proto.PerChannelBookieClient - Got Read response from bookie:pravega-bk-bookie-0.pravega-bk-bookie-headless.default.svc.cluster.local:3181 rc:EOK, ledger:5:entry:-1:entryLength:139

    As you can see, the non-restarted Bookies are returning the right info about the last entry (rc:EOK, ledger:5:entry:-1:entryLength:139), whereas the restarted Bookie is returning that it has no such entry (rc:ENOENTRY, ledger:5:entry:-1:entryLength:0). This scenario is expected, as we induced a destructive restart in pravega-bk-bookie-2.

  4. Unfortunately, instead of waiting for the responses from all the Bookies and getting the highest available entry in ledger 5, the Bookkeeper client returns after considering only the response of the restarted Bookie. This is visible in the log message we observe:

    2021-02-17 12:29:47,752 2251513 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG o.a.b.client.ReadLastConfirmedOp - Read Complete with enough validResponses for ledger: 5, entry: -1

    Which corresponds to this line: https://github.com/apache/bookkeeper/blob/67a02db73d62188fc8a143bd9a37038ae770e90a/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java#L144

  5. From this point onwards, the client provides an error response to the client application, reporting that there is no such entry in ledger 5 and that the last available entry is 0 (according to the restarted Bookie):

    2021-02-17 12:29:47,757 2251518 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG o.a.bookkeeper.client.PendingReadOp - No such entry found on bookie.  L5 E1 bookie: pravega-bk-bookie-2.pravega-bk-bookie-headless.default.svc.cluster.local:3181

  6. Then, the logic of the client application closes the ledger and performs a rollover, thus discarding the actual data stored in the other 2 Bookies:

    2021-02-17 12:29:47,762 2251523 [BookKeeperClientWorker-OrderedExecutor-1-0] DEBUG o.a.bookkeeper.client.PendingAddOp - Submit callback (lid:5, eid: 0). rc:0
    2021-02-17 12:29:47,762 2251523 [BookKeeperClientWorker-OrderedExecutor-1-0] INFO  o.a.b.client.ReadOnlyLedgerHandle - Closing recovered ledger 5 at entry 0
    ...
    2021-02-17 12:29:47,767 2251528 [Thread-10234-EventThread] DEBUG o.a.b.client.MetadataUpdateLoop - UpdateLoop(ledgerId=5,loopId=5aab3780) success
    2021-02-17 12:29:47,768 2251529 [core-7] INFO  i.p.s.s.impl.bookkeeper.Ledgers - Log[1]: Fenced out Ledger Id = 5, Sequence = 1, Status = Unknown.

This leads the application to suffer data loss sporadically, depending mainly on which Bookie's response wins the race among reads when executing ReadLastConfirmedOp:readEntryComplete().

The actual root cause of the problem is incorrect logic in RoundRobinDistributionSchedule.RRQuorumCoverageSet:checkCovered(), used inside ReadLastConfirmedOp:readEntryComplete(). The current logic of this function allows it to return true when even a single Bookie in the ensemble responds that it does not contain such an entry or ledger (error codes BKException.Code.NoSuchEntryException or BKException.Code.NoSuchLedgerExistsException), irrespective of whether other Bookies in the ensemble contain the right data or not. The bug is a combination of the wrong logic in RoundRobinDistributionSchedule.RRQuorumCoverageSet:checkCovered() plus the condition in ReadLastConfirmedOp:readEntryComplete() that uses that function to return the last entry to the client: https://github.com/apache/bookkeeper/blob/67a02db73d62188fc8a143bd9a37038ae770e90a/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java#L140 Note that this may lead applications to believe that a ledger is empty when it is actually not, thus causing potential data loss if applications decide to discard that ledger.
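For reference, the current coverage logic looks roughly like this (reconstructed here for illustration; the same code appears as context lines in the diff proposed later in this thread):

// Approximate current logic of RRQuorumCoverageSet#checkCovered().
public synchronized boolean checkCovered() {
    // Check whether every write quorum has fewer than ackQuorum "unknown" nodes.
    for (int i = 0; i < ensembleSize; i++) {
        /* Nodes which have either responded with an error other than
           NoSuch{Entry,Ledger}, or have not responded at all. */
        int nodesUnknown = 0;
        for (int j = 0; j < writeQuorumSize; j++) {
            int nodeIndex = (i + j) % ensembleSize;
            // NoSuchEntry/NoSuchLedger are treated as *known* responses here,
            // so a single wiped bookie's answer counts toward coverage.
            if (covered[nodeIndex] != BKException.Code.OK
                && covered[nodeIndex] != BKException.Code.NoSuchEntryException
                && covered[nodeIndex] != BKException.Code.NoSuchLedgerExistsException) {
                nodesUnknown++;
            }
        }
        if (nodesUnknown >= ackQuorumSize) {
            return false;
        }
    }
    // With ES=WQ=AQ=3, one NoSuchEntry plus two still-pending responses gives
    // nodesUnknown == 2 < 3, so this returns true prematurely.
    return true;
}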

Important: In fact, this bug was found and apparently fixed in this commit: https://github.com/apache/bookkeeper/commit/5e399df67c2aa1e5f228c62ba8533ca3293ab147. After having a look at the new logic introduced, it seems to fix the situation that is leading to data loss in our testing scenario. However, for some reason, this change was (mostly) reverted in this commit: https://github.com/apache/bookkeeper/commit/f373cb5976ebca41ae036a26ba69f1c341897c08. Therefore, master, branch-4.13, branch-4.12, and branch-4.11 seem to contain this bug and expose client applications to this potential data loss.

To Reproduce

We do not have a unit test yet to reproduce the problem in a standalone manner, but we normally get this error when executing distributed system tests in our project (https://github.com/pravega/pravega). The concrete test that induces the failure is this one: it creates a 3 Bookie ensemble, writes some data, performs a destructive restart on one Bookie, and then lets the system recover by reading the data from Bookkeeper. We exercise this scenario frequently and we get data loss events in around 20% of the cases.

Expected behavior

The reads should wait for the responses of more Bookies, as it was proposed in commit: https://github.com/apache/bookkeeper/commit/5e399df67c2aa1e5f228c62ba8533ca3293ab147.

Screenshots

n/a.

Additional context

Consider bringing back the logic in https://github.com/apache/bookkeeper/commit/5e399df67c2aa1e5f228c62ba8533ca3293ab147 to the impacted branches.

fpj commented 3 years ago

This test case shows the issue:

@Test
public void testCoverageSetsOnNoEntry() {
    RoundRobinDistributionSchedule schedule = new RoundRobinDistributionSchedule(
            3, 3, 3);
    DistributionSchedule.QuorumCoverageSet covSet = schedule.getCoverageSet();
    covSet.addBookie(1, BKException.Code.NoSuchEntryException);
    assertFalse(covSet.checkCovered());
}
eolivelli commented 3 years ago

I wonder how it is possible to run the bookie without data but preserving the bookie identity in production. That said, it may happen if you have journalSyncData=false or a faulty filesystem (losing writes to the journal).

I agree that there is something wrong in checkCovered: the code returns true when there is no reason to return false, but we must also check that we have a good reason to return true (enough successful/non-unknown responses?).

One problem is that NoSuchEntry may happen in this case of unpredictable Bookie behaviour, but it may also be legitimate, in the case of a healthy bookie that never received the entry (AQ < WQ).
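A minimal sketch of that legitimate case (assuming the constructor argument order writeQuorumSize, ackQuorumSize, ensembleSize, as in the snippets elsewhere in this thread):

// With WQ=3 but AQ=2, an entry can be acknowledged to the writer even though
// one healthy bookie never received it; that bookie's NoSuchEntry is then
// legitimate and must not be mistaken for data loss.
RoundRobinDistributionSchedule schedule =
        new RoundRobinDistributionSchedule(3, 2, 3); // WQ=3, AQ=2, ES=3
DistributionSchedule.QuorumCoverageSet covSet = schedule.getCoverageSet();
covSet.addBookie(0, BKException.Code.OK);
covSet.addBookie(1, BKException.Code.OK);
covSet.addBookie(2, BKException.Code.NoSuchEntryException); // healthy, entry never acked by it
// Completing here with the LAC from the two OK responses is the desired outcome.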

I would not like to fall into this issue again; it was fixed by f373cb5. https://github.com/apache/bookkeeper/issues/2273

fpj commented 3 years ago

@eolivelli thanks for having a look. The problem we observed is with the behavior of ReadLastConfirmedOp. If you check this if block:

if (heardValidResponse
        && coverageSet.checkCovered()
        && !completed) {
    completed = true;
    if (LOG.isDebugEnabled()) {
        LOG.debug("Read Complete with enough validResponses for ledger: {}, entry: {}",
                ledgerId, entryId);
    }

    cb.readLastConfirmedDataComplete(BKException.Code.OK, maxRecoveredData);
    return;
}

https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedOp.java#L139

as soon as it receives a no-entry response, the predicate of this if block evaluates to true. heardValidResponse is true and seems fine. completed is false and it also seems fine. The problem is that checkCovered is evaluating to true, even though it shouldn't.

eolivelli commented 3 years ago

Totally agreed, @fpj, and your unit test is self-explanatory about the issue.

dlg99 commented 3 years ago
  1. Looks like the bookie recovery process after a restart should be extended. IIRC it processes the journal to replay entries that didn't make it to the entry log. It probably should also check ledger metadata against local data to detect data loss and go into a recovery state to read the missing data from other bookies (or quarantine itself somehow and let the autoreplicator deal with this). In this case the bookie could return something like rc:MISSINGDATA instead of rc:ENOENTRY for incomplete ledgers.

  2. alternatively, in addition to

        if (heardValidResponse
            && coverageSet.checkCovered()
            && !completed) {

    we can add a check for && (maxRecoveredData.getLastAddConfirmed() != LedgerHandle.INVALID_ENTRY_ID || numResponsesPending == 0); see the sketch below. Going with numResponsesPending == 0 alone will increase request latency.

The problem, as I understood it, is that maxRecoveredData didn't get updated and the read LAC op still decided not to wait for other responses.
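Put together, the guarded completion in ReadLastConfirmedOp might look like this (a sketch only, untested; it reuses the fields visible in the snippets above):

// Sketch of the suggestion: do not complete with only "no entry" answers
// unless every bookie has responded.
if (heardValidResponse
        && coverageSet.checkCovered()
        && (maxRecoveredData.getLastAddConfirmed() != LedgerHandle.INVALID_ENTRY_ID
                || numResponsesPending == 0)
        && !completed) {
    completed = true;
    cb.readLastConfirmedDataComplete(BKException.Code.OK, maxRecoveredData);
    return;
}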

fpj commented 3 years ago

Thanks for chiming in, @dlg99. The call to read LAC is being used by the recovery procedure to determine which entry id to start from. As in the description of this issue, if there is a bookie b that lost its data and responds with LedgerHandle.INVALID_ENTRY_ID, then returning anything different from that value will make the recovery procedure skip a bunch of entries that b doesn't have. That could make entries silently under-replicated. I'm wondering if we should return 0 in this case and force the recovery of all entries.

At the same time, some applications might be using read LAC in a different way. Instead of using it to determine where to read from, they might be using it to determine up to which point they can read. In this case, returning 0 is not correct because there are more entries that they should be able to read.

Detecting the data loss is desirable as BK assumes that bookies are not going to lose data silently. I'm not sure what would be a good way to achieve it, though. Does a bookie scan all ledger metadata upon starting to see if it finds its id? That sounds expensive depending on the size of the system, although I don't expect it to be a frequent operation.

I'm also not entirely clear on the expected behavior of checkCovered. I started having a look at the existing test cases, and I need more clarity: the behavior sounds obvious from the name, but the results I'm seeing do not match my understanding.

dlg99 commented 3 years ago

@fpj as I understand, checkCovered() makes sure that all "spreads" of entries got covered with AQ responses. You need to consider ES > WQ >= AQ here. Everything is simple when ES = AQ: any single response is considered "authoritative" (assuming no single-bookie data loss scenario). Otherwise, for ES = 5, WQ = 2, AQ = 2, the entries are placed on bookies as follows (round-robin schedule):

b1  b2  b3  b4  b5
------------------
e1  e1
    e2  e2
        e3  e3
            e4  e4
e5              e5

if only b1 does not respond, all spreads are covered: nodesUnknown < ackQuorumSize. If b1 and b2 didn't respond, nodesUnknown >= ackQuorumSize and the coverage set is not covered. Non-responding b1 and b4 should still result in a covered set, as I understand, since they don't participate together in any single "spread" (I didn't try to confirm my understanding with unit tests).
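That last claim could be verified with something like this (a sketch, not actually run; constructor argument order assumed to be writeQuorumSize, ackQuorumSize, ensembleSize):

// ES=5, WQ=2, AQ=2: b1 (index 0) and b4 (index 3) never respond.
RoundRobinDistributionSchedule schedule =
        new RoundRobinDistributionSchedule(2, 2, 5);
DistributionSchedule.QuorumCoverageSet covSet = schedule.getCoverageSet();
covSet.addBookie(1, BKException.Code.OK); // b2
covSet.addBookie(2, BKException.Code.OK); // b3
covSet.addBookie(4, BKException.Code.OK); // b5
// No write quorum contains both silent bookies, so coverage should hold.
assertTrue(covSet.checkCovered());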

All this is done to reduce request latency in case there is a slow bookie.

That code assumes that a bookie either has a consistent dataset (recovered from the journal in case of a crash, or missing only some tail entries) or is not responding.

Now we are getting into the territory of how to handle the case when a bookie is up but some or all of its data is missing. My original suggestion, to add a check that we updated maxRecoveredData at least one time, is not the best one as I think about it more. Changing checkCovered() to treat NoSuchEntryException as a "node unknown" case makes more sense (as in: remove the line && covered[nodeIndex] != BKException.Code.NoSuchEntryException).

Another question is what to do with NoSuchLedgerExistsException: either handle it similarly to NoSuchEntryException or keep it. I don't remember all the details of the ledger existence check; I think the bookie checks the index. Did you keep the index files but delete the log & journal files? If deleting both results in NoSuchLedgerExistsException, I'd remove it from checkCovered as well.

Then, of course, what if there is a legitimate case where a ledger was created, no entries were ever written, and we are trying to read the LAC? All bookies will return something like NoSuchEntryException, and the read LAC op will return an error, which changes the current behavior.

dlg99 commented 3 years ago

@fpj Maybe something like this works:

diff --git a/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RoundRobinDistributionSchedule.java b/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RoundRobinDistributionSchedule.java
index f092839f3..25e3bc980 100644
--- a/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RoundRobinDistributionSchedule.java
+++ b/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/RoundRobinDistributionSchedule.java
@@ -375,24 +375,32 @@ public class RoundRobinDistributionSchedule implements DistributionSchedule {
         public synchronized boolean checkCovered() {
             // now check if there are any write quorums, with |ackQuorum| nodes available
             for (int i = 0; i < ensembleSize; i++) {
-                /* Nodes which have either responded with an error other than NoSuch{Entry,Ledger},
+                /* Nodes which have either responded with an error,
                    or have not responded at all. We cannot know if these nodes ever accepted a entry. */
-                int nodesUnknown = 0;
+                int nodesNoOk = 0;
+
+                /* Nodes which either have no entry or ledger, which can be potentially attributed to
+                   a data loss on a single bookie */
+                int noEntryNodes = 0;

                 for (int j = 0; j < writeQuorumSize; j++) {
                     int nodeIndex = (i + j) % ensembleSize;
-                    if (covered[nodeIndex] != BKException.Code.OK
-                        && covered[nodeIndex] != BKException.Code.NoSuchEntryException
-                        && covered[nodeIndex] != BKException.Code.NoSuchLedgerExistsException) {
-                        nodesUnknown++;
+                    if (covered[nodeIndex] != BKException.Code.OK) {
+                        nodesNoOk++;
+                    }
+                    if (covered[nodeIndex] == BKException.Code.NoSuchEntryException
+                        || covered[nodeIndex] == BKException.Code.NoSuchLedgerExistsException) {
+                        noEntryNodes++;
                     }
                 }

-                /* If nodesUnknown is greater than the ack quorum size, then
-                   it is possible those two unknown nodes accepted an entry which
-                   we do not know about */
-                if (nodesUnknown >= ackQuorumSize) {
-                    return false;
+                /* If nodesNoOk is greater than the ack quorum size, then
+                   it is possible that other nodes accepted an entry which
+                   we do not know about.
+                   It is also possible that noEntryNodes subset of these nodes
+                   agrees that entry does not exist. */
+                if (nodesNoOk >= ackQuorumSize) {
+                    return noEntryNodes >= ackQuorumSize;
                 }
             }
             return true;

This is just to illustrate the idea. I didn't even run the tests to check what it breaks (ReadLastConfirmedOpTest? ledger recovery tests?). Also, it needs new unit tests, which is something I didn't have time for; otherwise I'd just send the PR.
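For what it's worth, the legitimate empty-ledger case mentioned in the previous comment seems to stay covered under this version; a quick sketch (untested):

// Freshly created ledger: every bookie legitimately answers NoSuchEntry.
RoundRobinDistributionSchedule schedule =
        new RoundRobinDistributionSchedule(3, 3, 3);
DistributionSchedule.QuorumCoverageSet covSet = schedule.getCoverageSet();
covSet.addBookie(0, BKException.Code.NoSuchEntryException);
covSet.addBookie(1, BKException.Code.NoSuchEntryException);
covSet.addBookie(2, BKException.Code.NoSuchEntryException);
// nodesNoOk == 3 >= AQ, but noEntryNodes == 3 >= AQ as well, so
// checkCovered() returns true and the op completes with LAC = -1.
assertTrue(covSet.checkCovered());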

fpj commented 3 years ago

Looking at it more closely and checking your responses, @dlg99, I see that checkCovered is checking whether it received valid responses from a subset S of bookies such that S overlaps with every ack quorum. A valid response is not only an OK response, but also NoSuchEntry and NoSuchLedger. It might be worth double-checking that there aren't other valid responses we are missing, but otherwise it sounds right to me.

In the case described in this issue, there was data loss, so the assumption that a valid response from at least one member in each ack quorum is sufficient is violated, and consequently the outcome is incorrect. I don't think the BK protocol covers such cases. In fact, if instead of receiving a no-such-entry response the client receives a valid LAC, then the recovery procedure would have started the recovery from that LAC and the entries [0, LAC - 1] would have been silently under-replicated.

But I do think that the implementation of LedgerHandle#readLastConfirmed doesn't quite match the contract. The javadoc says:

* Obtains asynchronously the last confirmed write from a quorum of bookies. 

and it is not really returning the maximum over a quorum. I'm thinking that we should consider fixing the logic in ReadLastConfirmedOp.

Vanlightly commented 3 years ago

@ivankelly and I have discussed this general scenario recently, where a bookie is wiped clean and brought back with the same identity. This introduces arbitrary failures (bookies giving wrong answers), which the current logic is not designed to handle. The current logic assumes any explicit positive or negative response to be trustworthy and acts accordingly.

If this scenario of a wiped-clean bookie can occur, then we need to update the logic to either handle arbitrary failures (tricky to do well) or prevent arbitrary failures from happening. The approach Ivan has been working on is the latter, and it is related to making running without the WAL more resilient.

The idea is that you can configure bookies to run a pre-boot check where they identify all open ledgers and mark them as being in a limbo state, because there is a risk that the bookie could have received writes that it has now lost. The bookie performs a recovery and close on each such ledger, then clears the limbo status. While a ledger is in this limbo status, reads of any type are treated differently: instead of answering NoSuchEntry/Ledger, the bookie responds with EUNKNOWN.

The result is that arbitrary failures are avoided. When a client receives an EUNKNOWN, it is not treated as a valid response that could cause this data loss scenario. Limbo is a temporary status that is cleared as soon as the bookie is able to complete ledger recovery. This work is largely complete already.

I think it's worth considering the general approach to arbitrary failures: whether the protocol should be designed to work correctly in the face of them, or whether we should put in safeguards to prevent arbitrary failures from occurring. Avoidance seems to be the best route. This is mostly Ivan's work, but I fully agree with his conclusions.

Vanlightly commented 3 years ago

I've documented the approach in more detail and created a new TLA+ specification that reproduces the data loss scenario (with say E=3,WQ=3,AQ=3) and also shows that the use of the limbo status avoids it (by setting the constant UseLimboStatus to TRUE). https://github.com/Vanlightly/bookkeeper-tlaplus/blob/main/LIMBO_README.md

Vanlightly commented 3 years ago

Also, is there not a cookie check that aborts the boot-up if the bookie starts up with its disk state wiped? In what circumstance could we desire that a wiped bookie is able to resume with the same identity?

ivankelly commented 3 years ago

I haven't read the whole thread, but I think I get the gist. As @Vanlightly says, this isn't a case that bookkeeper is currently designed for.

@Test
public void testCoverageSetsOnNoEntry() {
    RoundRobinDistributionSchedule schedule = new RoundRobinDistributionSchedule(
            3, 3, 3);
    DistributionSchedule.QuorumCoverageSet covSet = schedule.getCoverageSet();
    covSet.addBookie(1, BKException.Code.NoSuchEntryException);
    assertFalse(covSet.checkCovered());
}

In this test case, the ledger has an ensemble of 3, a write quorum of 3 and an ack quorum of 3, i.e. no entry can have been acknowledged to a writer unless it has been written to all 3 nodes. So if a node responds with "No, I don't have it", it means that the write was never acknowledged.

BookKeeper was never designed to tolerate bookies arbitrarily losing data. That's why we have the journal and why we have cookies. The assumption is that if a bookie acknowledges a write, it's not going to turn around and go "nope, never happened". Expecting otherwise is like expecting single-decree paxos to still work without stable storage.

However, as @Vanlightly also says, we do have a solution for this (and it doesn't involve fundamentally changing the protocol). The core problem here is a bookie going back on its word: it saw an entry, but now it says it didn't. Technically, this is an arbitrary or byzantine failure. The solution is to turn these arbitrary failures into omission failures. In practice, this works by detecting possible data loss on boot (unclean shutdown or missing cookies), figuring out which entries we may have had (via metadata), and, if a read is requested for any of these entries, responding with an error that isn't "NoSuchEntry/Ledger". From the client PoV, the response should be the same as if the bookie never responded at all.

Changing anything in the client is not the solution.
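A minimal sketch of that read-path idea (all names here are hypothetical; this sketches the approach, not the actual implementation):

import io.netty.buffer.ByteBuf;
import java.util.Set;

// Hypothetical sketch: turn a possible arbitrary failure into an omission
// failure by refusing to answer "NoSuchEntry" for ledgers flagged as limbo.
class LimboAwareReadPath {
    interface EntryStore {
        ByteBuf getEntry(long ledgerId, long entryId); // null if absent (hypothetical)
    }

    static class UnknownDataException extends Exception {} // maps to EUNKNOWN
    static class NoEntryException extends Exception {}     // maps to ENOENTRY

    private final Set<Long> limboLedgers; // populated by the pre-boot check
    private final EntryStore store;

    LimboAwareReadPath(Set<Long> limboLedgers, EntryStore store) {
        this.limboLedgers = limboLedgers;
        this.store = store;
    }

    ByteBuf readEntry(long ledgerId, long entryId)
            throws UnknownDataException, NoEntryException {
        if (limboLedgers.contains(ledgerId)) {
            // We may have lost writes for this ledger, so never claim
            // "no such entry"; the client must treat this bookie as if
            // it had not responded at all.
            throw new UnknownDataException();
        }
        ByteBuf entry = store.getEntry(ledgerId, entryId);
        if (entry == null) {
            // Data is known to be intact; a definitive negative answer is safe.
            throw new NoEntryException();
        }
        return entry;
    }
}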

ivankelly commented 3 years ago

btw, we're in the planning phase of getting this stuff out to OSS. I started a couple of months ago but got buried by other work. Some other folks are picking it up now though.