canonical / dqlite

Embeddable, replicated and fault-tolerant SQL engine.
https://dqlite.io
Other
3.87k stars 216 forks source link

Potential linearalizability issues? #258

Closed mt-caret closed 4 years ago

mt-caret commented 4 years ago

Hi, I've started to work on implementing a Jepsen test for dqlite. I wanted to start out small, so I built the go-dqlite demo application and used it as a key-value store; as soon as I tried to concurrently read and write to the five-node dqlite cluster, I get linearizability violations like the following:

2020-08-19_15:14:50

The vertical axis is time (down is forward in time) and the horizontal axis corresponds to the multiple threads of access. The boxes are reads and writes, with the height corresponding to how long each operation took and the color corresponding to whether the operation succeeded or failed (blue is success and pink is error).

The top row of errors are non-problematic (Database is locked or not found errors), but notice the bottom left three boxes in blue. After successfully writing a value of 3, Jepsen immediately reads the value 0 two times. My understanding was that dqlite essentially threads all SQLite operations through Raft which suggests that the go-dqlite demo application should offer linearizable reads and writes wrt key-value pairs; this seems to indicate otherwise.

freeekanayaka commented 4 years ago

First of all, thanks for the report! Timing is actually perfect because we're about to start to work on Jepsen tests ourselves.

The consistency model of dqlite is virtually the same one as SQLite with WAL journal mode. And that's basically single writer and possibly multiple concurrent readers with snapshot isolation. That's different form the typical linearizable model of a straight key-value store replicated via raft (e.g. etcd). Basically there's a bit of mapping from SQL-like consistency (snapshot isolation) to raft-like consistency.

With dqlite's model, the first two of the three bottom left boxes could be explained with a history like this:

I'm not totally sure about the third box: if the same thread of execution serially writes a value and then reads it back it should definitely see its write. I suspect the test is actually making something which is not strictly a serial write and then a serial read, and so it falls into a history similar to the one I described above. I'll have to grab your test and look at it in detail.

I also need to document dqlite's consistency model because it's definitely not obvious, and one might assume that it's similar to a plain linearizable key/value store, which is not.

I'm also aware of a number of corner cases of the dqlite implementation that need to be addressed and that will make more "aggressive" jepsen tests to show issues (and actually the third box anomaly above might be explained by that, remains to be seen). I'll work in the next weeks and months to address those.

mt-caret commented 4 years ago

Timing is actually perfect because we're about to start to work on Jepsen tests ourselves.

That's great to hear! Please let me know if there's anything that I can do to help.

By the way, here are the files (traces, timelines, etc.) for the Jepsen run in question: latest.zip

freeekanayaka commented 4 years ago

Timing is actually perfect because we're about to start to work on Jepsen tests ourselves.

That's great to hear! Please let me know if there's anything that I can do to help.

You're already helping, I'm totally new to Jepsen, so I think I'll use your code to get started playing with it. I'm not NixOS user, but I should figure out how to deploy that on Ubuntu :)

freeekanayaka commented 4 years ago

@mt-caret after a little bit of digging it turns out that there's a small error in the code of this jepsen test:

1 file changed, 1 insertion(+), 1 deletion(-)
src/jepsen/dqlite.clj | 2 +-

modified   src/jepsen/dqlite.clj
@@ -71,7 +71,7 @@

 (defn run-write
   [conn key value]
-  (let [result (str (:body (http/put (str conn "/" value)
+  (let [result (str (:body (http/put (str conn "/" key)
                             {:body (str value)
                             :socket-timeout 5000
                             :connection-timeout 5000 })))]

So the key being written was not really then intended one. With that fix in place the only linearization errors seem to be the expected ones that I described, i.e. since the model is snapshot isolation concurrent reads of "stale" values are possible.

I'll run the test more times to confirm no other anomaly shows up.

freeekanayaka commented 4 years ago

After more analysis and thinking I'll amend a few things from my very first comment:

With dqlite's model, the first two of the three bottom left boxes could be explained with a history like this:

* The Go client in thread 1 starts a read transaction, and hence "enters" a database snapshot where the value of the key is `0`

* The Go client in thread 0 starts and completes a write transaction, updating the value of the key to `3`

* The Go client in thread 1 completes the read and sees the value `0` which is the one contained in the snapshot of the database taken at the time the read transaction started and does **not** see the value `3` which was written **after** the read transaction started

Here I was misunderstanding the meaning of the boxes in the picture. I thought those where just completion points, but they actually visualize the entire duration of an operation (the upper edge of the box is the start time of the operation and the bottom edge of the box is the completion time).

So the history I described above doesn't match the one depicted by the boxes. The actual explanation for those boxes lies in the bug in the test code that I reported.

Also, it turns out that the history I described is not really a stale read and it is is compatible with a linarizable model. That's because in a linearizable model if a read starts before a write, it can legally return either the value before that write or the value after that write. Only if the read starts after a write, then it must return the value of that write or a later value.

This means that the key-value store implemented by the dqlite demo actually is linearizable, and indeed all runs of the jepsen test that I did succeed.

The anomalies I initially thought I had seen were actually a second issue in the test code (or at least in the way I ran it), for which the database was not wiped between tests:

1 file changed, 1 insertion(+)
src/jepsen/dqlite.clj | 1 +

modified   src/jepsen/dqlite.clj
@@ -31,6 +31,7 @@
   (reify db/DB
     (setup! [_ test node]
       (info node "installing dqlite" version)
+      (c/su (c/exec "rm" "-rf" rundir))
       (c/su (c/exec "mkdir" "-p" rundir))
       (c/su (c/exec "chown" "dqlite" rundir))
       (c/su (c/exec "systemctl" "start" "dqlite"))

I also need to document dqlite's consistency model because it's definitely not obvious, and one might assume that it's similar to a plain linearizable key/value store, which is not.

Thinking about this a bit more, I realized that the dqlite's consistency model is actually strict serializable, which implies both linearizable and serializable. It's serializable because it's transactional: every operation is a SQL transaction and can span all objects in the database, and transactions can't interleave (there can be only one writer at a time and commits are atomic). It's linearizable because there are real time bounds: once a transaction is committed it immediately becomes visible to all transactions started after the commit (there can't be stale reads, in that sense).

I'm going to close this issue, although I'll continue to put in place more jepsen tests to ensure that the implementation is actually strict serializable as it's supposed to be, even in case of faults. I'm sure there will be plenty of bugs to deal with as soon as we start injecting failures :)