canonical / dqlite

Embeddable, replicated and fault-tolerant SQL engine.
https://dqlite.io

Write operations that immediately follow write operations sometimes cause a disk I/O-error, followed by loss of leadership and high latency #522

Open fbrandherm opened 1 year ago

fbrandherm commented 1 year ago

I am using dqlite (version 1.14) for an internal project and observed some unexpected behavior in my benchmarks (on localhost). If I issue write operations back to back without pausing (INSERT OR REPLACE INTO kv_table (KEY, VALUE) VALUES (?,?);, using request type 8 of the wire protocol), there are random latency spikes (see picture) that do not appear if I wait 1 ms between requests. These outlier requests return SQLite's "disk I/O error", and retrying the request returns "not leader" for some time. I suspect that this bug triggers a leader election. The files are on a ramdisk, and I cannot reproduce the bug if the files are on an SSD, so the bug is probably timing-related.

Regarding the plot (image: dqlite-io-erros): blue dots are 100 write operations on node 1, and red dots are 100 read operations on node 2 (the first red dot is a leadership transfer to node 2). The cluster had 3 voting nodes.
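
For anyone trying to reproduce this, the benchmark loop described above could look roughly like the sketch below. It is not the original code: `WriteFn` is a hypothetical stand-in for whatever client call sends the INSERT OR REPLACE statement, and the `WriteResult` categories only mirror the errors named in this report, not dqlite's own error codes. Passing a pause of zero corresponds to the failing case, and 1 ms to the case where the spikes did not appear.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Outcome of one write attempt. Hypothetical categories that mirror the
// errors named in the report; they are not dqlite's own error codes.
enum class WriteResult { Ok, DiskIoError, NotLeader };

struct Sample {
    WriteResult result;
    std::chrono::microseconds latency;
};

// Hypothetical stand-in for the client call that sends one
// INSERT OR REPLACE INTO kv_table (KEY, VALUE) VALUES (?,?) request
// and reports how it ended.
using WriteFn =
    std::function<WriteResult(const std::string& key, const std::string& value)>;

// Issues `count` writes one after another, sleeping `pause` between them
// (zero means no pause), and records latency and outcome per request.
std::vector<Sample> run_write_benchmark(const WriteFn& write, int count,
                                        std::chrono::milliseconds pause) {
    std::vector<Sample> samples;
    samples.reserve(count);
    for (int i = 0; i < count; ++i) {
        auto start = std::chrono::steady_clock::now();
        WriteResult r = write("key" + std::to_string(i), "value");
        auto end = std::chrono::steady_clock::now();
        samples.push_back(
            {r, std::chrono::duration_cast<std::chrono::microseconds>(end - start)});
        if (pause.count() > 0) {
            std::this_thread::sleep_for(pause);
        }
    }
    return samples;
}

// Counts the samples that ended with the given outcome.
int count_result(const std::vector<Sample>& samples, WriteResult wanted) {
    int n = 0;
    for (const auto& s : samples) {
        if (s.result == wanted) {
            ++n;
        }
    }
    return n;
}
```

Running it once with a zero pause and once with `std::chrono::milliseconds(1)`, then comparing `count_result(samples, WriteResult::DiskIoError)` between the two runs, would show whether the errors depend on the gap between requests.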

MathieuBordere commented 1 year ago

Can you share your code to reproduce this?

fbrandherm commented 1 year ago

Sorry, but I can't share the full code, since it's a large project that uses dqlite as a backend behind a lot of other logic and isn't open-sourced (yet). I'm sure the bug could be reproduced with much simpler code, but I won't have time to write a minimal demo until the end of the month. I should note, however, that my code uses a custom client implemented in C++, which could also make a difference.

MathieuBordere commented 1 year ago

No problem, we'll try to reproduce this.