canonical / dqlite

Embeddable, replicated and fault-tolerant SQL engine.
https://dqlite.io
Other
3.83k stars 216 forks source link

Moving sqlite3_step and other database operations off the main thread #481

Open cole-miller opened 1 year ago

cole-miller commented 1 year ago

Right now, dqlite runs sqlite3_step and other database operations on the libuv main thread. This is problematic for disk mode, because such operations can take a long time when they need to write to disk, so that the event loop is blocked from handling other tasks (like those associated with incoming Raft messages). We also have some data from canonical/microk8s#3227 indicating that calls to sqlite3_step in particular frequently run for milliseconds to tens of milliseconds even for an an in-memory database, which is long enough to potentially cause problems when run on the main thread. We should fix our code to run sqlite3_step and perhaps other functions on the libuv thread pool, or in a dedicated thread.

cole-miller commented 1 year ago

I've been looking into how to implement this. My initial plan was to have dqlite spawn a dedicated thread on startup, communicating with the main thread via a queue, and have the main thread offload all sqlite3 calls that touch a database to this thread. This would preserve a current property of the codebase: all such database operations happen on a single thread (right now, the main thread). This would make it sound to do sqlite3_config(SQLITE_SINGLE_THREAD) in the dqlite process before startup, possibly improving performance.

This strategy has a problem: both sqlite3_step and our FSM implementation need to touch the VFS data structures, and the synchronous character of the raft_fsm interface forces us to perform the FSM-related VFS operations on the main thread. So if we want to run sqlite3_step on a dedicated thread, we need to make sure the VFS implementation is prepared for concurrent accesses, which it currently is not. I would like to fix that, but first I need to understand to what extent SQLite sychronizes its own calls to VFS functions -- i.e., imagining for the moment that we never touch the VFS data structures directly, does SQLite take care of serializing its own calls to xWrite, xShmMap, etc., or we have to use our own locks that we acquire in the implementation of those methods? @freeekanayaka, can you shed any light here?

(Instead of using a dedicated thread, we could run sqlite3_step on the libuv thread pool using uv_queue_work, but this still requires making the VFS thread-safe.)

freeekanayaka commented 1 year ago

Ok, everything I'm going to say starts from the assumption that this issue is caused by:

1) disk-mode being turned on in microk8s 2) kine implementation being inefficient at emulating etcd in terms of SQL

If either 1) or 2) is wrong, please let me know and probably what I'm going to say can be discarded and reading it just a waste of time :)

So, it feels that the concerns that we had raised when discussing disk-mode are actually realizing, I'd encourage everybody to re-read https://github.com/canonical/dqlite/issues/368 carefully, because I believe it mentions pretty much everything, from the issues that could happen with the "cheap" approach of disk-mode that has been put in place (i.e. simply storing the database on disk and using blocking I/O to access it) to the reasons why just moving sqlite3_step (or any other part of the call stack) to a thread won't work per se.

If you bear with letting me to take again a step back, and talk from a high-level architectural/design point of view, I'll just re-iterate what I've said since the beginning of the microk8s/kine/dqlite idea years ago:

I know I'm beating a dead horse, since years now, and apologies for the directness but unfortunately I feel things have been kind of put under the carpet and duck-taped all along, and eventually reality knocks the doors. I'm well aware that all the developers involved have little choice, since this all comes from management. But probably management never fully realized the technical ramifications of this endeavor.

Ok, enough ranting and big picture, sorry about that.

Regarding @cole-miller's questions, yes SQLite can absolutely "take care of serializing its own calls to xWrite, xShmMap, etc", or, to put it more precisely, SQLite supports several modes of operations with several thread-safety guarantees (see the docs) and in order to do that it will use locks. So SQLite and threads can be friends, no problems at all.

The issue though is the synchronous/asynchronous mismatch that you allude to. SQLite is synchronous in every aspect, in particular the VFS. While dqlite/raft are asynchronous, basically because the asynchronous approach is more efficient and simpler (in some way) when deal with the network, something SQLite does not have to do.

To deal with this mismatch we basically do something a bit tricky and unconventional, along the these lines:

Basically SQLite is being tricked, it makes some synchronous VFS calls, but the final actual result happens asynchronously. I really hope I'm explaining this well enough.

Note that this all works because:

1) real disk I/O is never performed by our VFS implementation. Disk I/O is only performed by raft, in an asynchronous way 2) everything happens in a single thread so our VFS implementation can make a lot of assumptions and checks about the current state (both of SQLite and the VFS/dqlite itself). without worrying about concurrency

Now, if you want to change 1) or 2) or both, the ramifications are quite deep.

Note that avoiding threads and context switches is what makes dqlite very fast and efficient, for example compared to other distributed SQLite implementations (see the blog post I pointed out a while back).

Introducing threads into the equation is going to degrade performance and increase complexity a lot, no matter the approach (running sqlite3_step in a thread, or just running disk I/O in a thread, or whatever).

We can discuss the details about the feasibility of that and how to achieve it, but I really hope we're all on the same page and understand it's a complex and deep change that is close to a rewrite of the core engine. That's what I was trying to say in #368, when the effort was started.

Thanks for bearing with me and reading this much, I hope this story was shorter but I don't think it really is.

MathieuBordere commented 1 year ago

Thanks for the feedback, so assumption 1) is wrong at the moment, microk8s is still using vanilla in-memory dqlite.

freeekanayaka commented 1 year ago

Thanks for the feedback, so assumption 1) is wrong at the moment, microk8s is still using vanilla in-memory dqlite.

Okay, then it might be sort of a kine-related issue, in the sense that the SQL-based emulation of the etcd model puts too much pressure on dqlite/SQLite. As mentioned in https://github.com/canonical/microk8s/issues/3227, having a better idea of what's going on at the kine level is probably going to help. At that point one would ideally know what's so heavy in kine, and maybe be able to come up with a different design.

freeekanayaka commented 1 year ago

By the way, why microk8s was that user saying that he started see problems only from a certain version onwards? What changed in that version? Or is he having the wrong impression?

cole-miller commented 1 year ago

but first I need to understand to what extent SQLite sychronizes its own calls to VFS functions -- i.e., imagining for the moment that we never touch the VFS data structures directly, does SQLite take care of serializing its own calls to xWrite, xShmMap, etc., or we have to use our own locks that we acquire in the implementation of those methods?

Regarding @‌cole-miller's questions, yes SQLite can absolutely "take care of serializing its own calls to xWrite, xShmMap, etc", or, to put it more precisely, SQLite supports several modes of operations with several thread-safety guarantees (see the docs) and in order to do that it will use locks. So SQLite and threads can be friends, no problems at all.

I've come around again to this question while working on something else related to disk mode, and I think my confusion is not 100% resolved yet. I understand that SQLite provides thread-safety guarantees for operations on database connections, when the appropriate settings are in place. That guarantee is relevant, for example, when you have two threads calling sqlite3_prepare_v2 against the same sqlite3 object concurrently. SQLite internally synchronizes access to its own data structures to make this kind of thing safe.

What I'm not clear on is what the guarantees/expectations are for a VFS implementation like ours that will mess around with its own data structures when SQLite calls into it to implement sqlite3_step or sqlite3_wal_checkpoint_v2 or etc. Does SQLite guarantee to the VFS implementor that these high-level operations are still thread-safe even if the VFS doesn't manually synchronize its own accesses to those private data structures? This is not obvious to me either way. The various unix VFSs certainly seem to do some internal locking, and so does the "quota" example VFS. On the other hand, I can't find an explicit statement in the documentation that out-of-tree VFSs are expected to do their own synchronization.

Of course I could be overlooking something obvious or just missing the point here.

freeekanayaka commented 1 year ago

but first I need to understand to what extent SQLite sychronizes its own calls to VFS functions -- i.e., imagining for the moment that we never touch the VFS data structures directly, does SQLite take care of serializing its own calls to xWrite, xShmMap, etc., or we have to use our own locks that we acquire in the implementation of those methods?

Regarding @‌cole-miller's questions, yes SQLite can absolutely "take care of serializing its own calls to xWrite, xShmMap, etc", or, to put it more precisely, SQLite supports several modes of operations with several thread-safety guarantees (see the docs) and in order to do that it will use locks. So SQLite and threads can be friends, no problems at all.

I've come around again to this question while working on something else related to disk mode, and I think my confusion is not 100% resolved yet. I understand that SQLite provides thread-safety guarantees for operations on database connections, when the appropriate settings are in place. That guarantee is relevant, for example, when you have two threads calling sqlite3_prepare_v2 against the same sqlite3 object concurrently. SQLite internally synchronizes access to its own data structures to make this kind of thing safe.

What I'm not clear on is what the guarantees/expectations are for a VFS implementation like ours that will mess around with its own data structures when SQLite calls into it to implement sqlite3_step or sqlite3_wal_checkpoint_v2 or etc. Does SQLite guarantee to the VFS implementor that these high-level operations are still thread-safe even if the VFS doesn't manually synchronize its own accesses to those private data structures? This is not obvious to me either way. The various unix VFSs certainly seem to do some internal locking, and so does the "quota" example VFS. On the other hand, I can't find an explicit statement in the documentation that out-of-tree VFSs are expected to do their own synchronization.

Of course I could be overlooking something obvious or just missing the point here.

The VFS interface basically maps more or less directly to the equivalent syscalls provided by the OS, for instance xRead() maps to read(), and virtually has the same expectations concurrency-wise.

Note that when in WAL mode the methods that effectively matter in this regard are effectively only xRead() and xWrite(), with xWrite() guaranteed to be append-only (except for checkpoints) and hence not interleave with xRead() for the same pages (and when checkpoints are run the non-interleaving guarantee is still preserved by only checkpointig pages that have no active readers). Of course xRead() calls can interleave for the same pages.

So normal OS semantics is what SQLite expects, and on top of that you can add the SQLite-specific read/write pattern described above to optionally tighten those expectations when writing a VFS (although that's technically a bit of a trick since it relies on the SQLite data access patterns). If a VFS has internal data structures that it needs to synchronize access to, that's an implementation detail transparent to SQLite.

cole-miller commented 1 year ago

The VFS interface basically maps more or less directly to the equivalent syscalls provided by the OS, for instance xRead() maps to read(), and virtually has the same expectations concurrency-wise.

Yes, this makes sense to me, and the analogy to syscalls is part of what made me suspect that internal locking is still required for our own VFS data structures.

If a VFS has internal data structures that it needs to synchronize access to, that's an implementation detail transparent to SQLite.

This is mainly what I was wondering about, and it sounds like we came to the same conclusion here :)