Closed cole-miller closed 1 year ago
(unix-excl
has been included in SQLite since v3.7.6 from 2011; Focal has v3.31.1.)
We still need to touch the wal-index directly in some of our other VFS code, e.g. vfsAmendWalIndexHeader, but hopefully it's possible to port that code over to use the VFS shm accessors (there are probably complications I haven't thought of here).
We still need to touch the wal-index directly in some of our other VFS code, e.g. vfsAmendWalIndexHeader, but hopefully it's possible to port that code over to use the VFS shm accessors (there are probably complications I haven't thought of here).
Specifically, we would have to go through the unix-excl
xShmMap. Calling that from another VFS method might be questionable, I'm not sure. Incidentally https://github.com/sqlite/sqlite/commit/3cb9339a6c59939e29278af733bc3fe868360db7 is a helpful (if outdated) commit to look at on the SQLite side.
We still need to touch the wal-index directly in some of our other VFS code, e.g. vfsAmendWalIndexHeader, but hopefully it's possible to port that code over to use the VFS shm accessors (there are probably complications I haven't thought of here).
Indeed, we do need direct control on the "mmap"-ed wal-index, so I think using the shm-related VFS methods of unix-excl
won't cut it, and the amount of code that would be saved is probably relatively small.
Note that running sqlite3 operations off the main thread would come with a lot of changes throughout the entire dqlite codebase and would probably be close to a rewrite.
What problem are we trying to solve here?
We still need to touch the wal-index directly in some of our other VFS code, e.g. vfsAmendWalIndexHeader, but hopefully it's possible to port that code over to use the VFS shm accessors (there are probably complications I haven't thought of here).
Specifically, we would have to go through the
unix-excl
xShmMap. Calling that from another VFS method might be questionable, I'm not sure. Incidentally sqlite/sqlite@3cb9339 is a helpful (if outdated) commit to look at on the SQLite side.
Yes, in dqlite too we only use heap memory for xShmMap.
So I've been investigating this specifically in the context of running sqlite3_wal_checkpoint_v2 on the thread pool in disk mode. I think this poses fewer fundamental challenges than doing the same with sqlite3_step, although it's certainly possible that I'm overlooking something. In any case, I've been looking into how extensively we'd have to modify our VFS to make this safe.
Indeed, we do need direct control on the "mmap"-ed wal-index, so I think using the shm-related VFS methods of unix-excl won't cut it, and the amount of code that would be saved is probably relatively small.
For reference, I have a branch implementing this idea (incomplete, the disk-mode VFS is not updated yet, but enough to see what kind of changes are entailed I think), using xShmMap to access the wal-index: https://github.com/cole-miller/dqlite/tree/unix-excl. It's certainly possible that I'm missing something, but I'm optimistic: the only places where we need to touch the wal-index ourselves are in functions that don't belong to the sqlite3_vfs or sqlite3_file vtables [edit: except we call vfsAmendWalIndexHeader from vfsFileControlCommitPhaseTwo], which disposes of my biggest worry (that we'd have reentrancy problems/deadlocks).
You're right that the raw number of lines of code that this would allow us to delete is not impressive, but it seems like a win to me if we can get SQLite to do more work for us in an area of the code that's tightly coupled to SQLite -- especially if we're already revisiting that code with concurrency in mind. Does that reasoning make sense? I'm certainly open to being convinced otherwise, especially if there's a problem with that I've overlooked here.
So I've been investigating this specifically in the context of running sqlite3_wal_checkpoint_v2 on the thread pool in disk mode.
My question was more: why you think you need disk mode at all in the first place?
As far as I understand it "disk mode" means "changing the dqlite raft FSM to be disk-based instead of memory-based".
We had a few calls and mail exchanges, but as an outsider, I still don't know what use case Canonical has that makes such a change an immediate hard requirement which can't even be deferred. I understand if we're talking about some private project whose details can't be disclosed. However my personal feeling is that if there is a use case for which dqlite doesn't work in its current form, there are also chances that other databases might be a better fit for that use case.
Note that there are ways to minimize the cpu/disk/memory impact of snapshots without having to switch to a disk-based FSM (e.g. taking and sending snapshots in chunks, which would be a relatively easy improvement). A disk-based FSM (aka disk mode) is only needed for workloads that have data sets that don't fit in memory (and again, chances are that for many of those workloads other databases might be a better fit).
All in all, before deciding for any solution I think it'd be important to have numbers and details about the problem being faced.
I think this poses fewer fundamental challenges than doing the same with sqlite3_step, although it's certainly possible that I'm overlooking something. In any case, I've been looking into how extensively we'd have to modify our VFS to make this safe.
I didn't yet reason about running sqlite3_wal_checkpoint_v2 in a separate thread, and the implications it has. However the current form of disk mode where the VFS makes blocking I/O calls against the file system is already problematic in my mind as outlined in this comment.
Indeed, we do need direct control on the "mmap"-ed wal-index, so I think using the shm-related VFS methods of unix-excl won't cut it, and the amount of code that would be saved is probably relatively small.
For reference, I have a branch implementing this idea (incomplete, the disk-mode VFS is not updated yet, but enough to see what kind of changes are entailed I think), using xShmMap to access the wal-index: https://github.com/cole-miller/dqlite/tree/unix-excl. It's certainly possible that I'm missing something, but I'm optimistic: the only places where we need to touch the wal-index ourselves are in functions that don't belong to the sqlite3_vfs or sqlite3_file vtables [edit: except we call vfsAmendWalIndexHeader from vfsFileControlCommitPhaseTwo], which disposes of my biggest worry (that we'd have reentrancy problems/deadlocks).
You're right that the raw number of lines of code that this would allow us to delete is not impressive, but it seems like a win to me if we can get SQLite to do more work for us in an area of the code that's tightly coupled to SQLite -- especially if we're already revisiting that code with concurrency in mind. Does that reasoning make sense? I'm certainly open to being convinced otherwise, especially if there's a problem with that I've overlooked here.
As far as I understand the code that would be saved is essentially the one in the vfsShmMap function, which seems already short and straightforward. Keeping the current code gives us the advantage of being able to perform assertions in order to validate the expectations we have about the use that SQLite does of this API (should that change in a way that matters to dqlite). Leveraging unix-excl would come with a bit more indirection than we have now. All in all, I'm not that sure it'd be a net win, but YMMV.
So I've been investigating this specifically in the context of running sqlite3_wal_checkpoint_v2 on the thread pool in disk mode.
I'm not sure if one the goals of this investigation was to implement incremental snapshots, along the lines of @MathieuBordere described, maybe not, but if we consider only incremental snapshots for the current in-memory FSM (i.e. not disk mode) it shouldn't be necessary to run sqlite3_wal_checkpoint_v2
in the threadpool.
A possible approach could be roughly something like:
sqlite3_wal_checkpoint_v2
in the main thread normally, as we do now.xWrite
calls in our VFS, those should behave just as they behave now (i.e. they overwrite pages in the in-memory database file), but on top of that behavior our VFS should also record the index of all pages that are being updated by the checkpoint. This is similar to what we already to to record write transactions that need to be replicated via raft.sqlite3_wal_checkpoint_v2
returns, we would run a job in libuv's threadpool that updates the database file contained in the last snapshot, using the updated pages from the in-memory database file that have indexes matching the recorded ones.<snapshot-filename>.journal
) all the current non-updated pages in the on-disk snapshot database file that are going to be replaced with new pages copied from the in-memory database file. Then the job would start to actually update the existing snapshot and rename it when done. The journal file could be used to make it possible to rollback the snapshot file to its original content in case raft crashes before the job fully completes the update, and the rollback would take place when raft restarts.I'm not sure if one the goals of this investigation was to implement incremental snapshots, along the lines of @MathieuBordere described, maybe not, but if we consider only incremental snapshots for the current in-memory FSM (i.e. not disk mode) it shouldn't be necessary to run
sqlite3_wal_checkpoint_v2
in the threadpool.
It wasn't, we were thinking about it in context of disk-mode.
SQLite provides a built-in
unix-excl
VFS implementation, documented as follows:(emphasis mine)
(from the docstring for
unixFileLock
in sqlite3.c -- "system locking" here refers to file locking afaict)I include the second quote to clarify that
unix-excl
does support concurrent connections from multiple threads in the same process, which I previously believed it didn't (I mixed it up withPRAGMA locking_mode=EXCLUSIVE
).It sounds like we could use unix-excl to replace our own in-memory implementation of the shm-related VFS methods. Besides allowing us to delete some code, this would simplify our lives if/when we decide to run some sqlite3 operations (sqlite3_step or sqlite3_wal_checkpoint_v2) off the main thread.
Does this proposal make sense, or am I missing something?