Can we use the unix-excl VFS?

cole-miller commented 1 year ago

SQLite provides a built-in unix-excl VFS implementation, documented as follows:

unix-excl - obtains and holds an exclusive lock on database files, preventing other processes from accessing the database. Also keeps the wal-index in heap rather than in shared memory.

(emphasis mine)

If the pFile was opened read/write from unix-excl, then the only lock ever obtained is an exclusive lock, and it is obtained exactly once the first time any lock is attempted. All subsequent system locking operations become no-ops. Locking operations still happen internally, in order to coordinate access between separate database connections within this process, but all of that is handled in memory and the operating system does not participate.

(from the docstring for unixFileLock in sqlite3.c -- "system locking" here refers to file locking afaict)

I include the second quote to clarify that unix-excl does support concurrent connections from multiple threads in the same process, which I previously believed it didn't (I mixed it up with PRAGMA locking_mode=EXCLUSIVE).

It sounds like we could use unix-excl to replace our own in-memory implementation of the shm-related VFS methods. Besides allowing us to delete some code, this would simplify our lives if/when we decide to run some sqlite3 operations (sqlite3_step or sqlite3_wal_checkpoint_v2) off the main thread.

Does this proposal make sense, or am I missing something?

cole-miller commented 1 year ago

(unix-excl has been included in SQLite since v3.7.6 from 2011; Focal has v3.31.1.)

cole-miller commented 1 year ago

We still need to touch the wal-index directly in some of our other VFS code, e.g. vfsAmendWalIndexHeader, but hopefully it's possible to port that code over to use the VFS shm accessors (there are probably complications I haven't thought of here).

cole-miller commented 1 year ago

We still need to touch the wal-index directly in some of our other VFS code, e.g. vfsAmendWalIndexHeader, but hopefully it's possible to port that code over to use the VFS shm accessors (there are probably complications I haven't thought of here).

Specifically, we would have to go through the unix-excl xShmMap. Calling that from another VFS method might be questionable, I'm not sure. Incidentally https://github.com/sqlite/sqlite/commit/3cb9339a6c59939e29278af733bc3fe868360db7 is a helpful (if outdated) commit to look at on the SQLite side.

freeekanayaka commented 1 year ago

We still need to touch the wal-index directly in some of our other VFS code, e.g. vfsAmendWalIndexHeader, but hopefully it's possible to port that code over to use the VFS shm accessors (there are probably complications I haven't thought of here).

Indeed, we do need direct control on the "mmap"-ed wal-index, so I think using the shm-related VFS methods of unix-excl won't cut it, and the amount of code that would be saved is probably relatively small.

Note that running sqlite3 operations off the main thread would come with a lot of changes throughout the entire dqlite codebase and would probably be close to a rewrite.

What problem are we trying to solve here?

freeekanayaka commented 1 year ago

We still need to touch the wal-index directly in some of our other VFS code, e.g. vfsAmendWalIndexHeader, but hopefully it's possible to port that code over to use the VFS shm accessors (there are probably complications I haven't thought of here).

Specifically, we would have to go through the unix-excl xShmMap. Calling that from another VFS method might be questionable, I'm not sure. Incidentally sqlite/sqlite@3cb9339 is a helpful (if outdated) commit to look at on the SQLite side.

Yes, in dqlite too we only use heap memory for xShmMap.

cole-miller commented 1 year ago

So I've been investigating this specifically in the context of running sqlite3_wal_checkpoint_v2 on the thread pool in disk mode. I think this poses fewer fundamental challenges than doing the same with sqlite3_step, although it's certainly possible that I'm overlooking something. In any case, I've been looking into how extensively we'd have to modify our VFS to make this safe.

Indeed, we do need direct control on the "mmap"-ed wal-index, so I think using the shm-related VFS methods of unix-excl won't cut it, and the amount of code that would be saved is probably relatively small.

For reference, I have a branch implementing this idea (incomplete, the disk-mode VFS is not updated yet, but enough to see what kind of changes are entailed I think), using xShmMap to access the wal-index: https://github.com/cole-miller/dqlite/tree/unix-excl. It's certainly possible that I'm missing something, but I'm optimistic: the only places where we need to touch the wal-index ourselves are in functions that don't belong to the sqlite3_vfs or sqlite3_file vtables [edit: except we call vfsAmendWalIndexHeader from vfsFileControlCommitPhaseTwo], which disposes of my biggest worry (that we'd have reentrancy problems/deadlocks).

You're right that the raw number of lines of code that this would allow us to delete is not impressive, but it seems like a win to me if we can get SQLite to do more work for us in an area of the code that's tightly coupled to SQLite -- especially if we're already revisiting that code with concurrency in mind. Does that reasoning make sense? I'm certainly open to being convinced otherwise, especially if there's a problem with that I've overlooked here.

freeekanayaka commented 1 year ago

So I've been investigating this specifically in the context of running sqlite3_wal_checkpoint_v2 on the thread pool in disk mode.

My question was more: why you think you need disk mode at all in the first place?

As far as I understand it "disk mode" means "changing the dqlite raft FSM to be disk-based instead of memory-based".

We had a few calls and mail exchanges, but as an outsider, I still don't know what use case Canonical has that makes such a change an immediate hard requirement which can't even be deferred. I understand if we're talking about some private project whose details can't be disclosed. However my personal feeling is that if there is a use case for which dqlite doesn't work in its current form, there are also chances that other databases might be a better fit for that use case.

Note that there are ways to minimize the cpu/disk/memory impact of snapshots without having to switch to a disk-based FSM (e.g. taking and sending snapshots in chunks, which would be a relatively easy improvement). A disk-based FSM (aka disk mode) is only needed for workloads that have data sets that don't fit in memory (and again, chances are that for many of those workloads other databases might be a better fit).

All in all, before deciding for any solution I think it'd be important to have numbers and details about the problem being faced.

I think this poses fewer fundamental challenges than doing the same with sqlite3_step, although it's certainly possible that I'm overlooking something. In any case, I've been looking into how extensively we'd have to modify our VFS to make this safe.

I didn't yet reason about running sqlite3_wal_checkpoint_v2 in a separate thread, and the implications it has. However the current form of disk mode where the VFS makes blocking I/O calls against the file system is already problematic in my mind as outlined in this comment.

Indeed, we do need direct control on the "mmap"-ed wal-index, so I think using the shm-related VFS methods of unix-excl won't cut it, and the amount of code that would be saved is probably relatively small.

For reference, I have a branch implementing this idea (incomplete, the disk-mode VFS is not updated yet, but enough to see what kind of changes are entailed I think), using xShmMap to access the wal-index: https://github.com/cole-miller/dqlite/tree/unix-excl. It's certainly possible that I'm missing something, but I'm optimistic: the only places where we need to touch the wal-index ourselves are in functions that don't belong to the sqlite3_vfs or sqlite3_file vtables [edit: except we call vfsAmendWalIndexHeader from vfsFileControlCommitPhaseTwo], which disposes of my biggest worry (that we'd have reentrancy problems/deadlocks).

You're right that the raw number of lines of code that this would allow us to delete is not impressive, but it seems like a win to me if we can get SQLite to do more work for us in an area of the code that's tightly coupled to SQLite -- especially if we're already revisiting that code with concurrency in mind. Does that reasoning make sense? I'm certainly open to being convinced otherwise, especially if there's a problem with that I've overlooked here.

As far as I understand the code that would be saved is essentially the one in the vfsShmMap function, which seems already short and straightforward. Keeping the current code gives us the advantage of being able to perform assertions in order to validate the expectations we have about the use that SQLite does of this API (should that change in a way that matters to dqlite). Leveraging unix-excl would come with a bit more indirection than we have now. All in all, I'm not that sure it'd be a net win, but YMMV.

freeekanayaka commented 1 year ago

So I've been investigating this specifically in the context of running sqlite3_wal_checkpoint_v2 on the thread pool in disk mode.

I'm not sure if one the goals of this investigation was to implement incremental snapshots, along the lines of @MathieuBordere described, maybe not, but if we consider only incremental snapshots for the current in-memory FSM (i.e. not disk mode) it shouldn't be necessary to run sqlite3_wal_checkpoint_v2 in the threadpool.

A possible approach could be roughly something like:

Run sqlite3_wal_checkpoint_v2 in the main thread normally, as we do now.
That will trigger xWrite calls in our VFS, those should behave just as they behave now (i.e. they overwrite pages in the in-memory database file), but on top of that behavior our VFS should also record the index of all pages that are being updated by the checkpoint. This is similar to what we already to to record write transactions that need to be replicated via raft.
After sqlite3_wal_checkpoint_v2 returns, we would run a job in libuv's threadpool that updates the database file contained in the last snapshot, using the updated pages from the in-memory database file that have indexes matching the recorded ones.
In particular, this should be done in a transactional way. For example the job could first copy to a "journal" file (e.g. <snapshot-filename>.journal) all the current non-updated pages in the on-disk snapshot database file that are going to be replaced with new pages copied from the in-memory database file. Then the job would start to actually update the existing snapshot and rename it when done. The journal file could be used to make it possible to rollback the snapshot file to its original content in case raft crashes before the job fully completes the update, and the rollback would take place when raft restarts.

MathieuBordere commented 1 year ago

I'm not sure if one the goals of this investigation was to implement incremental snapshots, along the lines of @MathieuBordere described, maybe not, but if we consider only incremental snapshots for the current in-memory FSM (i.e. not disk mode) it shouldn't be necessary to run sqlite3_wal_checkpoint_v2 in the threadpool.

It wasn't, we were thinking about it in context of disk-mode.

canonical / dqlite

Can we use the unix-excl VFS? #528