canonical / dqlite

Embeddable, replicated and fault-tolerant SQL engine.
https://dqlite.io
Other
3.77k stars 212 forks source link

Add file locking to prevent two dqlite instances from using the same directory concurrently #656

Closed cole-miller closed 1 week ago

cole-miller commented 1 month ago

It was suggested that some of the corruption issues reported by LXD users might be due to two LXD daemon processes running concurrently, causing two dqlite instances to concurrently modify the common data directory. LXD already has some mutual exclusion logic to prevent two daemons from running at the same time:

https://github.com/canonical/lxd/blob/1514a400f11a82b90a5294b5c1f31cd2c6dd9311/lxd/endpoints/socket.go#L39

But even so it seems worth it to do some file locking to prevent this definitively on the dqlite side.

Note that this obviously won't work unless both the contending processes are running versions of dqlite that include the locking.

Signed-off-by: Cole Miller cole.miller@canonical.com

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 82.69231% with 9 lines in your changes missing coverage. Please review.

Project coverage is 77.42%. Comparing base (6633bd8) to head (294464b). Report is 29 commits behind head on master.

Files Patch % Lines
src/server.c 73.52% 7 Missing and 2 partials :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #656 +/- ## ======================================= Coverage 77.41% 77.42% ======================================= Files 196 196 Lines 27269 27311 +42 Branches 5455 5455 ======================================= + Hits 21111 21146 +35 - Misses 4297 4301 +4 - Partials 1861 1864 +3 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

cole-miller commented 1 month ago

cc @tomponline

tomponline commented 1 month ago

Thanks for this @cole-miller

Once this is merged it'll be auto integrated on next build into LXD's latest/edge channel, and then tested through our CI as well as our daily snap CI (https://github.com/canonical/lxd-ci/actions/workflows/tests.yml?query=event%3Aschedule)

So we'll see if there are any breakages then.

Also thanks for checking if LXD checks for existing processes. I suppose it is possible potentially for LXD's unix socket to have been closed before the DB is closed and thus allow a new LXD process to start and open the DB concurrently - ill double check this our side.

tomponline commented 1 month ago

@cole-miller seems like LXD does correctly shutdown its local unix listener after closing the DB:

https://github.com/canonical/lxd/blob/main/lxd/daemon.go#L1899-L1917

cole-miller commented 1 month ago

Thanks @tomponline, in that case it's unlikely that concurrent use of the directory is the issue for LXD. I should have a PR up soon that adds the additional instrumentation we've discussed, so if/when the problem recurs we can understand it better. In the meantime I'm inclined to land this change (pending review) since it eliminates a footgun for users of dqlite that don't have their own exclusiveness checking.