NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1
12.12k stars 1.47k forks source link

sqlite VFS change causes db corruption when mixing nix 2.5+ and older versions when WAL is disabled #7396

Open gbpdt opened 1 year ago

gbpdt commented 1 year ago

TL;DR

nix 2.5+ uses a different locking mechanism from 2.3 when sqlite WAL is disabled, leading to corruption when the two versions are mixed. It'd be great to at least have a way to control this via config.

Description of the Problem

We've been seeing sqlite database corruption when mixing nix 2.3.11 and nix 2.5+ on the same nix stores for some time now, and we believe we have finally tracked down the cause to this change: https://github.com/NixOS/nix/pull/5475, which switches the sqlite VFS implementation from the default "unix" to "unix-dotfile" when sqlite WAL is disabled in Nix (use-sqlite-wal = false). This change appears to be motivated by this issue: https://github.com/NixOS/nix/issues/2357.

The overall problem is simply that the two different VFS implementations use completely different locking mechanisms: "unix" uses POSIX advisory locks, "unix-dotfile" simply uses the creation of a directory on the filesystem. Hence if you have concurrent use of the two on the same Nix store, they will each be making changes to the database with no awareness of the other, leading to destructive updates and corruption. In our case, this has manifested in different ways: Nix failing with malformed database errors, the database getting out of sync with paths on disk, and sqlite integrity_check and foreign_key_check failures. We have now patched out the above change from our 2.5+ build of Nix (specifically 2.11.1), and we no longer see any problems mixing this with nix 2.3.11.

We're filing this issue for two reasons: firstly as information for anyone who might hit similar issues, and secondly to debate whether "unix-dotfile" is really a good choice in this case, although we appreciate it might be difficult to change.

Possible Improvements

On the first point (information), this VFS change does not appear to be mentioned in the Nix 2.5 release notes (https://nixos.org/manual/nix/stable/release-notes/rl-2.5.html). Is it possible to update them retrospectively? Nix 2.3 is (I believe) still supported for use in NixOS, even with NixOS 22.11, so it seems somewhat reasonable to try to use it with later versions (although most users will likely have sqlite WAL enabled so will not hit this). I should note that this change was backported to Nix 2.3 in the 2.3-maintenance branch (https://github.com/NixOS/nix/commit/addb547b9dab2590a27c37d7605bdd5064701fb5), but this was after the most recent maintenance release (2.3.16), so there isn't a released version of Nix 2.3 that works well with 2.5+, to my knowledge

On the face of it, this change seems quite similar to the change of the Nix store locking mechanism from POSIX to BSD locks in Nix 2.3, which was well advertised in the release notes (https://nixos.org/manual/nix/stable/release-notes/rl-2.3.html).

On the second point (whether this change is a good one), I'll first note that it appears to be a well-intentioned improvement to fix problems when using Nix stores on NFS and on WSL. However we have been using Nix stores on NFS (v3, on Redhat Enterprise Linux, with WAL disabled) for many years at this point and have not seen any database corruption at all, until we hit this issue. I expect there are real problems to solve here (some of the linked threads mention NFS issues on MacOS, and presumably the WSL problems mentioned are real too, but perhaps separate from NFS) but from our experience it's not the case that all NFS users need this change.

One question is whether the disabling of sqlite WAL is really the right 'switch' to use to determine when to change VFS to "unix-dotfiles". NFS users will definitely disable WAL, and apparently so should WSL (1?) users, but it's not necessarily the case that every user disabling WAL is using NFS or WSL, and even if they are they might not want the "unix-dotfile" VFS. The mailthread linked in the original issue that appears to recommend using "unix-dotfile" for NFS doesn't seem to be conclusive. Here's a working link: https://sqlite-users.sqlite.narkive.com/crBlI5hG/sqlite-safe-sqlite-over-remote-filesystem. In particular, the tone of it seems to be that this should only be done if problems are encountered, and users report sharing sqlite databases over NFS with no issues.

This might be moot if the "unix-dotfiles" VFS was a good alternative to "unix" with similar features, but it doesn't appear to be. The sqlite codebase describes it as follows:

The dotfile locking implementation uses the existence of separate lock files (really a directory) to control access to the database. This works on just about every filesystem imaginable. But there are serious downsides:

(1) There is zero concurrency. A single reader blocks all other connections from reading or writing the database.

(2) An application crash or power loss can leave stale lock files sitting around that need to be cleared manually.

Nevertheless, a dotlock is an appropriate locking mode for use if no other locking strategy is available.

(https://github.com/sqlite/sqlite/blob/a42281a312459d266d4a2189c43b80e83d2ab8e7/src/os_unix.c#L2201-L2221)

The first point about only supporting exclusive locking is quite a big deal for us (we do a lot of concurrent nix operations on a shared store). From our experience, the "unix" strategy (POSIX locks) works fine, so downgrading to such a basic locking strategy really isn't desirable. The second point about maintenance in case of stale lock files is also pertinent for us.

From our perspective, although we can of course maintain a patch to nix to revert the VFS used to "unix", we'd much rather if there was at least an option to control this via configuration. Ideally, this option would default still to "unix" even when WAL is disabled, so that users don't have to pay the performance penalty if they don't need it. It's definitely debatable what the right default is now though, since any usage of more recent Nix versions in the wild with WAL disabled will now be using "unix-dotfile", and any change back to "unix" will cause further compatibility issues. At a minimum, I think we should document the tradeoffs here and highlight that users with WAL disabled may wish to review their settings.

We're happy to contribute code or documentation fixes that would help here. As always, many thanks to all contributors for their hard work on Nix!

thufschmitt commented 1 year ago

Wow, must have been an interesting debugging trip.

Yes, I think having this as a configuration option sounds fair.

Since using wal and the unix-* locking methods seem to be exclusive in practice, would it make sense to have a single option (sqlite-integrity-method?) that could be set to either wal, lock:unix or lock:unix-dotfile or something like that?

gbpdt commented 1 year ago

Thanks @thufschmitt . That would certainly work for us.

One alternative might be to expose the sqlite VFS option more directly rather than tying it directly to WAL, e.g. have an sqlite-vfs option which defaults to unix when use-sqlite-wal is true, and defaults to unix-dotfile when use-sqlite-wal is false, to preserve current behaviour. That way we can just set use-sqlitel-wal=false; sqlite-vfs=unix to get our desired behaviour back. This would also allow people to choose other VFS options if they wish (https://www.sqlite.org/vfs.html).

Happy to draft a PR based on one of these options.