Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent

GNU General Public License v3.0

625 stars 56 forks source link

RFC: bees startup and distro integration #244

Open Zygo opened 1 year ago

Zygo commented 1 year ago

This is collecting a few notes together on how bees should work, as opposed to how it currently does work.

How does it work now?

Currently there are some scripts which make key assumptions:

When users install bees, they want it to start working immediately
Users want bees to run on every btrfs filesystem their host encounters
We can guess how much RAM users want to use
We can guess what the right mount options are
Users enjoy referring to btrfs filesystems by UUID
Users are unwilling to answer one or two basic questions during installation
We can figure everything out at run time while the system is booting

but there are problems:

Quite a few users don't want that, and no other btrfs deduper works that way
There are definitely some filesystems that bees shouldn't run on with default options, e.g. removable media controlled by hostile users
There is no single RAM size, or RAM size computation function that works for everyone, or even a majority of users
It turns out to be hard to discover what mount options to use at runtime for an arbitrary filesystem, especially if it's already mounted in namespaces we can't see
Some users want absolutely static configuration, including names for their filesystems
One or two questions are fine for non-automated installations, and it should be possible to answer the questions after installation
If we get something wrong, there's no way to fix it without editing a script, and sometimes the runtime environment is unfriendly (e.g. no writable /tmp that a shell script could use as a scratch space).

How should it work?

Users should opt in to running bees

Merely installing the package on a general-use distro should not cause bees to run without user opt-in. Opt-in could be a question asked during the install, or using the distro's automation infrastructure to provide an answer, or deriving the answer from optional metapackage dependencies. e.g. the bees-autorun package might depend-on the bees package, where the bees-autorun contains a default startup script and bees contains the bees binary, and users opt into running bees by default at startup by installing bees-autorun.

If the distro is specialized (e.g. designed for NAS builds) then the opt-in could be implied by the choice to use the specialized distro.

Multiple btrfs is a thing

Each btrfs filesystem mounted on a host may have different constraints, where some are suitable for dedupe and some are not.

Each btrfs filesystem requires a separate bees instance, which multiplies resource usage (especially RAM cache size) accordingly.

A user with multiple filesystems might reasonably expect to choose which filesystems to run bees on, and which not.

Private namespaces are good

Ideally we fully sandbox bees so it can't access anything outside of the filesystem it's deduping (and also BEESHOME because it might be on some other FS). This is relatively easy if we have the service mount the root subvol in a private namespace for bees to work on, and arguably bees should start doing that itself if we can figure out a way to make that happen without crashing in libc.

Public namespaces are also good

Ideally, if the user already has subvol=/ mounted somewhere (e.g. for backups or btdu), bees would be able to simply reuse that mount point.

Tracking mounts is hard, especially in the general case

Many users want to be able to mount and umount their filesystems[citation needed]. Ideally, we'd be able to start bees after the mounting and stop bees just before the umounting (though just after can also work in some cases).

One possible model is that we'd track all of the points where a filesystem is mounted, excluding the one bees is mounted on, across all namespaces including the private ones, and stop the service when bees holds the last remaining mount point of the filesystem. It's easy to tell when a filesystem is completely umounted, but hard to tell when exactly one mount point remains (or N mount points, if there are N tools like bees that users want to run at the same time, and they can't all share the mount point for some reason). This isn't a good model--only the kernel knows all the places where a btrfs filesystem is mounted, and bees can add delays on umount, especially for a "clean" (SIGTERM) bees shutdown where we might significantly delay umount for a slow removable disk.

Another model is to have the user bring the filesystem to bees. In this model, when users want bees to run on a filesystem, they must arrange for the filesystem to be mounted at /run/bees/UUID/mount. The bees service would then start and stop depending on whether that mount point was mounted, and normal inter-service dependencies can control the mount point.

Service dependencies could also work the other way, triggering the bees mount point and the bees instance when some other filesystem is mounted or umounted. This might be a model that distro maintainers can adapt to the quirks of their installer's filesystem layout, or advanced users might construct themselves. This suggests that we should stick to a common "run one bees instance" script and keep it loosely integrated with different bees instance generator scripts that can be swapped out for common use cases like "run everywhere by default" or "run on / and /home but not /var/lib/docker" or "only run on /var/lib/docker".

bees must be careful to avoid changing btrfs mount options unintentionally

bees could host an upstream configuration tool which takes a btrfs mount point and sets up /etc/fstab or systemd mount units for it. Using a mount point (as opposed to the filesystem UUID) means the tool can copy the options from the user's existing mount point, which means a lower chance of getting them wrong (e.g. accidentally changing whether compression or autodefrag is enabled for the filesystem). On the other hand, now there are two copies of the mount options in the system to maintain.

Users should have full control over the configuration after a default config is created

We can provide a tool which automatically generates service configuration, but once generated, the user should be able to edit or correct the output of the tool, or send a PR to improve the tool to make better configurations. This implies a compilation model rather than a framework model.

Users should choose the hash table size and RAM usage

bees's constant-size hash table is a central feature of the bees design--it uses the same amount of RAM no matter how large the filesystem gets. Different users will want different size-performance trade-offs (e.g. 1% for casual users on their laptops, 25% for dedicated storage boxes with huge RAID arrays) and therefore different hash table sizes.

We could adopt the PostgreSQL model, where the package installs a configuration with a minimal RAM cache size, a few dozen MB at most. Most users will increase it to something larger, typically 10-75% of RAM depending on how important DBMS performance is relative to other tasks in the host's workload, divided by the number of instances they intend to run.

We can also improve bees's efficiency to make better use of tiny hash table sizes, so that an affordable size like 16M could be "good enough" for casual users with a half-filled single 1TB SSD.

mjt0k commented 1 year ago

Thank you for the write-up. I think it is actually much simpler than this, or might be to begin with.

I completely agree it should NOT be like install-and-it-all-started-deduping. This is wrong in my view, users should configure each bees instance separately. No, I don't think extra complications like an additional "bees-autorun"-type package is needed, - it is trivial to run `systemctl enable bees@UUID' or somesuch — which should take care of everything.

How I think it should look like.

While it should not start for every possible btrfs filesystem automatically, it should start with some sane default configuration automatically when enabled. In other words: do not auto-enable it for everything, but run it with some default parameters when user asks to do so. Current beesd insists on a configuration file for each filesystem — this is what I want to avoid. Instead, it should start with default values for a given filesystem, with a possibility to add a small tweak later.
When initially configuring an instance, it should be possible to specify some parameters for it, especially for the database size. I view it like bees --configure --db-size=foo ... $UUID — it will create the systemd service instance if not exists already, it will create the bees database of the given size (or changes the size if it already exists), and it will record other options into /etc/bees/$UUID.conf.
The same command can be used to REconfigure the database size. To prevent an active bees instance from damage this command should lock the file before resizing it, and notify the running instance about the changes.
Mount. So far I think it is okay to do /run/bees/$UUID/mount automatic mounting, in main (non-private) namespace, for the time being. This mount is visible so it should be possible to find out why the filesystem is busy when other mountpoints are unmounted, for one. This mount can be used to adjust the database size (above). Maybe we can do something with it in the future, but for now I think it is okay to make it visible.
The mount should be unmounted when bees stops. Either with ExecStop command in the systemd unit, or by bees itself when it exits (the latter is a bit more fragile as it can die due to a bug, leaving the mount hanging around).
Always use the extra mount in /run/bees/$UUID/mount/ by default for now — the logic for finding an existing root is fragile at best and will change when user modifies their system. It can be overridden with --root option (recorded in the config file!), in this case mounting is not needed.
The configuration file(s). Current situation is rather illogical and annoying. Configuration file should be optional, because the only actually needed configuration (besides the enabling of the service for the given UUID) is the database size, which is done by creating the file, there's no need to specify its size in config file. Currently beesd insists on a file with UUID=foo in /etc/bees/*.conf, — at least it can read /etc/bees/$UUID.conf directly (after reading /etc/bees/bees.conf) and not require UUID= at all.
There's actually nothing to specify in configuration most of the time. The DB size — it is the file created, to reconfigure run bees --configure --db-size=foo to do it in a safe way. Other parameters (--no-timestamps should be the default if run from systemd, debugging should be quieted by default) should have sane defaults. BEESHOME (it should be a command-line argument not a variable!) is one of the parameters which can be specified in the command line if needed.
There's no need for the wrapper script at all. Yes it is easier to read the conf file with shell, but with just a few vars in there (not options but vars! - maybe named after options) it is trivial to read it from main bees executable too. This way bees becomes self-contained entirely. And with these I'm not even sure default config is needed, — is there a need to set, say, global VERBOSE= option?

The plan is to implement the necessary options (--home, --root, defaults to /run/bees/$UUID). Implement --configure to create .beeshome (or --home) and set size (later to notify running daemon, or just to refuse to operate if it is running) — not necessary to enable the systemd service, this can be done explicitly. Implement reading /etc/bees/$UUID.conf for ROOT/HOME/VERBOSE/etc stuff (and writing a skeleton file at --configure). Choose some sane db size for --configure instead of the current 128Mb. All this is trivial. Maybe BEESHOME should be named --db or --databse.

What I want to achieve is to have bees self-contained by itself and for each instance: so it can be run either from command line (looking for everything in /etc/bees/uuid.conf) or from a systemd or other unit without a painful way to specify parameters in /etc/systemd/system/bees@UUID.conf.d/foo.conf :)

Drop beesd, move bees to /usr/sbin/, and adjust the systemd unit file for real — currently it has lots of cruft and lacks many actuall needed things.

mjt0k commented 1 year ago

I wrote an PoC bees wrapper script which does most of the above, at http://www.corpit.ru/mjt/tmp/bees .

It is supposed to be a temporary wrapper around actual bees executable (to be put to /usr/sbin/bees, actual executable to /usr/libexec/bees), until we're satisfied with the result and can move its complete functionality into bees itself, so actual executable will be /usr/sbin/bees without any other wrappers.

What this wrapper does:

New --uuid= and --rootfs= options. One of them is required, both can be specified.
If --uuid is specified, ROOTFS, BEESHOME and OPTIONS variables are read from optional /etc/bees/$uuid.conf, and verified that command-line --rootfs and --beeshome, if given, matches those (or command line can override the conffile?).
If only --uuid is given (without --rootfs or ROOTFS in conffile), we mount root of the filesystem in /run/bees/$uuid/mount/, or in $RUNTIME_DIR/mount/ (can be used with RuntimeDirectory= in systemd.service file)
If --rootfs (or ROOTFS) is specified, we expect this is a mountpoint for the btrfs filesystem to act on, and either verify that --uuid match or read UUID from this filesystem (can also look at /etc/bees/$uuid.conf here).
New --beeshome= option (and BEESHOME conffile parameter), defaults to .beeshome subvolume in the root of the filesystem (will be created in --setup if does not exist). Can be a directory where the database files are kept.
New --setup argument, to set up the database in --beeshome. Only with --setup we create new objects or resize the database. Only the setup is performed (when --setup is specified), no daemon is run.
New argument --dbsize= to specify the size of database to create (or resize) with --setup. Accepts numbers with units too, eg 1G, 10M etc. This one can be choosen by default based on the filesystem size, instead of fixing it at 1G.
When no --setup is specified, run bees executable with the rest of the arguments (currently these are parsed by the script too, - this is to be removed with the script anyway)

So it should be possible to run this bees startup in place of the "new and improved" bees.

Maybe this script can also create a skeleton conf file with --setup too. And maybe it can enable the systemd unit too.

Variable OPTIONS is currently read from the config file (/etc/bees/$uuid.conf). I'm not sure for this one, it is probably better to have the same variables in there as the command-line options, — so it becomes THREAD_COUNT=, TIMESTAMPS or NO_TIMESTAMPS etc. Instead of the generic OPTIONS.

mjt0k commented 1 year ago

A few words about mounts in /run/bees/$UUID/mount.

It should be in separate subdirs (/run/bees/$uuid/), not like it's done now, in order to protect separate bees instances from each other and from seeing not their own filesystems (is it really important? I dunno; it might have bugs and a specially-crafted filesystem can in theory be used to trigger a bug in bees and own it?)
in order to protect the mount, /run/bees/$uuid should be mode 0700 (because the mountpoint itself gets the permissions from within the filesystem, usually 0755), so the parent dir should have restricted permissions. Protection is needed because the original filesystem might be mounted in fstab into a protected dir so that others wont be able to see it, and by mounting it into a world-accessible dir we make it world-accessible which is not good.
with protected /run/bees/$uuid/, the status file /run/bees/$uuid/status isn't world-readable anymore, which might be not good. Current scheme with separate top dir for all mounts (/run/bees/mnt/$uuid) is better in this case, so we can protect all mounts at once and still have status files readable. But it can't be used to protect different bees from each other.
Public mount is good because it makes it more obvious how the filesystem is used and why. The thing is that when you umount all other mountpoints of this fs, and try to do something with the block device, kernel will still say it is busy, and with public mount it will be visible where it is mounted.
Public mount is not good, because various monitoring tools which keeps an eye on filesystem free space and the like will be confused by another filesystem which can trigger various various actions (like running out of free space). So from this PoV, private mounts is better.

Jannik2099 commented 1 year ago

I don't see a need for bees --setup, and I would love to not have config files at all. Ideally bees would be "stateless" and the behaviour entirely dependent on the command line args, so that all the configuration resides in the systemd service / your service manager of choice. Config files should only be necessary for programs where the configuration is vast and/or the configuration syntax has actions, variables etc. instead of a simple key=value pattern (see e.g. apache, nginx, openssh)

Similarly, I do not think bees should roll it's own logic for namespaces etc. - just leave that to the operator, please (systemd isolation, firejail, containers etc.) - doing this on your own is needlessly complicated and more often than not will break in spectacular ways on weird quirks that the mentioned dedicated tools have already worked around.

In summary, just give bees a --uuid / --root option for specifying the fs to dedupe, an optional --beeshome should BEESHOME reside somewhere else, and a --dbsize that creates and resizes the db as needed.

with protected /run/bees/$uuid/, the status file /run/bees/$uuid/status isn't world-readable anymore, which might be not good.

I am not sure how much of an issue that is in practice. Just tell your monitoring script how to find the new paths?

Public mount is good ... Public mount is not good

Bees should support both and the decision be made by the operator. I think using mount namespaces is a sane default for the reference service file.

mjt0k commented 1 year ago

I don't see a need for bees --setup, and I would love to not have config files at all. Ideally bees would be "stateless" and the behaviour entirely dependent on the command line args, so that all the configuration resides in the systemd service / your service manager of choice. Config files should only be necessary for programs where the configuration is vast and/or the configuration syntax has actions, variables etc. instead of a simple key=value pattern (see e.g. apache, nginx, openssh)

Here we obviously disagree. The prob with providing settings in environment or command line — eg in a systemd unit file — already forces an operator to create a config file (in the case of systemd, it is either an ugly /etc/systemd/system/beesd@$uuid.d/override.conf, or copy of original bees@.service with extra config which is even uglier). And it makes the configuration in an unexpected place.

Why I kept /etc/bees/$uuid.conf and think this is a good idea: you can run things from command line to test and be sure bees receives the same configuration when run from command line or as a system service. Especially it should be welcome by Zygo himself, to be able to re-run stuff manually for testing. The same is true for any possible local debugging.

Please note the config file is optional, unlike the current beesd script wants.

The important config parameters are the --rootfs and --beeshome. These should be found somewhere and should match when you run it manually or as a service. Others aren't really config options (verbosity level, threads configuration) but it still should be possible to specify these somewhere (without a need to override systemd service file in an ugly way), — if we do have a config file, it's a natural place to do that.

The --setup thing is needed, in my view, and is very important to distinguish between when one want to (re)configure things or just run things. Don't create subvolumes/directories, don't resize things, — in other words, don't mess with the system in an unexpected way — unless especially asked to do that. This is a principle of least surprise. This way, when explicitly distinguishing between --setup and regular run, we know where to apply --dbsize= for one, because existing file (size) is already a configuration, we shouldn't have --dbsize= anywhere in normal operations.

Imagine you forgot to specify --beeshome or --dbsize when run the tool for debugging, it created the (hidden!) .beeshome subvolume and started from scratch (you now have to undo the damage), or it resized the db you wanted to debug. Or you typoed --beeshome and it created the new one. That's demonstration of 2 aspects: the need for explicit --setup (so it doesn't create/reconfigure stuff on normal run), and the need to be able to find configuration for this filesystem no matter how it is being run, ie, a need for a config file. With config file and BEESHOME= specified in there, all this just works. (--rootfs is not that dangerous but it is still good to be able for it to work automatically when started from comnand line).

Adding config file support is trivial (as shown in my script), it is entirely optional, it resembles the current practice, and it makes some things easier and more robust.

Similarly, I do not think bees should roll it's own logic for namespaces etc. - just leave that to the operator, please (systemd isolation, firejail, containers etc.) - doing this on your own is needlessly complicated and more often than not will break in spectacular ways on weird quirks that the mentioned dedicated tools have already worked around.

This is also a though one. For now, I haven't used any configuration or setup for the namespaces, just because I'm not sure how this will work for now.

with protected /run/bees/$uuid/, the status file /run/bees/$uuid/status isn't world-readable anymore, which might be not good.

I am not sure how much of an issue that is in practice. Just tell your monitoring script how to find the new paths?

The prob is backwards. Usually a monitoring tool will find all filesystem automatically, and will be surprised by a new filesystem appearing where it should not be.

Public mount is good ... Public mount is not good

Bees should support both and the decision be made by the operator. I think using mount namespaces is a sane default for the reference service file.

I don't yet know which default is sane in this context (private or not). But this might be thought later either way. Be it done at systemd level (or at other system startup manager level), or implemented in bees internally, or any other way to do it — this can be done at the second stage. The script I proposed will work either way.

For now though, I'd leave it without any fancy namespacing, to see how it all works out. Again, as a principle of least surprise if nothing more.

Zygo commented 1 year ago

I don't think extra complications like an additional "bees-autorun"-type package is needed, - it is trivial to run `systemctl enable bees@UUID' or somesuch — which should take care of everything.

I was thinking this would defer some of the decision requirements, following the model of cryptsetup where there's cryptsetup-bin for the tools, and cryptsetup-run for startup scripts, and cryptsetup-initramfs for initramfs integration, and there could easily be more if they were needed without having to modify the package for the upstream binaries.

bees clearly doesn't need initramfs integration and today doesn't need a separate startup package either, but that would be a logical next step at some point at some point in the future. That would allow for policy selection through the package manager or support for private vs public namespaces, or integration into someone else's mature sandboxing architecture.

Upstream has to support all of these cases, but distros can make choices--one of which is to not use systemd at all.

Zygo commented 1 year ago

Config files should only be necessary for programs where the configuration is vast and/or the configuration syntax has actions, variables etc. instead of a simple key=value pattern (see e.g. apache, nginx, openssh)

Historically bees had very few configuration options and I've been able to squeeze the crawl state into a handful of backward-compatible integers, but that era is coming to an end. https://github.com/Zygo/bees/issues/205#issuecomment-979755793 has some of my thoughts from last year on the future of bees configuration. A year later, I'm not as enthusiastic about putting a .conf in .beeshome, but the general ideas of unifying the command line with config files and being able to consume more than one still hold.

To reach the high and low ends of the scalability spectrum, there are some 20 or 30 parameters to set, including the existing constants in bees.h, plus new custom rule definitions for slicing the filesystem up into streams by work type, and allocating the streams to worker threads at different priorities. One of the most-requested features is a way to specify what not to dedupe, and that could be a page of wildcards or regular expressions for path-based selection, or rules to select extent-by-extent based on properties the bees engine calculates.

A host might have SSD and HDD filesystems, with one .conf that has common rules for all, one that has rules for SSDs, one for HDDs, then the user can assemble specific .confs for each filesystem which refer to these. Other packages might conceivably want to throw in a /etc/bees/conf.d/package.conf to add exclude rules.

--setup is a good start, but there's 3 distinct cases for persistent state in beeshome:

There's no beeshome. We need permission to create one based on the current configuration (which is defined partly on the command line, partly embedded in the bees binary, and partly in a file supplied by the distro).
beeshome exists, but its persistent state doesn't match the current configuration. That's a semantic conflict we have to resolve somehow: stop and complain to the user to fix the discrepancy, ignore the current configuration and keep using the old one, or do a conversion between the current beeshome state (e.g. hash table layout) and what the new configuration says it should be.
beeshome exists, and it matches the current configuration, so we can go ahead and run.

So we'd need --setup to get out of case 1, and also something like --upgrade to get out of case 2.

Case 2 also comes up if the bees defaults change, e.g. switching to a better scan mode, when the user doesn't choose one explicitly. Currently we're stuck on one hash function because we haven't picked a way to coordinate with existing users to change to a better one.

Zygo commented 1 year ago

It should be in separate subdirs (/run/bees/$uuid/), not like it's done now, in order to protect separate bees instances from each other and from seeing not their own filesystems (is it really important? I dunno; it might have bugs and a specially-crafted filesystem can in theory be used to trigger a bug in bees and own it?)

bees does a lot here:

opening everything on the target filesystem O_RDONLY for dedupe or O_TMPFILE for extent split copies
verifying the expected subvol and inode on the fd match expected values after opening
using openat() and O_NOFOLLOW to reduce symlink attack surface

but bees could do more because there are two attacks left:

bees can still be tricked into briefly opening files outside of the filesystem, which is a problem when the open and close calls have side-effects all by themselves, like rewinding tape devices and injecting EOFs into named FIFOs. It looks like this could be fixed by using O_PATH to fully separate the "resolve the path to a specific inode object" and "turn the inode object into an open file descriptor" phases into two system calls, and then we'd have bees do the subvol+inode verification between those two calls.
long directory paths are still traversed with symlinks enabled, because O_NOFOLLOW doesn't turn those off. This can be exploited to exfiltrate private path names from the host by using a symlink near the beginning of the path to force traversal though a NFS or CIFS filesystem where it could be observed by an attacker. I don't know of any practical way to exploit that (especially not one that doesn't assume the attacker already has sufficient privilege to attack the host in much more straightforward ways) but it's there.

Both issues can be prevented at the expense of more complexity and runtime cost in bees, but both issues can also be prevented from outside by dropping bees into an empty namespace where it can only reach the target filesystem, $BEESHOME, the C++ runtime, and a carefully curated subset of /proc.

in order to protect the mount, /run/bees/$uuid should be mode 0700 (because the mountpoint itself gets the permissions from within the filesystem, usually 0755), so the parent dir should have restricted permissions. Protection is needed because the original filesystem might be mounted in fstab into a protected dir so that others wont be able to see it, and by mounting it into a world-accessible dir we make it world-accessible which is not good.

This causes df to complain about denied permission if the mount isn't private (the path still shows up in /proc/mounts). That can break monitoring tools (though maybe not the mature/popular ones?).

with protected /run/bees/$uuid/, the status file /run/bees/$uuid/status isn't world-readable anymore, which might be not good. Current scheme with separate top dir for all mounts (/run/bees/mnt/$uuid) is better in this case, so we can protect all mounts at once and still have status files readable. But it can't be used to protect different bees from each other.

I'm a fan of putting status in a directory (i.e. /run/bees/status/$uuid/status). There's some currently hidden internal state that is worth exporting for debugging (like the Task dependency tree and split extent work queues) and it shouldn't all go into one giant file. Even the live stats and thread list should be two separate files (/run/bees/status/$uuid/stats, /run/bees/status/$uuid/threads).

Zygo commented 1 year ago

I haven't really looked at the namespace API to see how hard it would be for bees to sandbox itself. It doesn't seem that bad:

open the target fs root FD
open $BEESHOME FD
create an anonymous unshared private mount namespace
mount empty ramfs on /
create /proc
mount /proc

...end of list? bees uses openat and we only need the FDs at step 1 and 2 to access the filesystem. Once bees has opened the target filesystem's root, bees doesn't need anything to be mounted in its chroot except /proc.

If bees ever needed to access btrfs /dev nodes, we could open them all at startup and keep them as long as bees runs. That's a long way in the future, though. A restart would be needed to add new devices, or we keep one worker thread in the parent namespace so it can open new devices when they show up.

If the user didn't give us the subvol root FD then it's a little harder:

open some FD on the target fs
open $BEESHOME FD
create an anonymous private mount namespace
mount empty ramfs on /
create /proc
mount /proc
create /fsroot
bind-mount the FD at 1 to /fsroot
look at /proc/mounts or /proc/self/mountinfo to get the flags for /fsroot
umount /fsroot
mount the filesystem with the mount options, subvol=/, on /fsroot
open FD for /fsroot
make sure it's the same filesystem UUID
close FD from step 1
umount /fsroot (lazy)
run dedupe with the FD from /fsroot as root, and the FD from $BEESHOME in step 2

...or something like that. Trying to avoid a findmnt dependency.

Jannik2099 commented 1 year ago

I haven't really looked at the namespace API to see how hard it would be for bees to sandbox itself. It doesn't seem that bad:

Please don't.

If a sysadmin wants to isolate a program, they use any of the mentioned established tooling, and it works and behaves the same for just about any program. Having bees roll its own crap adds no value, much confusion, and even more incompatibilities and broken edge cases. It's similar spirit to writing your own build system.

What would this add over just using established methods like with pretty much any other software?

Zygo commented 1 year ago

External methods at best will still leave bees with access to unnecessary things while it's running, because they can't remove access to bees (or whatever the C++ runtime needs at startup, like zoneinfo and /sys/devices) itself.

The main reason bees doesn't simply chroot itself today and close all the outstanding security problems is because there's no reliable empty path it could chroot to and still have /proc available for libc (well, that, and $BEESSTATUS, but I could fix that in a few minutes). An anonymous private mount namespace solves that problem: bees can create its own empty directory, it gets restricted access to /proc, and the kernel will automatically clean it all up after us. Conveniently it also removes the need for the --strip-paths option--inside the sandbox, the paths start from the first visible directory, the root of the filesystem.

Internal sandboxing is a supplement to external sandboxing, not a replacement. You could start up bees in a namespace that contains only the bees binary and the target filesystem. The internal sandboxing would start from there, and drop access to the bees binary once it's running. External sandboxing can't remove the privileges that allow internal sandboxing to work, because they're also needed for the btrfs ioctls. If the btrfs ioctl privileges were lowered then maybe this would become a problem.

There's no impact on ability to umount the filesystem. It turns out that even if you lazy-umount the filesystem, bees will happily keep running on it, because it does everything through openat and the root FD. Only the paths in the log messages get shorter (because they aren't prefixed by the mount point any more). If you non-lazy-umount the filesystem, the umount fails because bees has open FDs. No amount of sandboxing, including zero, changes either effect. When you umount a filesystem, you somehow have to know bees is running on it, and terminate bees. Until you do, the umount won't happen.

If the sandboxing fails, nothing else changes--bees can still work, it will just work with more access than it needs. That's roughly equivalent to the existing call to mlockall in terms of risk to bees. I suppose we might want an option to have bees halt if its internal sandbox fails, for users who use only the internal sandbox, or the (probably very small) intersection of users who are paranoid about potential symlink attacks but not paranoid about dedupe-related infoleaks. Definitely there also needs to be an 'off' switch for debugging, as valgrind really hates being chrooted.

There is still the possibility of new external incompatibilities, like systemd flipping the default kernel mount subtree policy from private to shared. That kind of change is pretty disastrous--it broke our 10-year-old software even when it did use the established methods, because systemd unilaterally decided to ignore the established methods. Nothing we can do about systemd except run CI testing and adapt to the new broken.

There is a bug in the plan I have above--the new namespace still has all its original mount points underneath, they're just no longer visible. That would kill the whole idea if there isn't an easy solution for it.

Jannik2099 commented 1 year ago

Here we obviously disagree. The prob with providing settings in environment or command line — eg in a systemd unit file — already forces an operator to create a config file (in the case of systemd, it is either an ugly /etc/systemd/system/beesd@$uuid.d/override.conf, or copy of original bees@.service with extra config which is even uglier). And it makes the configuration in an unexpected place.

I have reconsidered and changed my opinion, I don't think there's any problem with config files as long as using them is not necessary for basic usage. Exposing the config options from bees.h in a config file would be very welcome though.

I don't mind --setup anymore, it just means people who want bees to "just work" will have to add an ExecStartPre=bees --setup to their stuff

mjt0k commented 1 year ago

I don't mind --setup anymore, it just means people who want bees to "just work" will have to add an ExecStartPre=bees --setup to their stuff

Or it can be bees --setup --systemd instead (with --systemd bees can run systemctl enable bees@$uuid.service) :)

Jannik2099 commented 1 year ago

programs invoking systemctl is a hard no, not just because it's kinda the wrong way around but also because it would violate any sensible LSM profile

Jannik2099 commented 1 year ago

Noting down some insights from #btrfs:

If bees --setup requires significantly more privileges than the regular bees operation, it may be beneficial to instead have a seperate bees-setup binary so that LSMs can apply a different, more permissive profile to it instead of bloating the bees permission set.

mjt0k commented 1 year ago

If bees --setup requires significantly more privileges than the regular bees operation, it may be beneficial to instead have a seperate bees-setup binary so that LSMs can apply a different, more permissive profile to it instead of bloating the bees permission set.

Yes, this is definitely a valid point. So guess we'll end up with a separate bees-setup script (it doesn't need to be a binary, a shell script will do just fine)

Zygo commented 1 year ago

A few notes on the script:

the --setup parts should be moved to their own script as discussed above
the crawl data file is beescrawl.dat. Nobody has a beescrawl.$uuid.dat any more, that feature was killed in 2016

A few notes for me:

it might be a good idea to have bees call flock on the hash table to prevent concurrent execution. When it happens accidentally, bees can run concurrently with itself for quite a long time before anyone notices! Setup scripts can use flock(1) to avoid blowing up a running daemon.

mjt0k commented 1 year ago

So we're actually back to current beesd startup script and /usr/libexec/bees executable (I think the proper place for it is/usr/libexec/bees). This makes good sense.

I'm not sure we need two scripts (one for setup and another for startup), one can do it, because many functions in there are the same (finding the root filesystem, checking it is actually btrfs, etc). The only thing I dislike in the script I wrote in this context is the need to repeat all options for getopt. Maybe /usr/libexec/bees can be asked to provide all options it recognizes for usage with getopt(1), — this way there will be no need to repeat all the options in beesd too. But this is not mandatory, it's not that it has a ton of options which changes twice a day.

Thaodan commented 1 year ago

Regarding sandboxing: Would more systemd sandboxing be middle ground compared to a full container?

About starting bees for every fs automatically: I don't think that is a good option since the fs might be io starved, e.g. starting bees on spinning drive that has big chunks of data to go through.