Zygo opened 1 year ago
Thank you for the write-up. I think it is actually much simpler than this, or might be to begin with.
I completely agree it should NOT be install-and-it-all-starts-deduping. That is wrong in my view; users should configure each bees instance separately. No, I don't think extra complications like an additional "bees-autorun"-type package are needed, - it is trivial to run `systemctl enable bees@UUID` or somesuch — which should take care of everything.
Here is how I think it should look:
`bees --configure --db-size=foo ... $UUID`

— it will create the systemd service instance if it does not exist already, create the bees database of the given size (or change the size if it already exists), and record other options into `/etc/bees/$UUID.conf`.

The plan:

- implement the necessary options (`--home`, `--root`, defaulting to `/run/bees/$UUID`);
- implement `--configure` to create `.beeshome` (or `--home`) and set its size (later, to notify a running daemon, or just refuse to operate if it is running) — it need not enable the systemd service, that can be done explicitly;
- implement reading `/etc/bees/$UUID.conf` for ROOT/HOME/VERBOSE/etc stuff (and write a skeleton file at `--configure` time);
- choose some sane db size for `--configure` instead of the current 128MB.

All this is trivial. Maybe BEESHOME should be named `--db` or `--database`.
What I want to achieve is to have bees self-contained for each instance: it can be run either from the command line (looking for everything in `/etc/bees/uuid.conf`) or from a systemd or other unit, without having to specify parameters painfully in `/etc/systemd/system/bees@UUID.conf.d/foo.conf` :)
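A self-contained instance along these lines could plausibly be a systemd template unit like the sketch below. This is illustrative only: the `EnvironmentFile` location and the `--uuid` option are from this proposal, not from current bees.

```ini
# /etc/systemd/system/bees@.service -- hypothetical sketch
[Unit]
Description=bees dedupe agent for filesystem %i

[Service]
# Optional per-instance settings; "-" makes a missing file a non-error.
EnvironmentFile=-/etc/bees/%i.conf
ExecStart=/usr/sbin/bees --uuid=%i

[Install]
WantedBy=multi-user.target
```

With this, `systemctl enable --now bees@$UUID` would pick up everything from `/etc/bees/$UUID.conf`, and running the same command line by hand would read the same file.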
Drop beesd, move bees to /usr/sbin/, and adjust the systemd unit file for real — currently it has lots of cruft and lacks many actually needed things.
I wrote a PoC bees wrapper script which does most of the above, at http://www.corpit.ru/mjt/tmp/bees .
It is supposed to be a temporary wrapper around the actual bees executable (the wrapper goes to /usr/sbin/bees, the actual executable to /usr/libexec/bees), until we're satisfied with the result and can move its complete functionality into bees itself, so the actual executable becomes /usr/sbin/bees without any wrappers.
What this wrapper does:

- It accepts `--uuid=` and `--rootfs=` options. One of them is required; both can be specified.
- When `--uuid` is specified, the `ROOTFS`, `BEESHOME` and `OPTIONS` variables are read from the optional `/etc/bees/$uuid.conf`, and it is verified that command-line `--rootfs` and `--beeshome`, if given, match them (or should the command line override the conffile?).
- When only `--uuid` is given (without `--rootfs` or `ROOTFS` in the conffile), we mount the root of the filesystem in `/run/bees/$uuid/mount/`, or in `$RUNTIME_DIR/mount/` (which can be used with `RuntimeDirectory=` in a systemd.service file).
- When `--rootfs` (or `ROOTFS`) is specified, we expect it to be a mountpoint for the btrfs filesystem to act on, and either verify that `--uuid` matches or read the UUID from this filesystem (we can also look at `/etc/bees/$uuid.conf` here).
- There is a `--beeshome=` option (and a `BEESHOME` conffile parameter); it defaults to the `.beeshome` subvolume in the root of the filesystem (created by `--setup` if it does not exist). It can be a directory where the database files are kept.
- There is a `--setup` argument, to set up the database in `--beeshome`. Only with `--setup` do we create new objects or resize the database. When `--setup` is specified, only the setup is performed; no daemon is run.
- `--dbsize=` specifies the size of the database to create (or resize) with `--setup`. It accepts numbers with units too, e.g. 1G, 10M etc. The default could be chosen based on the filesystem size, instead of fixing it at 1G.
- Unless `--setup` is specified, the wrapper runs the bees executable with the rest of the arguments (currently these are parsed by the script too, - this is to be removed with the script anyway).

So it should be possible to run this bees startup in place of the "new and improved" bees.

Maybe this script can also create a skeleton conf file with `--setup` too. And maybe it can enable the systemd unit too.
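The unit-suffix handling for `--dbsize` mentioned above could be sketched like this (`to_bytes` is an illustrative helper name, not part of bees or the wrapper):

```shell
# Convert a --dbsize argument like 1G or 10M to a byte count (sketch).
to_bytes() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *K) echo $(( ${1%K} * 1024 )) ;;
        *)  echo "$1" ;;   # plain byte count, no unit suffix
    esac
}
```

So `to_bytes 1G` yields 1073741824, and `to_bytes 10M` yields 10485760.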
The `OPTIONS` variable is currently read from the config file (`/etc/bees/$uuid.conf`). I'm not sure about this one; it is probably better to have the same variables in there as the command-line options, — so it becomes `THREAD_COUNT=`, `TIMESTAMPS` or `NO_TIMESTAMPS` etc., instead of the generic `OPTIONS`.
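Reading such per-option variables could be as simple as sourcing the optional file (a sketch; `BEES_CONF_DIR` and the variable names are illustrative, not actual bees behavior):

```shell
# Load optional per-instance key=value settings, e.g. THREAD_COUNT=4.
read_conf() {
    conf="${BEES_CONF_DIR:-/etc/bees}/$1.conf"
    # The conf file is optional; defaults apply when it is absent.
    [ -r "$conf" ] && . "$conf"
    return 0
}
```

The same file then works whether bees is started by hand or from a service unit.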
A few words about mounts in `/run/bees/$UUID/mount`:

- The mounts should be in separate subdirs (`/run/bees/$uuid/`), not like it's done now, in order to protect separate bees instances from each other and from seeing filesystems that are not their own (is it really important? I dunno; bees might have bugs, and a specially-crafted filesystem can in theory be used to trigger a bug in bees and own it?)
- In order to protect the mount, `/run/bees/$uuid` should be mode 0700 (because the mountpoint itself gets its permissions from within the filesystem, usually 0755), so the parent dir should have restricted permissions. Protection is needed because the original filesystem might be mounted in fstab into a protected dir so that others won't be able to see it, and by mounting it into a world-accessible dir we make it world-accessible, which is not good.
- With a protected `/run/bees/$uuid/`, the status file `/run/bees/$uuid/status` isn't world-readable anymore, which might be not good. The current scheme with a separate top dir for all mounts (`/run/bees/mnt/$uuid`) is better in this case, so we can protect all mounts at once and still have the status files readable. But it can't be used to protect different bees instances from each other.

I don't see a need for `bees --setup`
, and I would love to not have config files at all. Ideally bees would be "stateless" and the behaviour entirely dependent on the command-line args, so that all the configuration resides in the systemd service / your service manager of choice. Config files should only be necessary for programs where the configuration is vast and/or the configuration syntax has actions, variables etc. instead of a simple key=value pattern (see e.g. apache, nginx, openssh).
Similarly, I do not think bees should roll its own logic for namespaces etc. - just leave that to the operator, please (systemd isolation, firejail, containers etc.) - doing this on your own is needlessly complicated and more often than not will break in spectacular ways on weird quirks that the mentioned dedicated tools have already worked around.
In summary, just give bees a `--uuid` / `--root` option for specifying the fs to dedupe, an optional `--beeshome` should BEESHOME reside somewhere else, and a `--dbsize` that creates and resizes the db as needed.
> with protected /run/bees/$uuid/, the status file /run/bees/$uuid/status isn't world-readable anymore, which might be not good.
I am not sure how much of an issue that is in practice. Just tell your monitoring script how to find the new paths?
> Public mount is good ... Public mount is not good
Bees should support both, and the decision should be made by the operator. I think using mount namespaces is a sane default for the reference service file.
> I don't see a need for `bees --setup`, and I would love to not have config files at all. Ideally bees would be "stateless" and the behaviour entirely dependent on the command-line args, so that all the configuration resides in the systemd service / your service manager of choice. Config files should only be necessary for programs where the configuration is vast and/or the configuration syntax has actions, variables etc. instead of a simple key=value pattern (see e.g. apache, nginx, openssh).
Here we obviously disagree. Providing settings in the environment or on the command line — e.g. in a systemd unit file — already forces an operator to create a config file (in the case of systemd, it is either an ugly `/etc/systemd/system/beesd@$uuid.d/override.conf`, or a copy of the original bees@.service with extra config, which is even uglier). And it puts the configuration in an unexpected place.
Why I kept `/etc/bees/$uuid.conf` and think this is a good idea: you can run things from the command line to test, and be sure bees receives the same configuration whether run from the command line or as a system service. This should especially be welcomed by Zygo himself, to be able to re-run stuff manually for testing. The same is true for any possible local debugging.
Please note the config file is optional, unlike what the current beesd script requires.
The important config parameters are `--rootfs` and `--beeshome`. These should be found somewhere, and should match whether you run bees manually or as a service. The others aren't really config options (verbosity level, thread configuration), but it should still be possible to specify them somewhere (without a need to override the systemd service file in an ugly way), — if we do have a config file, it's the natural place for that.
The `--setup` thing is needed, in my view, and it is very important to distinguish between (re)configuring things and just running things. Don't create subvolumes/directories, don't resize things, — in other words, don't mess with the system in an unexpected way — unless explicitly asked to do that. This is the principle of least surprise. When we explicitly distinguish between `--setup` and a regular run, we also know where to apply `--dbsize=`, for one: an existing file (its size) is already configuration, so we shouldn't need `--dbsize=` anywhere in normal operation.
Imagine you forgot to specify `--beeshome` or `--dbsize` when running the tool for debugging, and it created the (hidden!) `.beeshome` subvolume and started from scratch (you now have to undo the damage), or it resized the db you wanted to debug. Or you typoed `--beeshome` and it created a new one. That demonstrates two things: the need for an explicit `--setup` (so bees doesn't create/reconfigure stuff on a normal run), and the need to be able to find the configuration for a filesystem no matter how bees is being run, i.e. a need for a config file. With a config file and `BEESHOME=` specified in there, all this just works. (`--rootfs` is not that dangerous, but it is still good for it to work automatically when started from the command line.)
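The least-surprise rule argued for here might be sketched as a guard that refuses to create anything outside of setup mode (`check_beeshome` is an illustrative name; current bees does not behave this way):

```shell
# Refuse to create .beeshome implicitly; only --setup may create it.
check_beeshome() {
    home="$1"; setup="$2"   # setup is "yes" when --setup was given
    if [ ! -d "$home" ]; then
        if [ "$setup" = yes ]; then
            mkdir -p "$home"
        else
            echo "error: $home does not exist; run --setup first" >&2
            return 1
        fi
    fi
}
```

A typoed `--beeshome` then produces an error instead of a freshly-created database in the wrong place.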
Adding config file support is trivial (as shown in my script), it is entirely optional, it resembles the current practice, and it makes some things easier and more robust.
> Similarly, I do not think bees should roll its own logic for namespaces etc. - just leave that to the operator, please (systemd isolation, firejail, containers etc.) - doing this on your own is needlessly complicated and more often than not will break in spectacular ways on weird quirks that the mentioned dedicated tools have already worked around.
This is also a tough one. For now, I haven't used any configuration or setup for the namespaces, just because I'm not sure yet how this will work.
> > with protected /run/bees/$uuid/, the status file /run/bees/$uuid/status isn't world-readable anymore, which might be not good.
>
> I am not sure how much of an issue that is in practice. Just tell your monitoring script how to find the new paths?
The problem is backwards: usually a monitoring tool will find all filesystems automatically, and it will be surprised by a new filesystem appearing where it should not be.
> > Public mount is good ... Public mount is not good
>
> Bees should support both, and the decision should be made by the operator. I think using mount namespaces is a sane default for the reference service file.
I don't yet know which default is sane in this context (private or not). But this can be thought through later either way. Whether it's done at the systemd level (or at another startup manager's level), implemented in bees internally, or done any other way — this can be decided at the second stage. The script I proposed will work either way.
For now, though, I'd leave it without any fancy namespacing, to see how it all works out. Again, the principle of least surprise, if nothing more.
> I don't think extra complications like an additional "bees-autorun"-type package are needed, - it is trivial to run `systemctl enable bees@UUID` or somesuch — which should take care of everything.
I was thinking this would defer some of the decision requirements, following the model of cryptsetup, where there's cryptsetup-bin for the tools, cryptsetup-run for startup scripts, and cryptsetup-initramfs for initramfs integration, and there could easily be more if they were needed, without having to modify the package for the upstream binaries.
bees clearly doesn't need initramfs integration, and today it doesn't need a separate startup package either, but that would be a logical next step at some point in the future. It would allow for policy selection through the package manager, support for private vs public namespaces, or integration into someone else's mature sandboxing architecture.
Upstream has to support all of these cases, but distros can make choices--one of which is to not use systemd at all.
> Config files should only be necessary for programs where the configuration is vast and/or the configuration syntax has actions, variables etc. instead of a simple key=value pattern (see e.g. apache, nginx, openssh)
Historically bees had very few configuration options and I've been able to squeeze the crawl state into a handful of backward-compatible integers, but that era is coming to an end. https://github.com/Zygo/bees/issues/205#issuecomment-979755793 has some of my thoughts from last year on the future of bees configuration. A year later, I'm not as enthusiastic about putting a .conf in `.beeshome`, but the general ideas of unifying the command line with config files and being able to consume more than one still hold.
To reach the high and low ends of the scalability spectrum, there are some 20 or 30 parameters to set, including the existing constants in `bees.h`, plus new custom rule definitions for slicing the filesystem up into streams by work type, and allocating the streams to worker threads at different priorities. One of the most-requested features is a way to specify what not to dedupe, and that could be a page of wildcards or regular expressions for path-based selection, or rules to select extent-by-extent based on properties the bees engine calculates.
A host might have SSD and HDD filesystems, with one .conf that has common rules for all, one that has rules for SSDs, one for HDDs, then the user can assemble specific .confs for each filesystem which refer to these. Other packages might conceivably want to throw in a `/etc/bees/conf.d/package.conf` to add exclude rules.
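Layered configs like that could be consumed in order, with later files overriding earlier ones. This is a hypothetical sketch; bees has no such mechanism today, and the file names are made up:

```shell
# Source shared rules first, then hardware-class rules, then the
# per-filesystem file, so the most specific settings win.
load_layered_conf() {
    dir="$1"; uuid="$2"
    for conf in "$dir/common.conf" "$dir/ssd.conf" "$dir/$uuid.conf"; do
        [ -r "$conf" ] && . "$conf"
    done
    return 0
}
```

Missing layers are simply skipped, so a host with only HDDs needs no ssd.conf.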
`--setup` is a good start, but there are 3 distinct cases for persistent state in beeshome:

1. There is no beeshome. We need permission to create one based on the current configuration (which is defined partly on the command line, partly embedded in the bees binary, and partly in a file supplied by the distro).
2. beeshome exists, but its persistent state doesn't match the current configuration. That's a semantic conflict we have to resolve somehow: stop and complain to the user to fix the discrepancy, ignore the current configuration and keep using the old one, or do a conversion between the current beeshome state (e.g. hash table layout) and what the new configuration says it should be.
3. beeshome exists, and it matches the current configuration, so we can go ahead and run.

So we'd need `--setup` to get out of case 1, and also something like `--upgrade` to get out of case 2.

Case 2 also comes up if the bees defaults change, e.g. switching to a better scan mode when the user doesn't choose one explicitly. Currently we're stuck on one hash function because we haven't picked a way to coordinate with existing users to change to a better one.
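The three cases could be distinguished mechanically, e.g. by comparing the hash table's on-disk size with the configured size. This is a sketch: `beeshash.dat` is the real hash table file name, but `beeshome_state` and the size-only comparison are illustrative, not how bees decides anything today.

```shell
# Classify beeshome state: case 1 "setup-needed", case 2
# "upgrade-needed", case 3 "ok" to run.
beeshome_state() {
    home="$1"; want_bytes="$2"
    if [ ! -f "$home/beeshash.dat" ]; then
        echo setup-needed
    elif [ "$(wc -c < "$home/beeshash.dat")" -ne "$want_bytes" ]; then
        echo upgrade-needed
    else
        echo ok
    fi
}
```

A real check would also have to look at layout/version metadata, not just the file size.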
> - The mounts should be in separate subdirs (/run/bees/$uuid/), not like it's done now, in order to protect separate bees instances from each other and from seeing filesystems that are not their own (is it really important? I dunno; bees might have bugs, and a specially-crafted filesystem can in theory be used to trigger a bug in bees and own it?)
bees does a lot here:

- files are opened `O_RDONLY` for dedupe, or `O_TMPFILE` for extent split copies
- it uses `openat()` and `O_NOFOLLOW` to reduce the symlink attack surface

but bees could do more, because there are two attacks left:

- `open` and `close` calls have side-effects all by themselves, like rewinding tape devices and injecting EOFs into named FIFOs. It looks like this could be fixed by using `O_PATH` to fully separate the "resolve the path to a specific inode object" and "turn the inode object into an open file descriptor" phases into two system calls, and then we'd have bees do the subvol+inode verification between those two calls.
- symlinks in the middle of a path are still followed during path resolution; `O_NOFOLLOW` doesn't turn those off. This can be exploited to exfiltrate private path names from the host by using a symlink near the beginning of the path to force traversal through a NFS or CIFS filesystem where it could be observed by an attacker. I don't know of any practical way to exploit that (especially not one that doesn't assume the attacker already has sufficient privilege to attack the host in much more straightforward ways) but it's there.

Both issues can be prevented at the expense of more complexity and runtime cost in bees, but both issues can also be prevented from outside by dropping bees into an empty namespace where it can only reach the target filesystem, `$BEESHOME`, the C++ runtime, and a carefully curated subset of `/proc`.
> - In order to protect the mount, /run/bees/$uuid should be mode 0700 (because the mountpoint itself gets its permissions from within the filesystem, usually 0755), so the parent dir should have restricted permissions. Protection is needed because the original filesystem might be mounted in fstab into a protected dir so that others won't be able to see it, and by mounting it into a world-accessible dir we make it world-accessible, which is not good.
This causes `df` to complain about denied permission if the mount isn't private (the path still shows up in `/proc/mounts`). That can break monitoring tools (though maybe not the mature/popular ones?).
> - With a protected /run/bees/$uuid/, the status file /run/bees/$uuid/status isn't world-readable anymore, which might be not good. The current scheme with a separate top dir for all mounts (/run/bees/mnt/$uuid) is better in this case, so we can protect all mounts at once and still have the status files readable. But it can't be used to protect different bees instances from each other.
I'm a fan of putting status in a directory (i.e. `/run/bees/status/$uuid/status`). There's some currently hidden internal state that is worth exporting for debugging (like the Task dependency tree and split extent work queues) and it shouldn't all go into one giant file. Even the live stats and thread list should be two separate files (`/run/bees/status/$uuid/stats`, `/run/bees/status/$uuid/threads`).
I haven't really looked at the namespace API to see how hard it would be for bees to sandbox itself. It doesn't seem that bad:

1. open the `$BEESHOME` FD
2. open the target filesystem's root FD (`/`)
3. create a private mount namespace and chroot into an empty directory, losing the old `/proc`
4. mount a fresh `/proc`

...end of list? bees uses `openat` and we only need the FDs at steps 1 and 2 to access the filesystem. Once bees has opened the target filesystem's root, bees doesn't need anything to be mounted in its chroot except `/proc`.
If bees ever needed to access btrfs `/dev` nodes, we could open them all at startup and keep them as long as bees runs. That's a long way in the future, though. A restart would be needed to add new devices, or we keep one worker thread in the parent namespace so it can open new devices when they show up.
If the user didn't give us the subvol root FD then it's a little harder:

1. open an FD for `/`
2. open the `$BEESHOME` FD
3. create a private mount namespace and mount a fresh `/proc`
4. make a `/fsroot` directory
5. read `/proc/mounts` or `/proc/self/mountinfo` to get the flags for the target filesystem
6. mount the filesystem's root subvol on `/fsroot`
7. chroot with `/fsroot` as root, keeping the FD from `$BEESHOME` in step 2

...or something like that. Trying to avoid a `findmnt` dependency.
> I haven't really looked at the namespace API to see how hard it would be for bees to sandbox itself. It doesn't seem that bad:
Please don't.
If a sysadmin wants to isolate a program, they use any of the mentioned established tooling, and it works and behaves the same for just about any program. Having bees roll its own crap adds no value, much confusion, and even more incompatibilities and broken edge cases. It's a similar spirit to writing your own build system.
What would this add over just using established methods like with pretty much any other software?
External methods at best will still leave bees with access to unnecessary things while it's running, because they can't remove access to bees itself (or whatever the C++ runtime needs at startup, like `zoneinfo` and `/sys/devices`).
The main reason bees doesn't simply `chroot` itself today and close all the outstanding security problems is because there's no reliable empty path it could `chroot` to and still have `/proc` available for libc (well, that, and `$BEESSTATUS`, but I could fix that in a few minutes). An anonymous private mount namespace solves that problem: bees can create its own empty directory, it gets restricted access to `/proc`, and the kernel will automatically clean it all up after us. Conveniently it also removes the need for the `--strip-paths` option--inside the sandbox, the paths start from the first visible directory, the root of the filesystem.
Internal sandboxing is a supplement to external sandboxing, not a replacement. You could start up bees in a namespace that contains only the bees binary and the target filesystem. The internal sandboxing would start from there, and drop access to the bees binary once it's running. External sandboxing can't remove the privileges that allow internal sandboxing to work, because they're also needed for the btrfs ioctls. If the btrfs ioctl privileges were lowered then maybe this would become a problem.
There's no impact on the ability to umount the filesystem. It turns out that even if you lazy-umount the filesystem, bees will happily keep running on it, because it does everything through `openat` and the root FD. Only the paths in the log messages get shorter (because they aren't prefixed by the mount point any more). If you non-lazy-umount the filesystem, the umount fails because bees has open FDs. No amount of sandboxing, including zero, changes either effect. When you umount a filesystem, you somehow have to know bees is running on it, and terminate bees. Until you do, the umount won't happen.
If the sandboxing fails, nothing else changes--bees can still work, it will just work with more access than it needs. That's roughly equivalent to the existing call to `mlockall` in terms of risk to bees. I suppose we might want an option to have bees halt if its internal sandbox fails, for users who use only the internal sandbox, or the (probably very small) intersection of users who are paranoid about potential symlink attacks but not paranoid about dedupe-related infoleaks. Definitely there also needs to be an 'off' switch for debugging, as valgrind really hates being chrooted.
There is still the possibility of new external incompatibilities, like systemd flipping the default kernel mount subtree policy from private to shared. That kind of change is pretty disastrous--it broke our 10-year-old software even when it did use the established methods, because systemd unilaterally decided to ignore the established methods. Nothing we can do about systemd except run CI testing and adapt to the new broken.
There is a bug in the plan I have above--the new namespace still has all its original mount points underneath, they're just no longer visible. That would kill the whole idea if there isn't an easy solution for it.
> Here we obviously disagree. Providing settings in the environment or on the command line — e.g. in a systemd unit file — already forces an operator to create a config file (in the case of systemd, it is either an ugly /etc/systemd/system/beesd@$uuid.d/override.conf, or a copy of the original bees@.service with extra config, which is even uglier). And it puts the configuration in an unexpected place.
I have reconsidered and changed my opinion; I don't think there's any problem with config files as long as using them is not necessary for basic usage. Exposing the config options from `bees.h` in a config file would be very welcome though.
I don't mind `--setup` anymore; it just means people who want bees to "just work" will have to add an `ExecStartPre=bees --setup` to their stuff.
> I don't mind `--setup` anymore; it just means people who want bees to "just work" will have to add an `ExecStartPre=bees --setup` to their stuff.
Or it can be `bees --setup --systemd` instead (with `--systemd`, bees can run `systemctl enable bees@$uuid.service`) :)
Programs invoking `systemctl` is a hard no, not just because it's kinda the wrong way around, but also because it would violate any sensible LSM profile.
Noting down some insights from #btrfs:

If `bees --setup` requires significantly more privileges than regular `bees` operation, it may be beneficial to instead have a separate `bees-setup` binary, so that LSMs can apply a different, more permissive profile to it instead of bloating the `bees` permission set.
> If `bees --setup` requires significantly more privileges than regular `bees` operation, it may be beneficial to instead have a separate `bees-setup` binary, so that LSMs can apply a different, more permissive profile to it instead of bloating the `bees` permission set.
Yes, this is definitely a valid point. So I guess we'll end up with a separate bees-setup script (it doesn't need to be a binary; a shell script will do just fine).
A few notes on the script:

- the database file is `beescrawl.dat`. Nobody has a `beescrawl.$uuid.dat` any more; that feature was killed in 2016.

A few notes for me:

- take a `flock` on the hash table to prevent concurrent execution. When it happens accidentally, bees can run concurrently with itself for quite a long time before anyone notices! Setup scripts can use `flock(1)` to avoid blowing up a running daemon.

So we're actually back to the current beesd startup script and a /usr/libexec/bees executable (I think the proper place for it is `/usr/libexec/bees`). This makes good sense.
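The flock(1) idea from the notes above could look like this (`with_hash_lock` is an illustrative wrapper name; bees itself does not take this lock today):

```shell
# Run a command while holding an exclusive lock on the hash table file,
# failing fast instead of racing a running daemon.
with_hash_lock() {
    hashfile="$1"; shift
    flock --nonblock "$hashfile" "$@"
}
```

A setup script would wrap its create/resize steps in the lock, and the daemon would take the same lock for its whole lifetime, so a concurrent `--setup` fails immediately instead of corrupting the running instance's state.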
I'm not sure we need two scripts (one for setup and another for startup); one can do it, because many functions in there are the same (finding the root filesystem, checking it is actually btrfs, etc). The only thing I dislike in the script I wrote in this context is the need to repeat all options for getopt. Maybe `/usr/libexec/bees` can be asked to provide all the options it recognizes for usage with getopt(1), — this way there would be no need to repeat all the options in `beesd` too. But this is not mandatory; it's not that it has a ton of options which change twice a day.
Regarding sandboxing: would more systemd sandboxing be a middle ground compared to a full container?
About starting bees for every fs automatically: I don't think that is a good option, since the fs might be I/O-starved, e.g. when starting bees on a spinning drive that has big chunks of data to go through.
This is collecting a few notes together on how bees should work, as opposed to how it currently does work.
## How does it work now?
Currently there are some scripts which make key assumptions:
but there are problems:
## How should it work?

### Users should opt in to running bees
Merely installing the package on a general-use distro should not cause bees to run without user opt-in. Opt-in could be a question asked during the install, or using the distro's automation infrastructure to provide an answer, or deriving the answer from optional metapackage dependencies. e.g. the `bees-autorun` package might depend on the `bees` package, where `bees-autorun` contains a default startup script and `bees` contains the bees binary, and users opt into running bees by default at startup by installing `bees-autorun`.

If the distro is specialized (e.g. designed for NAS builds) then the opt-in could be implied by the choice to use the specialized distro.
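The metapackage split described above might look roughly like this (a Debian-control-style sketch; the package names come from the text, the fields are abbreviated and illustrative):

```
Package: bees
Description: bees daemon and support binaries (no autostart)

Package: bees-autorun
Depends: bees
Description: opt-in startup integration; installing this package
 enables bees at boot for configured filesystems
```

The package manager then becomes the opt-in mechanism, with no install-time questions needed.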
### Multiple btrfs is a thing
Each btrfs filesystem mounted on a host may have different constraints, where some are suitable for dedupe and some are not.
Each btrfs filesystem requires a separate bees instance, which multiplies resource usage (especially RAM cache size) accordingly.
A user with multiple filesystems might reasonably expect to choose which filesystems to run bees on, and which not.
### Private namespaces are good
Ideally we fully sandbox bees so it can't access anything outside of the filesystem it's deduping (and also `BEESHOME`, because it might be on some other FS). This is relatively easy if we have the service mount the root subvol in a private namespace for bees to work on, and arguably bees should start doing that itself if we can figure out a way to make that happen without crashing in libc.

### Public namespaces are also good
Ideally, if the user already has `subvol=/` mounted somewhere (e.g. for backups or `btdu`), bees would be able to simply reuse that mount point.

### Tracking mounts is hard, especially in the general case
Many users want to be able to mount and umount their filesystems[citation needed]. Ideally, we'd be able to start bees after the mounting and stop bees just before the umounting (though just after can also work in some cases).
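The start-after-mount, stop-before-umount ordering can be expressed with ordinary systemd dependencies. This is a sketch only; the unit fragment and the `/run/bees/%i/mount` path are illustrative, not shipped configuration:

```ini
# Hypothetical fragment for bees@.service: RequiresMountsFor adds
# Requires= and After= on the mount unit for this path, so systemd
# starts bees after the mount appears and stops bees when it goes away.
[Unit]
RequiresMountsFor=/run/bees/%i/mount
```

Mounting or unmounting `/run/bees/$UUID/mount` then drives the bees instance through normal dependency propagation, without bees tracking mounts itself.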
One possible model is that we'd track all of the points where a filesystem is mounted, excluding the one bees is mounted on, across all namespaces including the private ones, and stop the service when bees holds the last remaining mount point of the filesystem. It's easy to tell when a filesystem is completely umounted, but hard to tell when exactly one mount point remains (or N mount points, if there are N tools like bees that users want to run at the same time, and they can't all share the mount point for some reason). This isn't a good model--only the kernel knows all the places where a btrfs filesystem is mounted, and bees can add delays on umount, especially for a "clean" (SIGTERM) bees shutdown where we might significantly delay umount for a slow removable disk.
Another model is to have the user bring the filesystem to bees. In this model, when users want bees to run on a filesystem, they must arrange for the filesystem to be mounted at `/run/bees/UUID/mount`. The bees service would then start and stop depending on whether that mount point was mounted, and normal inter-service dependencies can control the mount point.

Service dependencies could also work the other way, triggering the bees mount point and the bees instance when some other filesystem is mounted or umounted. This might be a model that distro maintainers can adapt to the quirks of their installer's filesystem layout, or advanced users might construct themselves. This suggests that we should stick to a common "run one bees instance" script and keep it loosely integrated with different bees instance generator scripts that can be swapped out for common use cases like "run everywhere by default" or "run on `/` and `/home` but not `/var/lib/docker`" or "only run on `/var/lib/docker`".

### bees must be careful to avoid changing btrfs mount options unintentionally
bees could host an upstream configuration tool which takes a btrfs mount point and sets up `/etc/fstab` or systemd mount units for it. Using a mount point (as opposed to the filesystem UUID) means the tool can copy the options from the user's existing mount point, which means a lower chance of getting them wrong (e.g. accidentally changing whether compression or autodefrag is enabled for the filesystem). On the other hand, now there are two copies of the mount options in the system to maintain.

### Users should have full control over the configuration after a default config is created
We can provide a tool which automatically generates service configuration, but once generated, the user should be able to edit or correct the output of the tool, or send a PR to improve the tool to make better configurations. This implies a compilation model rather than a framework model.
### Users should choose the hash table size and RAM usage
bees's constant-size hash table is a central feature of the bees design--it uses the same amount of RAM no matter how large the filesystem gets. Different users will want different size-performance trade-offs (e.g. 1% for casual users on their laptops, 25% for dedicated storage boxes with huge RAID arrays) and therefore different hash table sizes.
We could adopt the PostgreSQL model, where the package installs a configuration with a minimal RAM cache size, a few dozen MB at most. Most users will increase it to something larger, typically 10-75% of RAM depending on how important DBMS performance is relative to other tasks in the host's workload, divided by the number of instances they intend to run.
We can also improve bees's efficiency to make better use of tiny hash table sizes, so that an affordable size like 16M could be "good enough" for casual users with a half-filled single 1TB SSD.
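The PostgreSQL-style sizing rule above is simple arithmetic (`hash_size_mb` is an illustrative helper; the numbers are examples, not recommendations):

```shell
# Hash table budget: a percentage of RAM split across bees instances.
hash_size_mb() {
    ram_mb="$1"; pct="$2"; instances="$3"
    echo $(( ram_mb * pct / 100 / instances ))
}
```

For example, `hash_size_mb 16384 25 2` gives 2048 MB per instance on a 16 GB box running two instances at a 25% budget.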