Zygo / bees

Best-Effort Extent-Same, a btrfs dedupe agent
GNU General Public License v3.0
630 stars 57 forks source link

Beesd script is not accepting options with arguments #205

Closed JaviVilarroig closed 2 years ago

JaviVilarroig commented 2 years ago

I have been trying to use beesd with the parameter --thread-count 2 to limit the CPU but fails. After some testing I found that it's taking the 2 as a parameter. Then it consider that is an unsupported parameter and bailing out to show the help message and stop.

I wonder if I'm missing something on how the parameters must be given to beesd.

If not I can voluteer to prepare a PR wit an improved script.

kakra commented 2 years ago

Use --thread-count=2

BTW: Options with arguments are supposed to be used with =, it may be that some utilities break that convention. Only single letter options allow to directly use the argument (without space) behind it, unless followed by other single letter options.

JaviVilarroig commented 2 years ago

Sorry, I was not detailed enough. I have already tried --tread-count=2, -c 2 and -c=2. And none of them works for me.

eightys3v3n commented 2 years ago

Added echo "${ARGUMENTS[@]}" to the second last line in beesd.

Running with the arguments --thread-count X --loadavg-target Y prints --thread-count --loadavg-target. Running with the arguments --thread-count=X --loadavg-target=Y fails because the loop that checks if an argument is supported checks if the entire argument is equal to a known argument. So it checks if "--thread-count=X == --thread-count"

The argument parser needs to be changed, I think, so it either accepts arguments to options, or checks if an option is supported by ignoring anything after an = sign.

kakra commented 2 years ago

Yeah I think there's a bug, I just looked at my old configuration: I passed the options to bees via OPTIONS= and that worked. Meanwhile, I don't use the script at all, I use the bees daemon directly which just works.

The script has multiple problems anyways, i.e. it expects .beeshome to be a subvolume even when not stored on a btrfs volume.

The problem with the arguments is that it parses the output from bees --help to figure out the supported options, and then just splits the word by word into two array: supported and unsupported options. It'll then assume that all unsupported options are a UUIDs to mount and pass to bees.

That's problematic from two perspectives: bees only supports a single filesystem only for some while now, there's no need to multiple path arguments, and it assumes everything that's not a word from the help to be a separate argument even when it doesn't match a UUID format. It should probably reverse the logic and look for arguments that look like a path or UUID, and just extract those, and pass the rest as-is.

JaviVilarroig commented 2 years ago

I can rewrite the script to use getopt to grab options.

Any issue with that approach?

kakra commented 2 years ago

Personally, I'd prefer if bees itself would do some of the things right within the C++ code (creating and size adjustments of the hash file), and only mounting deferred to outside of the process. But I think @Zygo as a different idea of how to split that due to reducing dependencies, and that is valid.

If you ask me, starting bees right within the correct directory, it should figure out automatically most things, then we could delegate mounting to the systemd process with maybe a small pre-exec script which adjusts the environment. The bash script should not bother with any bees arguments: If it sees a path or uuid, let it extract that, the rest should be passed as-is to bees. Ideally, it doesn't need any getopt at all. No fancy parsing of the bees --help output, no getopt.

Some ideas:

Just go through the arguments and either put them in a pass list, or if it starts with / or is in UUID format, handle that and put the converted argument to the pass list. Everything else would be duplicated effort because you'd need to maintain getopt formats in two places.

I could imagine that bees could have a "secret" argument like --check-params and if passed, it just checks all parameters for sanity and then exists with either 0 or 1, so the starter script would now early if the command line works.

For systemd mode, a ExecStartPre script could parse the configuration files and export the proper environment variables, then just exit, and let bees run directly. bees may internally need some additional support for environment variables but that should be all. That's similar to how ElasticSearch works setting up its environment.

What do you think?

kakra commented 2 years ago

BTW: There was an attempt by me to add configuration file support into bees natively, but I lost track of that during some personal live changes. The idea has not died yet but the branch is very outdated and I'd need to recap what I did there. Maybe you can find it useful: https://github.com/kakra/bees/commits/feature/configuration-files

JaviVilarroig commented 2 years ago

Well. I just have a problem with CPU consumption ant trying to fix it, a shell script fix is fast and easy. I'm a pragmatic person an look for direct solutions to my problems. :smile:

That said. I don't like too much the current bees aproach to configuration and systemd interaction. Having to type the UUID of the drive every time I want to start or stop bees, for example.

I agree that adding using getopt is adding an additional dependency. But I don't know any distribution that does not packaged it by default :wink: There's also the option to forget about long arguments, limit it to short options an then use the bash built-in getopts to parse the option. This my approach at work when server writing scripts (among other things right now I work as a sysadmin).

Having a configuration file per btrfs set I think is OK. I don't want to give the same treatment to each of my btrf sets. But I agree that it must be bees itself reading the configuration file. I fact, I will put all the options in the configuration file. Having to modify the systemd unit file look ugly for me.

Still I consider that beesd is buggy right now and having a fast and working fix to it is a must. If not, the final users like me will end having search for their own solution. And this is bad.

So I propose to fix beesd script to work as documented and later on look for improving it.

Later on I can take a look at the bees code itself and have it looking for the configuration file and taking the configuration from there. I just need to find some time for that.

Expect to have two PR's coming from me. One for the script fix, and later on one for bees to read the configuration from the config file.

Can you buy this? :wink:

Zygo commented 2 years ago

Personally, I'd prefer if bees itself would do some of the things right within the C++ code (creating and size adjustments of the hash file), and only mounting deferred to outside of the process. But I think @Zygo as a different idea of how to split that due to reducing dependencies, and that is valid.

I might have had ideas but they were a long time ago...

I got tired of dealing with multiple incompatible implementations of libuuid so I ripped out the code in bees that handles uuids. I guess we only need the uuid-to-string function from that library, so we could just import it instead of trying to sync up with a lib that users have the other versions of.

For configuration, we need a solution that scales up to a lot of options. I'm thinking of having a simple 'name = value' inheritable dictionary-based configuration format. The names might follow a vaguely hierarchical convention like hash-algorithm and hash-bits.

  1. We put the defaults in a text file, turn them into C code, and compile them into the binary to set the defaults. That same text file can provide the help message text. Or alternatively we write it in C and then have a binary that dumps it out in different formats once compiled. Either way, there should be only one source for these that generates the other.
  2. Default-constructing a BeesConfig gives you a pile of filled-in configuration variables from the defaults. Also tests the parser for the non-trivial option value formats.
  3. We read the command line looking for --config file and --option name=value options, and set those options. We make the variable currently known as $BEESHOME one of those options. All the existing command line options also map to "bees options," and there are no new command-line options after that.
  4. We read /etc/bees/bees.conf (a file named by config.global) for more options
  5. We read $BEESHOME/bees.conf (a file named by config.local) for more options
  6. We hand the parsed options object to BeesContext's constructor, and bees runs.

Order and precedence gets a little wonky because we have to physically process item 3 before item 5, but we want the stuff in item 5 to be overridden by item 3. Not hard to solve with a priority system.

The stuff in the first 100 lines of src/bees.h becomes an option, and there's a clear path for adding more options, which we'll need if we start doing things like excluding subvols by path regexp, or specifying in fine detail how to slice and dice btrfs csums into bees hashes (that's 6 to 8 parameters by itself) or what sizes to use for extent size tiering (that's up to a 5x4 table of parameters describing a Venn diagram of what to include and exclude and what order to do things that are included).

So after we read all that, we have all the stuff in one place that is currently in three places:

With a parsed config we would have a specification for a hash table format. Then we look at the hash table that's on disk, and if it's different (or missing), we either abort with an error, or we convert the data from the format it's in to the format specified in the configuration. A config option tells us which of those to do.

Anyway that's all future stuff. For now, hack and slash the shell script until it works, if not properly, then at least usably.