Frontear / dotfiles

Configurations that power my NixOS systems

Remove persistence activation scripts and migrate to systemd units #22

Open Frontear opened 1 week ago

Frontear commented 1 week ago

The current activation script has accrued too much technical debt. It also has a couple of long-standing issues, such as permissions leaking onto parents far outside their context (persisting /var/cache/greeter leaves /var/cache itself owned by user and group greeter). There's also the desire to extend it in #8, which I have yet to really begin due to the technical debt.

My current implementation plan calls for multiple independent systemd units that work together to re-implement the logic of the current activation script. One big change I plan to make here is to completely drop permissions management. Users of a non-ephemeral root don't obsessively re-set permissions across multitudes of files at each boot; they leave that duty to the programs that create the files. We will leverage a similar idea whilst considering the unique challenge of reproducibility. The basic idea of how the systemd units will operate is:

  1. Check if the file exists in the persist (usually /nix/persist)
  2. If it doesn't exist: wait for the file to exist on the tmpfs, then move it to the persist.
  3. If it does exist: bind mount it from the persist to the tmpfs.

I do not know systemd that well, but from my very quick research, step 2 can be implemented using a path unit, whilst step 3 can be implemented using a service unit. I'm not sure where step 1 will fit in, it might be its own check that then delegates to the other units, or it might be part of them. I suspect that once I start to actually flesh out an implementation and learn more about systemd I'll have a good idea of where I want to put these things.

There are a few problems to consider immediately, and they must be handled in a way that doesn't cause future headaches like the current implementation does.

I am marking this issue as high priority because it must be resolved before any changes are done to the persist module. It doesn't actually mean that it must be done quickly, only that it must be done first.

[!NOTE] I had a very long novel written here about the history of my module and why I want to move it, but I realized it had gotten too long to post here. Instead, I might just make it a blog entry over at https://web.frontear.dev in the next couple of weeks, once I get the time!

Frontear commented 1 week ago

Currently I've made two template units, one as a path unit and the other as a service unit.

The path unit does what you'd expect: it simply watches some path for PathChanged. I chose this trigger because it only fires once the path has been modified and closed by the writing process, meaning any attempt to touch it at that point is safe.

The service unit fires automatically when the path unit's PathChanged triggers. It copies the file to the persist, deleting the original after a successful copy. It then creates a faux file/directory at the original location and executes systemd-mount to bind mount over it. A guard check with ConditionPathExists ensures that it doesn't re-run from PathChanged once the path has already been persisted.

Here are the relevant units, written as NixOS modules (note that these do not handle any complex problems):

{
  pkgs,
  ...
}:
{
  config = {
    systemd.paths."persist@" = {
      unitConfig = {
        Description = "wait until creation of %f";
        ConditionPathExists = "!/nix/persist/%f";
      };

      pathConfig = {
        PathChanged = "%f";
      };
    };

    systemd.services."persist@" = {
      unitConfig = {
        Description = "copy %f to persistent device";
        ConditionPathExists = "!/nix/persist/%f";
      };

      serviceConfig = {
        Type = "oneshot";
        ExecStart = "${pkgs.writeShellScript "persist.sh" ''
          rootPath="$1"
          persistPath="/nix/persist/$1"

          # Copy to the persist (recursively, preserving attributes),
          # then drop the original so it can become a mount point.
          mkdir -p "$(dirname "$persistPath")"
          cp -a "$rootPath" "$persistPath" && rm -r "$rootPath"

          # Recreate a faux entry of the matching type on the tmpfs
          # to act as the bind mount target.
          [ -f "$persistPath" ] && touch "$rootPath"
          [ -d "$persistPath" ] && mkdir "$rootPath"

          systemd-mount -o bind,"x-systemd.requires=$(dirname "$rootPath")" "$persistPath" "$rootPath"
        ''} %f";
      };
    };
  };
}

There are some problems though. First off, neither systemd-mount nor a mount unit has any special way to bind mount; it's just a normal mount --bind with some niceties added on top. This is bad because we have extra requirements around the ownership and permissions of the bind target.

The core issue is that a mount of any kind always needs a target, and that target must already exist as a fully qualified file/directory. The traditional mount command can only create the target automatically if --mkdir is provided, and even then it's only relevant for directory bind mounts; files cannot be bind mounted in this manner. systemd-mount can automatically create a file and all relevant parent directories, which is really nice, except... they are all root owned with generic permission bits!

Now granted, I haven't played around with systemd-mount --owner at all here, but I don't think I need to. I suspect it would set the user and group as determined by /etc/passwd, but it would not correctly preserve the permission bits that we need during the recursive creation of the bind target.

Our solution in the original module was to recursively create every parent and set the permissions of the desired candidate, which was bad because it leaked user and group ownership outside of that candidate's directory tree. Fortunately, this time we don't actually care about users or groups; that's all determined by whatever we have saved in our persisted directories. All we have to do is reference what they have, ensuring consistency. The bad news is that we probably have to do this ourselves. I'll keep digging, but it doesn't seem to be supported by systemd-mount or by writing a mount unit: they allow setting permissions for (I presume) the target, but that's not really important for us.

That's not to say it's hopeless; if anything, this is already much better than before! One of the nice things a mount unit will do for us is automatically determine a hierarchical ordering of descendants in the file tree, which means there will no longer be any race conditions! I'm unsure if systemd-mount has the same niceties, given that it's an ad-hoc tool, but that may be worth checking out.

Another thing that I've learned is that systemd-mount -o bind [source] [target] sets the permissions of source on target for the bind's lifetime. I don't know whether this behaviour is unique to systemd-mount, but now that I've noticed it, our game plan is much simpler.

Assuming we know that we have a persisted entry, and the root is completely fresh, we need to recursively create each necessary directory and then grab the permissions from the persist. This only needs to happen the first time we create these paths, because all future paths can re-use the existing directories. Furthermore, we only need to create up to the parent: we can let systemd-mount create the final target for us, which also sets the permissions for that specific target correctly.

To illustrate it further with an example, let's say we want to bind /home/frontear/foo, and / is completely empty:

  1. mkdir -p "$(dirname /home/frontear/foo)", and at each step, when a new parent is created, find the matching parent on the persist and chmod --reference=<persist-parent> <root-parent> && chown --reference=<persist-parent> <root-parent>.
  2. systemd-mount -o bind --owner=frontear /nix/persist/home/frontear/foo /home/frontear/foo, which will create foo only, then bind.

Unfortunately, I think the usage of systemd-mount means that we cannot benefit from systemd's hierarchical ordering based on the paths. This technically is not a problem at all, since each mount operation is idempotent and self-sufficient thanks to the mkdir. I don't know if there would be any benefit to changing the current approach to make use of this feature in systemd, partly because I suspect I'd need a mount unit per persist, and partly because I don't know if there'd even be a performance improvement.

On a final note, the current template approach requires 2 units for each persisted path. This isn't too big of a deal right now, because I think I have around 20 or so persists, but it has the potential to grow out of control. Either way, I see very big potential here!

Frontear commented 6 days ago

Ignoring some of the typos I made in the earlier comments, we've got a couple of other problems.

Assume we have paths /foo and /foo/bar, where /foo/bar is a file and /foo is a directory.

This was my first real test, and I can't do much more given my limited ability to experiment in this environment. It identifies one clear requirement for me: we must simplify the persists and remove redundant ones.

An idea that just came to me was dividing each path into nested attribute sets upon insertion, then comparing to check whether a common attrset already exists. A quick rudimentary example (using directories only):

  1. Persist /foo/bar, which converts into { "foo" = { "bar" = null; }; }
  2. Persist /foo/bar/baz, which converts into { "foo" = { "bar" = { "baz" = null; }; }; }
  3. The insertion fails, as "foo"."bar" had been defined, and thereby 'persisted', before.

I have no idea if Nix is capable of such logic. If it's not, then we may have to delegate a lot of responsibility to more scripts, which is... less than preferable.

Frontear commented 6 days ago

Here's an implementation idea for a comparator that could be used in the previously mentioned insertion algorithm, written in mostly valid Nix with some pseudo-Nix code:

let
  p1 = lib.setAttrByPath [ "foo" "bar" ] null;
  p2 = lib.setAttrByPath [ "foo" "bar" "baz" ] null;
in
  compareForCommonParents p1 p2 # recursive
  # 1. Check if both are attrsets (builtins.typeOf p# == "set")
  # 2 (true). Check if the 1st attrName matches.
  # 2 (false). The one that doesn't fail is deeper nested, prefer the failure. [DONE]
  # 3 (true). Descend once more from 1 [DONE]
  # 3 (false). If there was a common parent, prefer it. If not, these are distinct paths. [DONE]

This algorithm assumes that neither p1 nor p2 can ever be null at the top level. It also assumes that each subset contains exactly one LHS = RHS pair, and that RHS is either an attrset or null.
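As a sanity check on those semantics, here is the comparator mirrored in Python, with None standing in for Nix's null and single-key dicts for the single-pair attrsets (the helper name is mine, not from the module):

```python
def common_parent_merge(p1, p2):
    """Return the entry that should survive insertion: if one path is
    a prefix of the other, the shallower (parent) path wins; if the
    paths diverge, both are kept as distinct siblings."""
    # A leaf (None) means a path ends here; the side that descends
    # further is redundant, so prefer the leaf ("prefer the failure").
    if p1 is None or p2 is None:
        return None
    (k1, v1), = p1.items()  # assumes exactly one LHS = RHS pair
    (k2, v2), = p2.items()
    if k1 != k2:
        # Distinct paths: both survive side by side.
        return {k1: v1, k2: v2}
    # Same component: descend once more, keeping the common parent.
    return {k1: common_parent_merge(v1, v2)}
```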

Frontear commented 6 days ago

I've made a massive breakthrough in the module's design: rather than implementing the insertion logic manually, we can abuse the module system's evaluation with priorities to force parents to be higher in priority than their children, effectively manipulating the insertion order as it's evaluated.

Confusing, I know, but let's break this down. Up until now, I've already expressed the idea of converting a path into an attribute set. This was made possible thanks to lib.setAttrByPath in combination with lib.splitString, allowing us to create the exact attrset graphs that we wanted. Breaking the path up into a list of attr names usable by lib.setAttrByPath was expressed as:

path: lib.pipe path [
  (lib.splitString "/") # break the path from the separator
  (lib.filter (x: x != "")) # remove empty, these occur from the root slash, and repeated slashes.
]
nix-repl> :p lib.pipe "/foo/bar" [ (lib.splitString "/") (lib.filter (x: x != "")) ]
[
  "foo"
  "bar"
]

We can then join this list using lib.setAttrByPath. This was easily added as an extension of the previous code:

path: lib.pipe path [
  (lib.splitString "/")
  (lib.filter (x: x != ""))
  (ps: lib.setAttrByPath ps null)
]
nix-repl> :p lib.pipe "/foo/bar" [ (lib.splitString "/") (lib.filter (x: x != "")) (ps: lib.setAttrByPath ps null) ]
{
  foo = { bar = null; };
}
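For readers less familiar with Nix, the same pipeline can be mirrored in Python, with None standing in for null (the function name is mine):

```python
def fs_path_to_attrs(path: str):
    """Python mirror of the Nix pipeline above: split on "/", drop
    empty components (from the root and any repeated slashes), then
    nest the names into a dict chain ending in None (Nix's null)."""
    parts = [p for p in path.split("/") if p]  # splitString + filter
    attrs = None                               # innermost value
    for name in reversed(parts):               # setAttrByPath
        attrs = {name: attrs}
    return attrs
```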

Here begins the tricky part. Originally I was going to write a recursive descent function that manually compared, but I kept thinking about how I could use the module system's existing "merge" behaviour. I tried a few experiments and realized that one of the best ways to enforce "this over that" was via priorities!

This wasn't super straightforward to understand, so I'll try to explain this bit a little. Let's say we wanted the module system to "merge" /foo/bar and /foo/bar/baz. It needs to intelligently determine that /foo/bar/baz is unnecessary given /foo/bar, no matter which order they are inserted in. Originally I thought this was fairly simple: just force a priority on the child. If the child has been set to null (i.e. in /foo/bar, bar = lib.mkForce null), it should prevent any modifications on that child from propagating. This sounded fine in theory, but when implemented it did not work.

{
  foo = { bar = lib.mkForce null; };
}
// # represents module merge, not the nix operator
{
  foo = { bar = { baz = lib.mkForce null; }; };
}
# evaluation:
{
  foo = {
    bar = {
      baz = {
        _type = "override";
        content = null;
        priority = 50;
      };
    };
  };
}

What was going on? Why wouldn't the lib.mkForce on bar prevent the creation of bar.baz? Honestly, no idea! I suspect there may be some issues with merging values of different types, but I couldn't completely figure it out. I did, however, find a very good solution: placing the priority one level up.

Since we can guarantee that our attribute set will exist, even if it doesn't have any children, we can simply place the priority on the last attrset that contains our null. In this manner, when Nix attempts to merge them, it can clearly determine that one is of a higher priority than the other! Thus, the entire responsibility of handling this logic goes to Nix! This also has the added benefit of simplifying our traversal later, because we will have the strictest set of parents locked in. Furthermore, we can use this as flexibly as we want: since the insertion logic is handled by Nix, anything we can "add" to our option gets this behaviour for free.

In order to do this, we split off the final child attribute using lib.partition, create the parents using lib.setAttrByPath, and set the child itself to a lib.mkForce { p = null; }. The full code and a reproducible test case can be seen below.

# module.nix
{
  config,
  lib,
  pkgs,
  ...
}:
let
  cfg = config.foo;

  fsPathToAttrs = (path: lib.pipe path [
    (lib.splitString "/")
    (lib.filter (x: x != ""))
    (ps: lib.partition (x: x == (lib.last ps)) ps)
    (prt: lib.setAttrByPath prt.wrong (lib.mkForce {
      "${lib.elemAt prt.right 0}" = null;
    }))
  ]);
in {
  options = {
    foo = lib.mkOption {
      type = with lib.types; attrsOf (nullOr attrs);
    };
  };

  config = {
    foo = lib.mkMerge [
      (fsPathToAttrs "/a")
      (fsPathToAttrs "/a/b/c")
      (fsPathToAttrs "/a/b")
      (fsPathToAttrs "/a/d")
      (fsPathToAttrs "/a/d/e/f")
    ];
  };
}
nix-repl> :p nixosConfigurations.LAPTOP-3DT4F02.config.foo
{ a = null; }

Honestly this feels marvelous. I'm a little amazed at how cursed this is and yet how genius it is.

Frontear commented 6 days ago

EDIT: Hello! I have made a silly mistake. Turns out I forgot that the value of a module option is not itself subject to the rules of the module system. This whole time, I've been treating my attrsets as if their contents were also module system options, and thereby subject to merging, but this isn't the case. Only the very top level, i.e. the "children" directly under the option, is subject to the override rules.

In hindsight, there were signs. Evaluating the config was leaving information about the priority, which should have been impossible. Back to the drawing board!

Old: Seems that playing around with the position of the priority is not enough of a fix! Try something like `/var/lib` and `/var/db/cache` and you'll see that `/var/db` isn't considered. Working on a fix; I'm confident the solution is in the module system.

Frontear commented 5 days ago

Good news! After some thinking for the last hour, I came up with a new solution that abuses Nix's merging of attribute sets. More specifically, it leverages lib.recursiveUpdate to force a destructive merge only on attributes that match, whilst keeping the rest safe.

The idea is pretty simple: after we convert our paths into lists, we must sort them. Why? We want to prioritize the shorter paths, as they cover more ground than the longer paths (e.g. /var/lib covers /var/lib/systemd). Sorting in descending length gives us a list where the shorter paths are at the bottom, so they are folded in last and win. This is important for the next step, which is the destructive merge.

When lib.recursiveUpdate attempts to merge the attrsets, it recurses as long as both sides are attrsets; wherever the right-hand side ends in a non-attrset value, it destroys whatever the left-hand side had there, and everywhere else it simply adds attributes like normal. It's actually important that we use lib.recursiveUpdate and not a standard // operation, as the latter would destroy elements that could co-exist fine ({ var.foo = null; } // { var.bar = null; } collapses into { var.bar = null; }, which is not ideal).
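To make the difference concrete, here's a small Python model of the two behaviours (a sketch: recursive_update mimics lib.recursiveUpdate, while dict unpacking mimics Nix's shallow //):

```python
def recursive_update(lhs, rhs):
    """Mimic lib.recursiveUpdate on plain dicts: recurse while both
    sides are dicts; otherwise the right-hand side wins outright."""
    if isinstance(lhs, dict) and isinstance(rhs, dict):
        out = dict(lhs)
        for key, value in rhs.items():
            out[key] = recursive_update(lhs[key], value) if key in lhs else value
        return out
    return rhs
```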

Putting it all together, we can have the following code:

pathsToList = (paths: map (path: lib.pipe path [
  (lib.splitString "/")
  (lib.filter (x: x != ""))
]) paths);

pathsToAttrList = (pathsList: lib.pipe pathsList [
  (lib.sort (e1: e2: lib.length e1 > lib.length e2))
  (map (path: lib.setAttrByPath path null))
]);

pathsToAttr = lib.foldl lib.recursiveUpdate {};
my.persist-ng.toplevel = pathsToAttr (pathsToAttrList (pathsToList [
  "/etc/NetworkManager"
  "/var/lib"
  "/var/log"
  "/var/tmp"

  "/var/cache/tuigreet"

  "/var/cache/tuigreet/foo"
  "/var/db/sudo/lectured"
]));

And using it in a repl:

nix-repl> :p nixosConfigurations.LAPTOP-3DT4F02.config.my.persist-ng.toplevel      
{
  etc = { NetworkManager = null; };
  var = {
    cache = { tuigreet = null; };
    db = {
      sudo = { lectured = null; };
    };
    lib = null;
    log = null;
    tmp = null;
  };
}

Merging beautifully, as we expected! This can then be converted back into a list of paths for our persistence services, persisting a minimal set of the most important paths.
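That reverse conversion can be sketched as a walk over the merged attrset, emitting one path per null leaf (a Python sketch with None standing in for null; the helper name is mine):

```python
def attrs_to_paths(attrs, prefix=""):
    """Walk the merged attrset and emit one path per None leaf,
    recovering the minimal list of paths the persistence units need."""
    paths = []
    for name, value in sorted(attrs.items()):
        cur = prefix + "/" + name
        if value is None:
            paths.append(cur)  # leaf: a persisted path ends here
        else:
            paths.extend(attrs_to_paths(value, cur))
    return paths
```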

Frontear commented 1 day ago

It's been a couple of days. I have not had nearly as much time to test and work out things, but I have run into some problems in my limited runs.

First, I'm realizing it's more and more unsafe to try to perform an in-place bind mount. Linux fundamentally doesn't have a safe way to lock file access without causing issues for every other process. Of course, this isn't Linux's fault, but it's not really the fault of the processes either. What I'm trying to do goes beyond the scope of what both the kernel and system processes expect to happen to directories they own. As a result, I unfortunately believe this approach is impossible to do safely, at least outside of kernel space.

Hope is not lost, because I'm thinking of an alternative approach: copying on shutdown. By this stage, most processes will have been shut down, and if one hasn't, we can wait out its termination. Once it has terminated, we can safely copy its files over. There will be no bind mounting at this stage, since the system is shutting down, which makes binding moot.

Unfortunately, this approach has some glaring problems, specifically with directories owned and managed by systemd. /var/log, for example, stays open for most of the boot's lifetime save the very last step, when one of the 5 final services is started: systemd-reboot.service, systemd-poweroff.service, systemd-halt.service, systemd-kexec.service, or systemd-soft-reboot.service, all of which come after final.target. This means that copying that directory will always lose some content from the current boot, because I cannot grab everything.

There were two solutions I thought of, one of which I even tried in practice, unsuccessfully. The first was to create a shutdown script that fires after the journal has been closed and all filesystems are remounted read-only. The problem here is that since everything is read-only, I suspect /nix is too, which makes copying impossible without a remount. I have no idea how bad that may be in practice, given that by this point systemd wholly expects mounts to be gone, with the last few completely read-only. Another problem was that I couldn't easily capture the logs for this, since the journal is no longer recording. It may be worth looking into more, since it's essentially one of the absolute last services that can exist on a system, but its debugging is a bit of a nightmare.

The second solution was to accept that one full version of a path will likely always be lost. If the path is a file, we can probably perform a safe copy, but if it's a directory we really cannot, and so we simply don't. This has logical as well as semantic issues. For one, how can we assert when a file can be safely copied? That has the same problems as the earlier approaches. Secondly, losing one full cycle of data is pretty painful, and I'd rather handle it in a different way.

The road is a bit bumpy now. At this point it honestly seems slightly more effective to simply re-use the current activation approach and port this design's permission logic onto it. It doesn't feel like this approach solves any of the core problems the activation script had; all it does is give me slightly more control over when and where things fire, which is nice but very limited.

Ultimately, I may slow down development on this because I'm realizing it's a bit more complicated than I had originally anticipated. That doesn't mean I plan to drop it; I just need a lot more time to actually explore and consider my approaches.