VorpalBlade / chezmoi_modify_manager

Tools for chezmoi to handle mixed settings and state
GNU General Public License v3.0
40 stars 5 forks source link

Overlapping regex matches: Invent sensible semantics for rule evaluation order #99

Closed VorpalBlade closed 5 months ago

VorpalBlade commented 6 months ago

(Really this belongs in ini-merge, but putting it directly in chezmoi_modify_manager for user visibility)

Consider something like:

ignore regex "ActivityManager" "switch-to-activity-.*"
transform regex ".*" ".*" kde-shortcut

Both of these will match

[ActivityManager]
switch-to-activity-4bbc8a8d-b350-48fc-9f33-8190f1772eea = some value

Currently we take the first match, which is source order for regex. This may be surprising, so a debug output like this is valuable. However we might also want that behaviour, as in this case. Ideas:

Also, order of evaluation is currently (undocumented, not guaranteed to not change):

  1. Is section ignored? If so skip remaining checks
  2. Is there a literal rule? If so skip remaining checks
  3. Is there one or more regex matches?

I don't want to slow things down by always checking all match types, so that might be an issue too.

elbowz commented 6 months ago

I have some trouble to understand (of course it's my mind):

Currently we take the first match, which is source order for regex.

I thought that this means the order in the modify_ script, so I have tried to swap the lines:

transform regex ".*" ".*" kde-shortcut
ignore regex "ActivityManager" "switch-to-activity-.*"

but nothing is changed:

I have misunderstood for sure, maybe an order priority/evaluation in the doc could be helpful (at least in my case).

What do you think about manage the modify_ script like pipelines (eg. GitHub actions)? These can be executed sequentially and the output of the above are the input of the below. eg.

transform regex ".*" ".*" kde-shortcut
ignore regex "ActivityManager" "switch-to-activity-.*"

In this case, the transform will be applied to all the keys, and after, the ignore statement will remove the "switch-to-activity-.*" key.

...but again, I'm sure I'm missing something how chmm (could be nice!) works behind.

In my mind, I would transform all in something like:

Perhaps, describing all with YAML or TOML? This should be works good also with templates, with the if statement to jump/include some parts (sections/keys) of the pipeline.

Anyway, ignore all my overthinking, but maybe a documented order evaluation and a better Warning message (with also the involved dotfile), could be usefull :)

VorpalBlade commented 6 months ago

but nothing is changed:

  • same warning after the chezmoi apply
  • same dot_config/private_kglobalshortcutsrc.src.ini

The difference should be in how it applies (chezmoi apply). In the specific case that the ActivityManager entry happens to exist in both the system state and the source state it would not make a difference however. You would have to have a situation where they are out of sync to see a difference.

There will be a warning though, that is indeed working as intended (for now, that is, this issue is about possibly changing that).

I have misunderstood for sure, maybe an order priority/evaluation in the doc could be helpful (at least in my case).

Yes, there has been a few discussions and issues over the past couple of days that have made me spot some holes in the docs for sure. What with my RSI I will see when I get around to it though.

What do you think about manage the modify_ script like pipelines (eg. GitHub actions)? These can be executed sequentially

It would be a very different approach than currently taken. The logic for ini-merging is currently implemented in the ini-merge crate.

Copying from another recent discussion:

The merging semantics implemented by Chezmoi_modify_manager (chmm for short, I should have picked a shorter name in retrospect):

  1. If a key or section is ignored, use the value from the config on the system (sys)
  2. If a transform is specified for a key or section apply it (special rules apply for each transform, see docs for each).
  3. Otherwise pick the value from the chezmoi source state (src)
  4. If a key or section is only in sys, but not in src, remove it (unless ignored)

The basic algorithm is:

  1. The rules are loaded into a data structure.
  2. The .src.ini is loaded into a map (dictionary, key-value table, associative array, not sure which programming languages you are familar with if any).
  3. The current system state (as given to us by chezmoi) is processed line by line, looking up the .src.ini state as well as referring to the ruleset to determine what to do for any given line.
  4. Just before the beginning of a new section (as well as at the end of the file) we check to see if there are any additional entries in the .src.ini that we didn't see in the system state, and that we should add in.

To be able to process rules in order (instead of the input system file in order) would need a completely different approach. I don't know what such an approach would look like, and I'm not the one that will design it.

Additionally, I like performance. One benefit of the current approach is that it works in a single pass over the input data (except for the src.ini) as well. Anything that doesn't might have a hard time beating it performance wise. (I benchmark every change and have tried many things that I ended up throwing away because they didn't provide statistically significant performance improvement, or ended up slowing things down unacceptably).

Also: do note that the warning is just that, a warning. It should probably be a notice or debug output instead. It isn't actually a problem if you intended that behaviour.

elbowz commented 6 months ago

The difference should be in how it applies (chezmoi apply). In the specific case that the ActivityManager entry happens to exist in both the system state and the source state it would not make a difference however. You would have to have a situation where they are out of sync to see a difference.

I meant, that this:

transform regex ".*" ".*" kde-shortcut
ignore regex "ActivityManager" "switch-to-activity-.*"

and this:

ignore regex "ActivityManager" "switch-to-activity-.*"
transform regex ".*" ".*" kde-shortcut

are the same for chmm. The order of the statements/commands is not important when the rules are applied, right?

The merging semantics implemented by Chezmoi_modify_manager (chmm for short, I should have picked a shorter name in retrospect):

  1. If a key or section is ignored, use the value from the config on the system (sys)
  2. If a transform is specified for a key or section apply it (special rules apply for each transform, see docs for each).
  3. Otherwise pick the value from the chezmoi source state (src)
  4. If a key or section is only in sys, but not in src, remove it (unless ignored)

Very useful, please add an enriched form of it in the main README.md. Maybe, you could add where and when are applied the other statements: add:remove and add:hide.

Is this the execution flow when you run chezmoi apply? I think could be a little bit different when you run chezmoi_modify_manager --add [file] to generate the src.ini file. Maybe also the flow execution of this could be useful, to put in the README.md :)

The basic algorithm is:

  1. The rules are loaded into a data structure.
  2. The .src.ini is loaded into a map (dictionary, key-value table, associative array, not sure which programming languages you are familar with if any).
  3. The current system state (as given to us by chezmoi) is processed line by line, looking up the .src.ini state as well as referring to the ruleset to determine what to do for any given line.
  4. Just before the beginning of a new section (as well as at the end of the file) we check to see if there are any additional entries in the .src.ini that we didn't see in the system state, and that we should add in.

This is rather clear, except the magic druing the point 3...but I think the merging schematic seen before is more than enough to use chmm.

To be able to process rules in order (instead of the input system file in order) would need a completely different approach. I don't know what such an approach would look like, and I'm not the one that will design it.

ahaha...of course ;)

Additionally, I like performance. One benefit of the current approach is that it works in a single pass over the input data (except for the src.ini) as well. Anything that doesn't might have a hard time beating it performance wise. (I benchmark every change and have tried many things that I ended up throwing away because they didn't provide statistically significant performance improvement, or ended up slowing things down unacceptably).

I had felt something about your love for performance reading the documentation, I don't figure out why in a project like this (sporadic execution of chmm) the performance are so important, but for sure your project your rules :) ...and the "pipeline approach" should be O(n^2), instead your algorithm I guess is O(n).

Also: do note that the warning is just that, a warning. It should probably be a notice or debug output instead. It isn't actually a problem if you intended that behaviour.

Totally agree, maybe you could add a more detailed information (maybe with a link to documentation for more information about it), and the system file and/or source one involved, to understand what and where cause the warning (IMHO).

chezmoi_modify_manager WARN] Overlapping regex matches for ActivityManager/switch-to-activity-56274a72-0b6e-4047-a116-237fda170f36 in the "dot_config/modify_*", first action taken

Thanks for time spent to code and answer!

VorpalBlade commented 6 months ago

are the same for chmm. The order of the statements/commands is not important when the rules are applied, right?

They should be different. Though I claim that after reading the code. I admit I haven't tested.

Very useful, please add an enriched form of it in the main README.md. Maybe, you could add where and when are applied the other statements: add:remove and add:hide.

The main README is overly long and cluttered, it will be some linked file in doc/ eventually. As for add:hide/remove they are only relevant when re-adding a file. This should be clear from chezmoi_modify_manager --help-syntax. Do you think that needs improvement or is the issue that this help text is too hidden away?

Is this the execution flow when you run chezmoi apply?

Yes, with some minor details left out regarding dealing with things like "keys before first section" (those should work as expected, but need to be handled with care in the code).

I think could be a little bit different when you run chezmoi_modify_manager --add [file] to generate the src.ini file.

It is indeed very different in that case, only ignore and add;* are processed and a simple filtering algorithm is applied. Not a merge at all.

This is rather clear, except the magic druing the point 3...

That magic would be the first list :)

I don't figure out why in a project like this (sporadic execution of chmm) the performance are so important,

Version 1.x was written in Python. With around 20 managed ini files, chezmoi diff etc was noticably slow. Note: chezmoi_modify_manager is executed once per managed file. Unfortunately I don't get the option of batch processing everything in a single instance.

So I rewrote it in rust with a focus on performance. I ended up ~50-60x faster per invocation. Python has a lot of startup overhead, and that dominated. The algorithm was the same in the Python version and the Rust version. Sure the Rust version was generally faster to actually run the logic too, but a large part was due to startup overhead.

On my desktop a typical single invocation of Rust chezmoi_modify_manager itself is ~2 ms, or 8 ms if using the keyring transform (yes, setting up dbus communication is slow). This is on Linux, I don't use other platforms myself so I don't know the numbers there. A full chezmoi diff is ~90 ms. On my older laptop though that same chezmoi diff is ~190 ms:

$ hyperfine "chezmoi diff --no-pager"
Benchmark 1: chezmoi diff --no-pager
  Time (mean ± σ):     193.5 ms ±  21.9 ms    [User: 140.4 ms, System: 64.4 ms]
  Range (min … max):   164.8 ms … 249.4 ms    17 runs

(This laptop hits jackpot on all the bad CPU vulnarabilities, meltdown, spectre, all of them basically, so it has lost a lot of performance compared to when it was new, and it is Intel from 2017 while my desktop is a AMD Ryzen 5600X). So yes I do believe performance matter. For the same reason I turn off GUI animations and optimise the startup time of zsh: The computer shouldn't needlessly waste my time.

VorpalBlade commented 6 months ago

I did an experiment whereby I converted literal actions to use regex actions internally (note: I didn't do proper escaping, that would need to be added as well). This would allow for rules to apply in order regardless if they are regex or literal. However, performance benchmarks did not look good:

$ hyperfine -N --input bench/heavy_ini --warmup 10 "target/release/chezmoi_modify_manager"{,.before}" bench/modify_heavy_ini.tmpl"
Benchmark 1: target/release/chezmoi_modify_manager bench/modify_heavy_ini.tmpl
  Time (mean ± σ):       2.8 ms ±   0.1 ms    [User: 1.8 ms, System: 1.0 ms]
  Range (min … max):     2.6 ms …   3.9 ms    1067 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: target/release/chezmoi_modify_manager.before bench/modify_heavy_ini.tmpl
  Time (mean ± σ):       2.2 ms ±   0.1 ms    [User: 1.4 ms, System: 0.7 ms]
  Range (min … max):     2.0 ms …   2.8 ms    1309 runs

Summary
  target/release/chezmoi_modify_manager.before bench/modify_heavy_ini.tmpl ran
    1.28 ± 0.06 times faster than target/release/chezmoi_modify_manager bench/modify_heavy_ini.tmpl

As regexes are compiled down into a single optimised RegexSet there doesn't seem to be a good performant way to support interleaving them with literal matches.

Thus the evaluation order will stay as is. This narrows the design space considerably. Remaining things to investigate:

EDIT: Hm, did a comparison on chezmoi diff as well, and here the difference is much smaller. Might have to think a bit more about this as well as test implementing the required escaping.

elbowz commented 5 months ago

They should be different. Though I claim that after reading the code. I admit I haven't tested.

Ok, so what I thought to have understood was totally wrong I'll wait the new documentation #101 to give me a new chance ;)

Do you think that needs improvement or is the issue that this help text is too hidden away?

I think the offline documentation (ie --help and man) should be the same or a subset of the online one. But this could be my mistake...

VorpalBlade commented 5 months ago

I think the offline documentation (ie --help and man) should be the same or a subset of the online one. But this could be my mistake...

That would be nice, but I haven't found a good way to generate all of those from a single "source of truth" , and as this is something I do in my spare time writing and maintaining docs more than once is not on the table.

You could have a markdown document as the source, but I found markdown to man page conversion to be mediocre (thus no man page currently). Generating help output from markdown might be viable, but I don't know what exists there already (and I'm not keen on writing something custom just for this project).

elbowz commented 5 months ago

That would be nice, but I haven't found a good way to generate all of those from a single "source of truth" , and as this is something I do in my spare time writing and maintaining docs more than once is not on the table.

Good point! Maybe a simple note in the online documentation, something like:

For more detailed information about XXXYYY see output of cmd $ chezmoi_modify_manager --help

or vice versa, if you prefer to maintain the documentation online, put a simple link in the man.

VorpalBlade commented 5 months ago

Implemented documentation (and a way to quieten the warning) in #111. I decided to not change the evaluation order, but just document the current one. Summary of that decision:

The last point is what really made the decision for me.