Vladimir-csp / xdg-terminal-exec

Proposal for XDG Default Terminal Execution Specification and shell-based reference implementation.
GNU General Public License v3.0

Cache #36

Closed Vladimir-csp closed 3 months ago

Vladimir-csp commented 9 months ago

Speed up things by caching selected Exec, ExecArg.

Invalidate the cache when something changes.

@fluvf, please take a look at this. What results do you get on subsequent runs with clean configs and XTE_CACHE_ENABLED=true?

fluvf commented 9 months ago

The cache seems to halve execution time on all my machines, but there's a bug on main with the entry ID validator where many valid(?) entries are being skipped loudly. That seems to be the reason for most of the time saved since my last measurement.

I've been trying to write comprehensive tests for the functions, and adding another complex code path gives me pause. The idea is great, don't get me wrong, but maybe it could be implemented in a different way? Maybe somehow reusing the logic that already exists around the xdg-terminals folder... I'll keep thinking about it and open an issue if I come up with something clever.

Vladimir-csp commented 9 months ago

there's a bug on main with the entry ID validator where many valid(?) entries are being skipped loudly

Did some tweaks on validators, did it help? What kind of entries are getting rejected?

Vladimir-csp commented 9 months ago

Off topic: There's movement: https://gitlab.freedesktop.org/terminal-wg/specifications/-/merge_requests/3 @fluvf @mkhl @FreeSlave If you have some input, please share.

fluvf commented 9 months ago

Did some tweaks on validators, did it help? What kind of entries are getting rejected?

All entries that have a `-` in the ID; I can't remember off the top of my head if that's an illegal character.

fluvf commented 9 months ago

Off topic: There's movement: https://gitlab.freedesktop.org/terminal-wg/specifications/-/merge_requests/3 @fluvf @mkhl @FreeSlave If you have some input, please share.

Some quick thoughts; unfortunately I don't have a GitLab account, so I'll leave them here:

I agree that xdg-terminals dir support could be dropped, or if necessary replaced with an env variable that controls the subdir searched. I don't think backwards compatibility should be a concern when moving to 1.0. I also don't think keeping it for slow machines should be a priority; even if it were, my old laptop now execs the script in ~0.6 secs (faster with the cache), which is plenty fast. I agree that "first run slow, subsequent runs fast" is what the script should aim for, which will be solved by caching.

About caching: based on some simple tests on my laptop, the bottleneck is opening files rather than reading them once opened. With that in mind, I don't think it's worth creating a custom file format outright to store all the information used to exec a terminal. Maybe a simple

if sha1sum --status -c "$CACHE_FILEPATH"; then  # --status: GNU coreutils; BusyBox spells it -s
    get_path_from "$CACHE_FILEPATH"
    contains "$ENTRY_IDS" "$extracted_path" # ID list before found fallback entries
    read_entry_path "$extracted_path" || fallback
else
    find_entry
    sha1sum "$entry_path" > "$CACHE_FILEPATH"
fi

where $CACHE_FILEPATH contains

somehash  /path/to/entry/file.desktop

Maybe there's an even simpler way, but this shouldn't require big code changes.
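For illustration, the check-or-rebuild cycle above could be reduced to something runnable like this (the cache path and helper names are stand-ins, not the script's real ones; `--status` is the GNU coreutils spelling):

```shell
#!/bin/sh
# Minimal sketch of the proposed validate-or-rebuild cycle.
cache_valid() {
    # $1: cache file holding "hash  /path/to/entry.desktop";
    # sha1sum -c re-hashes the stored path and succeeds only if the
    # file content is unchanged
    [ -f "$1" ] && sha1sum --status -c "$1" 2>/dev/null
}
write_cache() {
    # $1: cache file, $2: chosen entry path
    sha1sum "$2" > "$1"
}
```

On a hit the script would read the path back out of the cache file; on a miss it falls through to the normal entry search and rewrites the cache.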

Vladimir-csp commented 9 months ago

if sha1sum -cs $CACHE_FILEPATH; then

Getting the checksum of a single entry's content leaves out other entries and configs. Configs can change; terminals can be installed or removed. That is why I sum a listing of everything potentially relevant.

Which is faster, reading a single file or listing the dir where the file resides? How slow is this line of the script on your machine? `LANG=C ls -LRl ${hash_paths} 2> /dev/null | md5sum 2> /dev/null`. If the cached run is faster, then it seems fast enough.
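For reference, the listing-based key boils down to a few lines (the directories passed in are placeholders here):

```shell
#!/bin/sh
# Sketch of the content-agnostic cache key: hash the recursive long
# listing of the watched directories. Any file added, removed, renamed,
# resized, or re-dated changes the key, without ever reading file
# contents.
make_cache_key() {
    # $@: directories whose state should invalidate the cache
    LANG=C ls -LRl "$@" 2>/dev/null | md5sum | cut -d ' ' -f 1
}
```

`LANG=C` pins the sort order and date format so the listing, and therefore the key, is repeatable across runs.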

All entries that have a - in the ID, can't remember off the top of my head if that's an illegal character

`-` is legal and should work, and it does work on my system, tested with dash and bash as sh.

unfortunately I don't have a gitlab account

It supports signing in with a GitHub account.

fluvf commented 9 months ago

if sha1sum -cs $CACHE_FILEPATH; then

Getting checksum of a single entry content leaves out other entries and configs. Configs can change, terminals be installed, removed. That is why I sum listing of everything potentially relevant.

Good point! The used config(s) could be checked and hashed as well. Adding and checking them using sha1sum -c is simple enough. The realistic worst case would mean hashing one, maybe two, additional files. I don't think the script should care about the status of other entries.

Which is faster, reading a single file or listing a dir where file resides? How slow is this line of the script on your machine? LANG=C ls -LRl ${hash_paths} 2> /dev/null | md5sum 2> /dev/null. If cached run is faster, then fast enough it seems.

I'm less concerned with speed and more with the complexity. Writing tests that would catch all corner cases of the implementation seems hard, as ls of directories can produce all sorts of output. File ownership changes, or timestamps, or simply (un)installing additional irrelevant applications would all invalidate the cache.

The only corner case I can foresee with my proposal is (un)installing terminals with an empty config, where the chosen terminal would have changed based on their alphabetical sorting order. This would have to be documented, but a fix seems unnecessary.

mcepl commented 9 months ago

Off topic: There's movement: https://gitlab.freedesktop.org/terminal-wg/specifications/-/merge_requests/3 @fluvf @mkhl @FreeSlave If you have some input, please share.

I shared my input there: https://gitlab.freedesktop.org/terminal-wg/specifications/-/merge_requests/3#note_2193993

Vladimir-csp commented 9 months ago

Adding and checking them using sha1sum -c is simple enough

This would actually read files. Both when creating and checking the cache.

Writing tests that would catch all cornercases with the implementation seems, hard, as ls of directories can produce all sorts of output File ownership changes, or timestamps, or simply (un)installing additional irrelevant applications would all invalidate the cache

Yes, this method of invalidation is quite sensitive and may be prone to false positives, but this way it is content-agnostic and does not read any files. In order to differentiate entries of terminals, one has to read them. Suppose a new terminal is installed: how would you invalidate the cache relying on *sum -c if the previous cache knows nothing about the new file?

ls -LRl is a POSIX-compliant command, expected to be alphabetical, and is done in a single call for everything. Specific output quirks do not matter, as long as the output is repeatable. The only two plausible false positive cases I can think of are:

How often does that happen, compared to how often terminals are launched?

fluvf commented 9 months ago

... and does not read any files.

It still accesses the disk, but again, I don't think that should be the deciding factor. I'm concerned about the implementation's complexity and maintainability.

Suppose a new terminal is installed, how would you invalidate the cache relying on *sum -c if previous cache knows nothing about new file.

I still don't understand why the script should care about this, if the configs haven't changed?

I think following the suggestion made by meven would be an improvement regardless. You can even check the modification times of app folders to see if any entries have been added / removed.
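A directory-mtime freshness check along those lines could look like this (the function name is made up for illustration). Note that it only catches added or removed entries: editing a file in place does not touch the parent directory's mtime.

```shell
#!/bin/sh
# Sketch: treat the cache as stale if any applications directory is
# newer than the cache file. A directory's mtime changes when entries
# are created or deleted inside it.
cache_is_fresh() {
    # $1: cache file, remaining args: directories to watch
    cache=$1
    shift
    [ -f "$cache" ] || return 1
    for dir in "$@"; do
        # -nt: true if $dir was modified more recently than $cache
        [ "$dir" -nt "$cache" ] && return 1
    done
    return 0
}
```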

Vladimir-csp commented 9 months ago

I still don't understand why the script should care about this, if the configs haven't changed? You can even check the modification times of app folders to see if any entries have been added / removed

fluvf commented 9 months ago

I've thought about this and have a hopefully better-informed idea.

It doesn't matter if the config files have changed, only the resulting ID list is important. Removal of random entries doesn't matter; only the removal of the cached entry and newly added / modified entries are important. The cache file, at the very least, has to store which entry and which list of IDs were used during the previous run. EDIT: And the value of $XDG_CURRENT_DESKTOP also has to be stored and checked.

So the logic could be:

  1. Find and read any config files as normal
  2. Check / read the cache file, bail out if missing
  3. Compare the ID lists, bail out if not matching. These checks should be in their own function, so the script could do find_cached_entry || find_entry || exit. Maybe one that still calls find_entry internally.
  4. Find entries, filter ones older than the cache file, include the cached path. This can all be done using find -newer ... -o -path .... Maybe move the find call to a function, or prepend the values to it using alias find='...'.
  5. Check found entries as normal, bail out if none are valid (meaning the file at the cached path was modified). Best case, this only checks the cached entry path. No need to check any other (older) entries, as the cached entry was already chosen over them previously.
  6. Write found entry path and ID list (or its hash) to cache file

With this, most of the existing code / functions could be reused. I doubt this would result in egregious amounts of disk access: best case it's 3 file reads, whatever find does, and writing the cache.
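Step 4 can be sketched with a single find call (the helper name and argument layout here are illustrative, not the script's actual API): only entries modified after the cache was written, plus the cached path itself, survive the filter, so unchanged older entries are never re-parsed.

```shell
#!/bin/sh
# Sketch of step 4: list candidate entries that need (re-)checking.
find_candidates() {
    # $1: cache file, $2: cached entry path, rest: entry directories
    cache=$1 cached_entry=$2
    shift 2
    # -newer: entries touched after the cache was written;
    # -o -path: always re-include the previously chosen entry
    find "$@" -name '*.desktop' \
        \( -newer "$cache" -o -path "$cached_entry" \) -print 2>/dev/null
}
```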

Thoughts? I could write a working mock-up later if you don't have time to make one yourself.

fluvf commented 9 months ago

About reading the cache file: the script could use source to make it simpler. There should also be something to check permissions, so random users can't inject shell code through it. umask 0022 should be enough.
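A guarded source along those lines might look like this (a sketch only: the function name is made up, and `stat -c` is a GNU coreutils assumption). It refuses the cache unless we own it and neither group nor other can write to it, and only then sources it.

```shell
#!/bin/sh
# Sketch: source the cache file only if it cannot have been tampered
# with by another user.
safe_source() {
    [ -f "$1" ] || return 1
    perms=$(stat -c '%a' "$1") || return 1   # e.g. "600", "644"
    owner=$(stat -c '%u' "$1") || return 1   # numeric uid of owner
    [ "$owner" = "$(id -u)" ] || return 1    # must be owned by us
    case $perms in
        # reject if the group or other digit has the write bit (2) set
        *[2367]? | *?[2367]) return 1 ;;
    esac
    . "$1"
}
```

Combined with umask 0022 (or 0077) at write time, the permission check should never trip in practice; it only guards against someone else planting or loosening the file.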

Vladimir-csp commented 9 months ago

And the value of $XDG_CURRENT_DESKTOP also has to be stored and checked

Yeah, I've added it.

I could write a working mock-up later, if you don't have time to make one yourself

Yes, thank you, I do not fully get it, so a mock-up would be very helpful.

About reading the cache file: the script could use source to make it simpler. There should also be something to check permissions, so random users can't inject shell code through it. umask 0022 should be enough.

If the user has a wider umask than 0022 during application launching, something is probably wrong. OK, I will even pinch it down to 0077 during cache write for good measure.

source is to be considered, but currently the cache read is a rather strict, fail-early type of algorithm.