dimo414 / bash-cache

Transparent caching layer for bash functions; particularly useful for functions invoked as part of your prompt.
GNU General Public License v3.0

bc::memoize: support caching more than one result at a time #23

Open waptaff opened 3 years ago

waptaff commented 3 years ago

Hi,

I was a bit saddened when I realized that bc::memoize would only keep the freshest invocation in memory:

foo() { … } && bc::memoize foo

foo 3 # Runs cold, normal.
foo 3 # Runs memoized, cool.
foo 4 # Runs cold, normal.
foo 4 # Runs memoized, cool.
foo 3 # Runs cold, sadly.
foo 4 # Runs cold, sadly.

It appears bc::cache supports the above usage pattern, but OTOH polluting the filesystem with files (and I/O) is not very appealing for my use case.

I'm curious, is there some fundamental reason that would make supporting the above usage pattern in bc::memoize a headache?

Thanks for your library, it really fills one of the many bash gaps — I can almost do away with function static variables thanks to you :)

dimo414 commented 3 years ago

Thanks for the feedback! You may find it interesting to dig into how bc::memoize is implemented, because it explains most of this limitation:

# I tidied up the actual output of `declare -f` to make it more readable, but these are the important details

$ demo() { printf "'%s' " "$@"; } && bc::memoize demo

$ declare -f demo
demo() {
    bc::memoize::demo "$@"
}

$ demo
''

$ declare -f demo
demo() {
    if (( $# == 0 )); then
        printf "%s" \'\'\ ;
    else
        bc::memoize::demo "$@";
    fi
}

$ demo 1 2
'1' '2'

$ declare -f demo
demo() {
    if (( $# == 2 )) && [[ "${1}" == 1 ]] && [[ "${2}" == 2 ]]; then
        printf "%s" \'1\'\ \'2\'\ ;
    else
        bc::memoize::demo "$@";
    fi
}

That's right! When the memoized function is invoked cold, the function is re-defined and the backing call's output is cached directly in the function body! If that doesn't disgust you I suggest seeking professional help 😉

I implemented this somewhat on a whim while trying to think about how I could avoid bc::cache's I/O requirements, exactly like your concern. I was actually most focused on the zero-argument use case since I cache many functions that don't take any arguments, and a lot of bc::cache's complexity and overhead comes from properly handling arguments and multiple cache keys. I was basically trying to see what I could optimize for if I targeted supporting a very finite (i.e. size=1) cache capacity.


polluting the filesystem with files (and I/O) is not very appealing for my use case.

Could you elaborate on your constraints, just so I understand them? I certainly appreciate that I/O overhead is undesirable, but maybe it's not as bad as you think. In particular, if you're able to point bash-cache at a tmpfs partition the cache will actually live in memory even though it makes file I/O calls. In my experience this proves quite fast.
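
To make the tmpfs suggestion concrete, here's a generic check (GNU coreutils stat on Linux; not a bash-cache feature, and how bash-cache actually picks its cache directory is whatever the README documents) for confirming a candidate directory is memory-backed before pointing the cache at it:

candidate=/dev/shm
# Not part of bash-cache: just verifying the candidate directory sits on tmpfs,
# so file I/O against it never touches a disk.
if [[ "$(stat -f -c %T "$candidate" 2>/dev/null)" == "tmpfs" ]]; then
    echo "$candidate is tmpfs-backed"
fi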

is there some fundamental reason that would make supporting the above usage pattern in bc::memoize a headache?

Yes and no. It's certainly possible to have an in-memory multi-keyed cache (e.g. via an associative array, or possibly by extending the dynamic-redeclaration pattern in bc::memoize to support a slightly larger history). But it wasn't my target use-case when I was designing it, and there are tradeoffs with doing so that start to get messy.
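
For illustration only (this is not how bc::memoize works today, and expensive_fn below is just a placeholder), a bare-bones associative-array cache might look like:

declare -A __memo  # requires bash 4+ for associative arrays
memo_call() {
    local key="$*"                     # naive cache key; collides for argument lists that join to the same string
    if [[ -z "${__memo[$key]+set}" ]]; then
        __memo["$key"]="$("$@")"       # note: command substitution strips trailing newlines
    fi
    printf '%s\n' "${__memo[$key]}"
}

# memo_call expensive_fn 3   # cold
# memo_call expensive_fn 3   # served from the array
# memo_call expensive_fn 4   # cold, but the entry for '3' is still cached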

A significant issue (which I actually need to call out expressly in the documentation) is that data cached in-process can't escape its subshell. So if you call a in-process-cached function in a command substitution or other subshell nothing "actually" gets cached. I'm not aware of any (reasonable) way to propagate data back to a parent shell without writing to the file system.
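
A quick way to see the problem (a toy global-variable "cache", not bash-cache itself):

counter=0
bump() { counter=$(( counter + 1 )); echo "${counter}"; }

bump                 # prints 1; the parent shell now sees counter=1
result="$(bump)"     # runs in a subshell; result captures 2 ...
echo "${counter}"    # ... but the parent still prints 1 - the increment was lost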

An additional concern is the loss of fidelity. Bash does a surprisingly poor job of handling data losslessly (see bc::_read_input for some gory details, and note that bc::cache doesn't even use that anymore). For bc::memoize I was willing to explore loosening that requirement, but it is something I care about.
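
As a trivial example of what I mean (generic bash behavior, not bc::_read_input itself):

out="$(printf 'line\n\n\n')"
echo "${#out}"    # 4, not 7 - command substitution silently strips trailing newlines
data="$(printf 'a\0b')"
echo "${#data}"   # 2 - the NUL byte is dropped (bash 4.4+ also prints a warning)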

There are probably additional reasons I'd remember if I started redesigning bc::memoize, but those are the important ones that come to mind.

waptaff commented 3 years ago

Thanks for the detailed reply!

When the memoized function is invoked cold the function is re-defined, and the backing call's output is cached directly in the function body! If that doesn't disgust you I suggest seeking professional help

I may need pro help; I don't find it disgusting. In other languages where static function variables exist, a first-approximation memoization implementation would also store the function output in the function itself:

function foo(int bar) {
    static cached_results[];
    if undefined(cached_results[bar]) {
        cached_results[bar] = expensive_processing(bar);
    }
    return cached_results[bar]; 
}

I certainly appreciate that I/O overhead is undesirable, but maybe it's not as bad as you think.

Using cache files just feels dirty, and avoiding them whenever possible is optimal in my book: they need careful design to avoid race conditions, they need to be cleaned up, they increase the attack surface of a program, and so on. You may already have correctly guessed I'm no fan of Python's *.pyc executable cache generation either.

A significant issue (which I actually need to call out expressly in the documentation) is that data cached in-process can't escape its subshell.

I see; enlightening! Files make sense in that context; I can't figure out a more elegant way to transport variables back to a parent process.


One way I can get around the “one result” limitation of the current bc::memoize implementation is to use currying; should you one day want to augment that function, it might be an easy way out:

#!/usr/bin/env bash

source bash-cache.sh

function curry() {
    local -r new_function="${1}"; shift
    local -r old_function="${1}"; shift
    # Quote each curried argument so it survives the round-trip through eval.
    local -r parameters="$(printf '%q ' "${@}")"
    local    cmd
    cmd="
        function ${new_function}() {
            \"${old_function}\" ${parameters} \"\${@}\";
        }
    "
    eval "${cmd}"
}

function sleep_print() {
    sleep "${1}"
    printf 'I slept %s seconds\n' "${1}"
}

curry sleep_print_2 sleep_print 2 && bc::memoize sleep_print_2
curry sleep_print_3 sleep_print 3 && bc::memoize sleep_print_3

sleep_print_2 # Runs cold.
sleep_print_2 # Runs hot.
sleep_print_3 # Runs cold.
sleep_print_3 # Runs hot.
sleep_print_2 # Runs hot.
sleep_print_3 # Runs hot.

dimo414 commented 3 years ago

a first approximation memoization implementation would also store the function output in the function itself

Yeah but that's not redefining the function itself on every cold invocation :)

they need careful design to avoid race conditions, they need to be cleaned up, they increase the attackable surface area of a program

Tell me about it....

use currying

I had actually meant to suggest currying for memoized functions with a finite set of inputs, so thanks for calling that out. We definitely could consider something like that in a generalized solution, but if we did so under the covers the behavior might be surprising, since the curried function names would presumably be lossy, and we'd need to come up with an invalidation scheme. I'd also need to investigate exactly how much memory overhead this pattern incurs; one benefit(?) of bc::cache using the filesystem is that we're not necessarily limited by free RAM.
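
One crude way to eyeball that overhead (just a rough gauge, not a real profiler): since the cached output lives in the redefined function body, measuring the body's size after a few cold calls approximates how much text the cache is holding.

# Size, in bytes, of the current definition of a memoized function named `demo`.
declare -f demo | wc -c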