faster-cpython / ideas


Dict API and leaner and faster lru-cache #450

Open matthiasgoergens opened 2 years ago

matthiasgoergens commented 2 years ago

The implementations of functools.lru_cache and collections.OrderedDict stem from a time when regular dicts did not preserve insertion order. Hence their C implementations each use a custom-written linked list.

Ever since regular dicts changed to a compact representation and started preserving insertion order, there have been suggestions on and off about simplifying those two offshoots.

Serhiy Storchaka did some great work in this area, and even wrote three prototypes for how to simplify lru_cache. (Those are the only prototypes I am aware of; if you know of other prior work, please contribute a link!)

Unfortunately, Serhiy's code had a small bug that caused his most promising prototype to take amortised O(cachesize) time for a cache miss, instead of amortised O(1). At the time, that small flaw in the implementation consigned his clever idea to the dustbin of history.

Recently, motivated by some algorithmic work I was doing, I had some reason to look into this area. I wrote a prototype implementation and ran some benchmarks against current 'main'. (I did that before I learned of Serhiy's excellent prior work. I posted about my exploration and got some good feedback from Serhiy himself and from the ever knowledgeable Raymond Hettinger. Christian Heimes suggested I post here to get more feedback.)

First, the obvious: ditching the linked lists saves 7 machine words per cache entry (56 bytes on a 64-bit build).

Second, the expected: the more misses your workload has, the bigger the prototype's amortised speed advantage over 'main'.

Third, the surprising: the prototype has better amortised performance than 'main' if there are any misses at all (even as low as 1%); it has about the same amortised performance as 'main' only in the special case of a perfect 100% hits-only workload. The latter is the best case for the linked-list implementation in 'main'.

How the magic is done

The prototype adds two (or three) new C-functions to the internal dict API:

// Opaque blob for anyone outside of the dict internals.  It remembers how
// many dead slots at the front of the entry array can be skipped.
typedef struct {
    PyDictKeysObject *sentinel;
    Py_ssize_t skip_empty;
} PyDictFinger;
// Create a fresh finger with nothing to skip yet.
PyAPI_FUNC(PyDictFinger) _PyDict_NewFinger(void);
// Delete the oldest item (the front of the insertion order), advancing the
// finger past already-deleted slots.
PyAPI_FUNC(int) _PyDict_DelOldest(PyDictObject *mp, PyDictFinger *finger);
// Like _PyDict_GetItem_KnownHash(), but also move the item to the back of
// the dict, if found.
PyAPI_FUNC(PyObject *) _PyDict_GetAndPushBack_KnownHash(
    PyObject *op, PyObject *key, Py_hash_t hash);

Feel free to skip reading the draft implementations; I include them here for completeness only:

PyDictFinger _PyDict_NewFinger(void) {
    return (PyDictFinger) {.sentinel = NULL, .skip_empty = 0};
}

int
_PyDict_DelOldest(PyDictObject *mp, PyDictFinger *finger)
{
    PyObject *key, *value;
    Py_hash_t hash;

    if(finger->sentinel != mp->ma_keys) {
        finger->sentinel = mp->ma_keys;
        finger->skip_empty = 0;
    }

    if(!_PyDict_Next((PyObject *)mp, &finger->skip_empty, &key, &value, &hash)) {
        return -1;
    }
    return delitem_common(mp, hash, finger->skip_empty-1, value);
}

/* Same as _PyDict_GetItem_KnownHash() but also moves the item to the back of
   the dict, if found.
*/
PyObject *
_PyDict_GetAndPushBack_KnownHash(PyObject *op, PyObject *key, Py_hash_t hash)
{
    Py_ssize_t ix; (void)ix;
    PyDictObject *mp = (PyDictObject *)op;
    PyObject *value;
    PyDictKeysObject *keys;

    if (!PyDict_Check(op)) {
        PyErr_BadInternalCall();
        return NULL;
    }

  resized:

    // First: check whether we can find the key.  If not, return nothing.
    ix = _Py_dict_lookup(mp, key, hash, &value);
    assert(ix >= 0 || value == NULL);
    if(value == NULL) {
        return NULL;
    }
    // If we are here, we have found the item.
    keys = mp->ma_keys;

    // TODO: double check that this logic also works for split tables.
    // I think PyDict_SetDefault suggests that it does.
    if(ix + 1 >= keys->dk_nentries) {
        // We can skip the shuffling, if our entry is already at the end.
        return value;
    }

    keys->dk_version = 0;
    mp->ma_version_tag = DICT_NEXT_VERSION();

    if (_PyDict_HasSplitTable(mp)) {
        delete_index_from_values(mp->ma_values, ix);
        _PyDictValues_AddToInsertionOrder(mp->ma_values, ix);
        return value;
    }

    // NOTE: dk_nentries can grow up to USABLE_FRACTION(DK_SIZE(keys)), but
    // other parts of the code seem to take dk_usable as the limit.
    if (keys->dk_nentries >= USABLE_FRACTION(DK_SIZE(keys))) {
        /* Not enough space, need to resize. */

        // Not sure about this unicode business.  The logic is adapted from
        // insertdict.
        int need_unicode = (DK_IS_UNICODE(keys) && !PyUnicode_CheckExact(key)) ? 0 : 1;
        if (insertion_resize(mp, need_unicode) < 0) {
            return NULL;
        }
        goto resized;
    }

    Py_ssize_t hashpos = lookdict_index(keys, hash, ix);
    Py_ssize_t nix = keys->dk_nentries;

    if (DK_IS_UNICODE(keys)) {
        PyDictUnicodeEntry *ep = &DK_UNICODE_ENTRIES(keys)[ix];
        DK_UNICODE_ENTRIES(keys)[nix] = *ep;
        ep->me_key = NULL;
        ep->me_value = NULL;
    }
    else {
        PyDictKeyEntry *ep = &DK_ENTRIES(keys)[ix];
        DK_ENTRIES(keys)[nix] = *ep;

        ep->me_key = NULL;
        ep->me_value = NULL;
        ep->me_hash = 0;
    }
    dictkeys_set_index(keys, hashpos, nix);
    // We are using up space in the entries, but not in the indices.
    // So we only decrease dk_usable here, because insertdict (and other
    // places) don't check dk_nentries directly.
    keys->dk_usable--;
    keys->dk_nentries++;

    assert(value != NULL);

    return value;
}

An approximate Python rendering of the proposed lru-cache looks like this:

class lru_cache:
    def __init__(self, func, maxsize=128):
        self.maxsize = maxsize
        self.func = func
        self.d = {}
        self.finger = self.d.new_finger()  # new proposed C-function

    def __call__(self, *args):
        try:
            return self.d.GetAndPushBack(args)  # new proposed C-function
        except KeyError:
            try:
                return self.d.setdefault(args, self.func(*args))
            finally:
                while len(self.d) > self.maxsize:
                    self.d.DelOldest(self.finger)  # new proposed C-function
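
To make the intended semantics concrete, here is a purely illustrative pure-Python sketch that uses only today's guaranteed dict behaviour (insertion order is preserved). It is not the proposed implementation: it emulates GetAndPushBack with pop() plus re-insertion, and DelOldest with next(iter(...)). The latter has to scan past slots freed at the front of the dict, which is exactly the amortised O(cachesize) pitfall that the finger / skip_empty bookkeeping is meant to avoid.

import functools

class naive_dict_lru_cache:
    # Illustration only: same observable behaviour, but without the new C hooks.
    def __init__(self, func, maxsize=128):
        self.func = func
        self.maxsize = maxsize
        self.d = {}

    def __call__(self, *args):
        try:
            value = self.d.pop(args)        # hit: remove the entry ...
        except KeyError:
            value = self.func(*args)        # miss: compute the value
        self.d[args] = value                # ... and (re)insert it at the back
        while len(self.d) > self.maxsize:
            del self.d[next(iter(self.d))]  # evict the oldest (front) entry
        return value

@functools.partial(naive_dict_lru_cache, maxsize=2)
def square(x):
    return x * x

print([square(n) for n in (1, 2, 1, 3)])  # [1, 4, 1, 9]; the second call for 1 is a hit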

I wrote 'two (or three)' new C-functions above because I also have a prototype that dispenses with the 'finger' blob and sticks 'skip_empty' into struct _dictkeysobject directly (at the cost of one extra machine word for each regular dict), thus getting rid of _PyDict_NewFinger.

Benchmarks

For runtime benchmarks, I built both 'main' and the prototype with ./configure --enable-optimizations && make, and used this snippet on my mains-powered laptop:

import functools
import gc
import math
import random
import time

def bench_mixed(cachesize, reps, hits):
    r = random.Random()
    f = functools.lru_cache(maxsize=cachesize)(lambda x: None)

    if hits > 0:
        keys = [r.randrange(math.ceil(cachesize/hits)) for _ in range(reps)]
    else:
        keys = list(range(reps))
        r.shuffle(keys)
    gc.collect()

    start = time.perf_counter()
    for k in keys:
        f(k)
    end = time.perf_counter()
    return end - start

Then I ran it like this:

hits = float(sys.argv[2])
reps = 1_000_000
cachesize = random.randrange(100_000)
t = bench_mixed(cachesize, reps, hits)

At the end, Gnuplot made scatterplots of cache size vs. time taken, plus linear regressions.

[Scatterplot (summary_none): cache size vs. total runtime for 'main' and the prototype, with linear regression lines.]

As I wrote in the introduction, the prototype is generally faster.

A note on statistics: running the benchmark once is quite noisy. I decided to show exactly that noise in the scatterplots, because it gives at least an indication of the variance you can expect in real life. Then I added the linear regression lines to make the visual task of extracting a signal easier on the eyes.

For memory benchmarking I used the following:

import functools as ft
import gc
import math
import random
import tracemalloc

def bench_mixed_cold(cachesize, reps, hits):
    r = random.Random()
    f = ft.lru_cache(maxsize=cachesize)(lambda x: None)

    keys = [r.randrange(math.ceil(cachesize/hits)) for _ in range(reps)]

    gc.collect()
    tracemalloc.start()
    first_size, first_peak = tracemalloc.get_traced_memory()
    tracemalloc.reset_peak()

    for k in keys:
        f(k)

    second_size, second_peak = tracemalloc.get_traced_memory()
    return second_size - first_size

Again with random cache sizes of up to 100,000 entries. Gnuplot produced this scatterplot and linear regressions:

[Scatterplot (lru_mem): cache size vs. traced memory growth for 'main' and the prototype, with linear regression lines.]

The slopes of the linear regressions show a growth of 93 bytes per entry for the prototype and 144 bytes per entry for 'main'. That is about 50% more memory usage for 'main', a difference of 51 bytes per entry. Given that the real graphs are more of a step function than strictly linear, that is pretty decent agreement with the theoretical expectation of 56 bytes saved.
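
As a quick back-of-the-envelope check of those slopes (assuming a 64-bit build, i.e. 8 bytes per machine word):

words_saved_per_entry = 7          # linked-list fields dropped per cache entry, as noted above
bytes_per_word = 8                 # 64-bit build assumed
print(words_saved_per_entry * bytes_per_word, 144 - 93)  # 56 (theoretical) vs 51 (measured)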

Downsides

As you can see from the plots, the exact aggregate runtimes both for 'main' and the prototype vary a lot. (And that was the case even when I ran some benchmarks with the sequence of keys generated with a fixed random seed of 0.)

Naturally, when we look at runtimes of individual operations, the variance is even bigger than for the aggregate runtime.

In addition to those natural variations, both the prototype and 'main' have to compact their dictionaries occasionally, and that compaction takes time linear in the number of cache entries.

Naively speaking, the proposal compacts its dictionary once every O(n) operations, hits and misses alike, because moving an item to the back consumes a fresh slot in the entry array even on a hit.

'main', by contrast, compacts its dictionary only once every O(n) misses, and never compacts after a hit.

For a workload with very few misses, the proposal therefore does comparatively more compactions. Surprisingly enough, though, the proposal's common case is so much faster that it more than makes up for the extra compactions even in the case that favours 'main' the most, i.e. a perfect 100% hits-only workload.

General thoughts

The least-recently-used cache is an important tool in the standard library in its own right, but it is also a good testbed for showing what extra functionality dicts could expose to let users implement their own data structures and algorithms more conveniently and with better performance.

I started down this path because I couldn't believe that the recommended way to randomly sample items from a dict takes O(n) time and space in Python, and that there was no better way, either.

I managed to write a proof-of-concept of amortised O(1) random.choice from dicts using nothing but the public C API. Alas, it exploits an undocumented feature in the stable C-API. This discussion explains the technique.

It might be worthwhile to expose that undocumented feature more properly, too.

In general, there is a tension between keeping the public interface small, so that we don't hem in future changes or alternative Python implementations, and providing access to an expressive set of primitives.

But there's an opportunity here: dicts are now guaranteed to preserve insertion order, and that guarantee already constrains implementations. That gives us space to add a few carefully considered primitives without adding new design constraints.

The two operations in this lru-cache prototype, GetAndPushBack and DelOldest, are just two examples. (The latter could perhaps be integrated as an option for popitem to take from the front; I haven't thought about the specifics.)
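
For comparison only (this is existing functionality, not part of the proposal): collections.OrderedDict already exposes exactly this kind of front-removal at the Python level via popitem(last=False).

from collections import OrderedDict

od = OrderedDict(a=1, b=2, c=3)
print(od.popitem(last=False))  # ('a', 1) -- removes the oldest entry
print(od.popitem())            # ('c', 3) -- the default removes the newest entry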

Ideally we would expose them at the Python level as well, but even adding them only to CPython's C API would already be useful.

We might also want to look at OrderedDict and see whether careful attention to implementation detail can do for it what this proposal does for lru-cache. I remember reading the code for a previous attempt, but I can't find it again right now. I quite liked the ideas there, but I was not sure how careful the implementation was.

EDIT: it's at https://github.com/python/cpython/issues/75448

Links to prototype implementations

You can find my original prototype on GitHub; that's the one I ran the benchmarks on.

Raymond helpfully pointed out that my original prototype made _functoolsmodule.c know too much about the internals of dictobject.c. So I made a variant that uses an opaque finger as described above.

This is the most conservative version: it makes no changes to any of the existing dict operations nor to the existing C structs for dicts, so existing users of dict are completely unaffected. It only affects lru-cache.

Last, but not least, the cleaner version makes the regular dict itself keep track of how many items to skip at the beginning.

This last version would affect other users of dict, because it adds one extra machine word of space overhead to every dict.

If you want to get tricky: DK_ENTRIES is an array of PyDictKeyEntry (or PyDictUnicodeEntry), and both variants have enough redundancy that you could sneak the required information in there without using an extra word.

typedef struct {
    /* Cached hash code of me_key. */
    Py_hash_t me_hash;
    PyObject *me_key;
    PyObject *me_value; /* This field is only meaningful for combined tables */
} PyDictKeyEntry;

typedef struct {
    PyObject *me_key;   /* The key must be Unicode and have hash. */
    PyObject *me_value; /* This field is only meaningful for combined tables */
} PyDictUnicodeEntry;

You could use ep->me_key == NULL in the first entry to mean that (Py_ssize_t)(ep->me_value) has information about how many non-active entries to skip.

methane commented 2 years ago

Your finger approach is not safe. The current dictresize() doesn't reuse the same memory block for the new keys object, but after two dictresize() calls the original memory block for the keys object can be reused.

You could use ep->me_key == NULL in the first entry to mean that (Py_ssize_t)(ep->me_value) has information about how many non-active entries to skip.

I prefer this. FYI, a similar hack was used before Python 3.6 for O(1) dict.popitem().

https://github.com/python/cpython/blob/75c1ca7b6c546ef674eb40bfe47f41e6045dc99b/Objects/dictobject.c#L47-L50

https://github.com/python/cpython/blob/75c1ca7b6c546ef674eb40bfe47f41e6045dc99b/Objects/dictobject.c#L2576

matthiasgoergens commented 2 years ago

Your finger approach is not safe. The current dictresize() doesn't reuse the same memory block for the new keys object, but after two dictresize() calls the original memory block for the keys object can be reused.

Thanks for confirming my suspicion! I just went with that as the sentinel, because that's what collections.OrderedDict does. I wonder why it is safe there? Perhaps it is somehow guaranteed that they renew their sentinel often enough?

I think in the specific context of the lru-cache prototype we could make this work, if we made sure to renew the sentinel (or just get a new finger) at least once per resize. The only case where that matters is when we have 100% hits.

But in any case: I much prefer not keeping an external 'finger' blob around, and just making dict remember how many non-active entries it has at its beginning. Thanks for digging up the prior art for that. I'll have a look at your links!