matthiasgoergens opened this issue 2 years ago
Your finger approach is not safe. The current `dictresize()` doesn't reuse the same memory block for the new keys object, but after two `dictresize()` calls the original memory block for the keys object can be reused.

You could use `ep->me_key == NULL` in the first entry to mean that `(Py_ssize_t)(ep->me_value)` holds the number of non-active entries to skip. I prefer this. FYI, a similar hack was used before Python 3.6 for O(1) `dict.popitem()`.
> Your finger approach is not safe. Current `dictresize()` doesn't reuse the same memory block for the new keys object. But after two `dictresize()` calls, the original memory block for the keys object can be reused.
Thanks for confirming my suspicion! I just went with that as the sentinel, because that's what `collections.OrderedDict` does. I wonder why it is safe there? Perhaps it is somehow guaranteed that they renew their sentinel often enough?
I think in the specific context of the lru-cache prototype we could make this work, if we made sure to renew the sentinel (or just get a new finger) at least once for every resize. The only case where that renewal doesn't happen anyway is a workload of 100% hits.
But in any case: I much prefer not keeping an external 'finger' blob around, and instead making the dict itself remember how many non-active entries it has at its beginning. Thanks for digging up the prior art for that; I'll have a look at your links!
The implementations of `functools.lru_cache` and `collections.OrderedDict` stem from a time when regular dicts did not preserve order. Hence their C implementations each use a custom-written linked list.

Ever since regular dicts switched to a compact representation and started preserving insertion order, there have been on-and-off suggestions about simplifying those two offshoots.
Serhiy Storchaka did some great work in this area, and even wrote three prototypes for simplifying `lru_cache`. (Those are the only prototypes I am aware of; if you know of other prior work, please contribute a link!)

Unfortunately, Serhiy's code had a small bug that caused his most promising prototype to take amortised O(cachesize) time for a cache miss, instead of amortised O(1). At the time, that small flaw in the implementation consigned his clever idea to the dustbin of history.
Recently, motivated by some algorithmic work I was doing, I had some reason to look into this area. I wrote a prototype implementation and ran some benchmarks against current 'main'. (I did that before I learned of Serhiy's excellent prior work. I posted about my exploration and got some good feedback from Serhiy himself and from the ever knowledgeable Raymond Hettinger. Christian Heimes suggested I post here to get more feedback.)
First, the obvious: ditching the linked lists saves 7 machine words per cache-entry.
Second, the expected: the more misses your workload has, the bigger the prototype's amortised speed advantage over 'main'.
Third, the surprising: the prototype had better amortised performance than 'main' whenever there were any misses at all (even as few as 1%); it has about the same amortised performance as 'main' in the special case of a 100%-hits-only workload. The latter is the best case for the linked-list implementation in 'main'.
## How the magic is done
The prototype adds two (or three) new C-functions to the internal dict API:
Feel free to skip reading the draft implementations; I put them here for completeness only:
An approximate Python rendering of the proposed lru-cache looks like this:
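(The original snippet is not reproduced here. As a stand-in, the following minimal sketch only approximates the idea; `dict_get_and_push_back` and `dict_del_oldest` are hypothetical pure-Python helpers modelling the proposed `GetAndPushBack` and `DelOldest` primitives, not the real C functions.)

```python
# Rough sketch only, not the actual prototype.  The two helpers model the proposed
# C-level primitives: "look up a value and move its entry to the back" and
# "delete the oldest (front-most) entry".
def dict_get_and_push_back(d, key, default=None):
    """Return d[key] (or default) and make key the most recently inserted entry."""
    try:
        value = d.pop(key)      # removing and re-inserting moves the key to the end
    except KeyError:
        return default
    d[key] = value
    return value

def dict_del_oldest(d):
    """Remove and return the oldest item; dicts preserve insertion order."""
    key = next(iter(d))         # the front-most key is the least recently pushed back
    return key, d.pop(key)

def lru_cache(maxsize):
    def decorator(func):
        cache = {}
        sentinel = object()     # distinguishes "missing" from a cached None

        def wrapper(*args):
            value = dict_get_and_push_back(cache, args, sentinel)
            if value is sentinel:               # miss: compute, insert at the back
                value = func(*args)
                cache[args] = value
                if len(cache) > maxsize:
                    dict_del_oldest(cache)      # evict the least recently used entry
            return value
        return wrapper
    return decorator
```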
I wrote 'two (or three)' new C functions above because I also have a prototype that dispenses with the 'finger' blob and just sticks `skip_empty` into `struct _dictkeysobject` directly (at the cost of one extra machine word for each regular dict), thus getting rid of `_PyDict_NewFinger`.

## Benchmarks
For the runtime benchmarks, I built both 'main' and the prototype with `./configure --enable-optimizations && make`, and used this snippet on my mains-powered laptop:

Then I ran it like this:
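(Neither the snippet nor the command line is reproduced here. Purely as an illustration of the shape of the benchmark, something along the following lines would produce the `cachesize time` pairs the plots are built from; the cache sizes, key distribution, and iteration count are made up.)

```python
# Illustrative stand-in only, with made-up parameters -- not the benchmark actually used.
import random
import time
from functools import lru_cache

def one_run() -> None:
    cachesize = random.randrange(1, 100_000)        # assumed: random cache sizes
    keyspace = int(cachesize * 1.1)                 # assumed: roughly 10% misses

    @lru_cache(maxsize=cachesize)
    def f(x: int) -> int:
        return x

    keys = [random.randrange(keyspace) for _ in range(1_000_000)]
    start = time.perf_counter()
    for k in keys:
        f(k)
    print(cachesize, time.perf_counter() - start)   # one data point per run

if __name__ == "__main__":
    one_run()                                       # run many times; feed the output to Gnuplot
```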
At the end Gnuplot made scatterplots of cachesize vs time taken, and linear regressions.
As I wrote in the introduction, the prototype is generally faster.
A note on statistics: running the benchmark once is quite noisy. I decided to show exactly that noise in the scatterplots, because it gives at least an indication of the variance you can expect in real life. Then I added the linear regression lines to make the visual task of extracting a signal easier on the eyes.
For memory benchmarking I used the following:
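(The measurement code itself is not reproduced here. A minimal stand-in using `tracemalloc` shows the kind of per-cachesize memory figure the regression below is about; it is not the exact script used.)

```python
# Illustrative stand-in only -- fills an lru_cache of a random size and reports
# "cachesize peak_bytes", one data point per run.
import random
import tracemalloc
from functools import lru_cache

cachesize = random.randrange(1, 100_000)    # random cache sizes up to 100,000 entries

tracemalloc.start()

@lru_cache(maxsize=cachesize)
def f(x: int) -> int:
    return x

for i in range(cachesize):                  # fill the cache completely
    f(i)

_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(cachesize, peak)
```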
Again with random cache sizes of up to 100,000 entries. Gnuplot produced this scatterplot and linear regression:
The slopes of the linear regressions show a growth of 93 bytes per entry for the prototype and 144 bytes per entry for 'main'. That's about 50% more memory usage for 'main', or 51 bytes per entry. Given that the real graphs are more of a step function than strictly linear, that's pretty decent agreement with our theoretical expectation of 56 bytes (7 machine words) saved.
## Downsides
As you can see from the plots, the exact aggregate runtimes both for 'main' and the prototype vary a lot. (And that was the case even when I ran some benchmarks with the sequence of keys generated with a fixed random seed of 0.)
Naturally, when we look at runtimes of individual operations, the variance is even bigger than for the aggregate runtime.
In addition to those natural variations, both the prototype and 'main' have to compact their dictionaries occasionally, and that compaction takes time linear in the number of cache entries.
Naively speaking, the proposal compacts its dictionary once every O(n) operations, counting both hits and misses, whereas 'main' compacts its dictionary only once every O(n) misses and never compacts after a hit.

For a workload with very few misses, the proposal therefore does comparatively more compactions. However, surprisingly enough, the proposal's common case is so much faster that it more than makes up for the compactions, even in the case that favours 'main' the most, i.e. a 100%-hits-only workload.
## General thoughts
The least-recently-used cache is an important tool in the standard library in its own right, but it's also a good testbed for what extra functionality dicts could expose to let users implement their own data structures and algorithms more conveniently and with better performance.
I started my journey because I couldn't believe that the recommended way to randomly sample items from a dict took O(n) time and space in Python, and that there was no better way, either.
I managed to write a proof-of-concept of amortised O(1) `random.choice` from dicts using nothing but the public C API. Alas, it exploits an undocumented feature of the stable C API. This discussion explains the technique. It might be worthwhile to expose that undocumented feature more properly, too.
In general, there's a tension between keeping the public interface small, so that we don't hem in future changes or alternative Python implementations, and providing access to an expressive set of primitives.

But there's an opportunity here: because dicts are now guaranteed to be ordered, implementations are already constrained, so we have room to add a few carefully considered extra primitives without adding to the design constraints.
The two operations in this lru-cache prototype, `GetAndPushBack` and `DelOldest`, are just two examples. (The latter could perhaps be integrated as an option for `popitem` to take from the front; I haven't thought about the specifics.)

Ideally we could expose them on the Python level as well, but just adding them to the C API for CPython only would already be useful.
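(For a feel of the intended semantics only: `collections.OrderedDict` already exposes the Python-level shape of these two operations today. This is just an illustration, not the proposed API.)

```python
# Illustration of the two primitives' semantics using OrderedDict, which already
# offers "move an entry to the back" and "pop from the front" at the Python level.
from collections import OrderedDict

d = OrderedDict(a=1, b=2, c=3)

# GetAndPushBack-like: look up a key and mark it as most recently used.
value = d["a"]
d.move_to_end("a")                                  # order is now b, c, a

# DelOldest-like: evict the least recently used (front) entry.
oldest_key, oldest_value = d.popitem(last=False)    # removes "b"

print(value, oldest_key, oldest_value, list(d))     # 1 b 2 ['c', 'a']
```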
We might also want to look at ordered dict, and see whether careful attention to detail in implementation can do for ordered dict what this proposal does for lru-cache. I remember reading the code for a previous attempt, but I can't find it again right now. I quite like the ideas there, but was not sure how careful the implementation was.
EDIT: it's at https://github.com/python/cpython/issues/75448
## Links to prototype implementations
You can find my original prototype on github. That's the one I did the benchmarks on.
Raymond helpfully pointed out that my original prototype made `_functoolsmodule.c` know too much about the internals of `dictobject.c`. So I made a variant that uses an opaque finger, as described above.

This is the most conservative version: there's no change at all for existing users of dict, because it makes no changes to any of the existing dict operations, nor to any of the existing C structs for dicts. It only affects the lru-cache.
Last, but not least, the cleaner version makes the regular dict itself keep track of how many items to skip at the beginning. This version would impact other users of dict, because it adds one extra machine word of space overhead to every dict.
(If you want to get tricky: `DK_ENTRIES` is an array of `PyDictKeyEntry` (or `PyDictUnicodeEntry`), and both variants have enough redundancy that you could sneak the required information in there without using an extra word. You could use `ep->me_key == NULL` in the first entry to mean that `(Py_ssize_t)(ep->me_value)` holds the number of non-active entries to skip.)
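(A rough sketch of how that encoding could look, written against stand-in definitions rather than the real internals; the helper names are made up for illustration.)

```c
/* Sketch only -- not CPython code.  It illustrates the encoding suggested above:
 * if the first entry's me_key is NULL, its me_value slot is reinterpreted as a
 * Py_ssize_t counting the non-active entries to skip at the start of DK_ENTRIES.
 * All definitions below are stand-ins for the real internals. */
#include <stddef.h>

typedef ptrdiff_t Py_ssize_t;        /* stand-in for CPython's Py_ssize_t */
typedef struct _object PyObject;     /* opaque stand-in for PyObject      */

typedef struct {
    Py_ssize_t me_hash;              /* mirrors the PyDictKeyEntry layout */
    PyObject  *me_key;
    PyObject  *me_value;
} PyDictKeyEntry;

/* How many dead entries precede the first active one? */
static Py_ssize_t
skip_count(const PyDictKeyEntry *entries)
{
    if (entries[0].me_key == NULL)                  /* sentinel: metadata entry       */
        return (Py_ssize_t)entries[0].me_value;     /* value slot reused as a counter */
    return 0;                                       /* first entry is still live      */
}

/* Record that `skip` leading entries are no longer active. */
static void
set_skip_count(PyDictKeyEntry *entries, Py_ssize_t skip)
{
    entries[0].me_key   = NULL;                     /* mark the first entry as metadata */
    entries[0].me_value = (PyObject *)skip;         /* smuggle the count into me_value  */
}
```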