Closed - Heliotrop3 closed this issue 3 years ago
> @leycec, if you have no corrections or suggested edits I'll get started on this.

This... this is a thing of rabid joy. You even reformatted double-quoted code snippets in my `FIXME:` comments into monospaced Markdown. Depictions of my face are now incoming: :smiley: :smile:

> ...I'll get started on this.
Please do this only if it's fun, inspiring, and furnishes you with intellectual growth opportunities befitting your man-made provenance as an enlightened (likely artificial) intelligence.
Also, take your time. I'll just be here tinkering in dark and fathomless segments of the codebase where none dare tread but screeching Rhesus monkeys.
> `LRUCache` should NOT satisfy the `collections.abc.MutableMapping` API - at least not initially.
This will be a little difficult. Why? As per PEP 3119:
> The built-in type `dict` derives from `MutableMapping`.

The documentation for `dict` leads to the documentation for mappings, which states:

> A mapping object maps hashable values to arbitrary objects. Mappings are mutable objects. There is currently only one standard mapping type, the dictionary.

Ergo, by default `dict` meets the criteria of a `MutableMapping`.
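This is easy to confirm directly; a quick illustrative check (mine, not from the thread):

```python
from collections.abc import MutableMapping

# The built-in dict both subclasses and satisfies MutableMapping,
# exactly as PEP 3119 describes.
assert issubclass(dict, MutableMapping)
assert isinstance({}, MutableMapping)
```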
### `dict`

Currently, Python 3.6 is the only actively developed version where ordering is not explicitly a language feature. Although, the devs' discussion revealed "... the regular dict does stay ordered across deletions in CPython3.6". Another chain confirms this.
Note: Raymond Hettinger confirms he "backdoored" this behaviour into PyPy. Later on he covers the question about preserving order on deletion.
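As a quick sanity check of that ordering-across-deletions claim (assuming CPython 3.6+):

```python
# dicts preserve insertion order even across deletions: deleting a key
# and later re-inserting it moves that key to the end of iteration order.
d = {'a': 1, 'b': 2, 'c': 3}
del d['b']
assert list(d) == ['a', 'c']
d['b'] = 4
assert list(d) == ['a', 'c', 'b']
```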
### `OrderedDict`'s Design

There was discussion on the issue board around removing `OrderedDict`. A summary of the discussion is as follows: `OrderedDict`'s design was intentional and aimed at cases not able to be met by the generalized `dict`. In other words, `OrderedDict` and `dict` serve different purposes. Notably, the general consensus is that the use-case for `OrderedDict` will substantially decrease now that `dict`s in Python 3.7+ retain insertion order. Likewise, reviewing the changelogs suggests `OrderedDict` is being phased out:
- Convert `inspect.Signature.parameters` and `inspect.BoundArguments.arguments` from `OrderedDict`s to `dict`s
- Convert `typing.NamedTuple`'s `__annotations__` from `OrderedDict` to a regular `dict`
- Convert `namedtuple`'s `_asdict()` method from `OrderedDict` to `dict`
- Change `dataclasses` from `OrderedDict` to plain `dict`
- Migrate `configparser` from `OrderedDict` to `dict`
A statement to consider from the author of `OrderedDict`. Note: I am uncertain whether the message is referring to that specific use case or in general.
Do we want to keep the current proposed implementation? Please note, this isn't a question I pose lightly - while I understand beartype only supports "...all actively developed versions of Python", the answer has ramifications on performance along with ease of implementing backwards compatibility should that ever be desired.
As mentioned earlier, Python 3.6 technically already retains insertion order. Likewise, this year marked the beginning of the end of its development. Thus it would seem there's a decision to make:
- `dict`
- `OrderedDict`

The last two run counter to beartype's modus operandi, so why consider it? `OrderedDict` is good; heck, it's specifically designed for this kind of thing. But Python 3.7+ brings some noticeable improvements to `dict` - aside from retaining insertion order, it also received non-trivial speed AND memory buffs. Implementing both serves as a way to get the best of both worlds: version-dependent optimal performance extending across Python 3.x. The decision will, I think, boil down to whether we want to accept the trade-offs the 3.7+ `dict` has - slower comparison and no `move_to_end` - or whether we want to consider the potential for supporting backwards compatibility.
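To make the `move_to_end` trade-off concrete, here's a small sketch (my illustration, not from the thread): `OrderedDict` reorders in O(1) via its internal linked list, while a plain `dict` must emulate the same move by deleting and re-inserting.

```python
from collections import OrderedDict

# OrderedDict: O(1) reordering via its internal doubly linked list.
od = OrderedDict(a=1, b=2, c=3)
od.move_to_end('a')
assert list(od) == ['b', 'c', 'a']

# Plain dict: no move_to_end, so pop the key and re-insert it instead.
d = dict(a=1, b=2, c=3)
d['a'] = d.pop('a')
assert list(d) == ['b', 'c', 'a']
```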
Unless there's a major trade-off I've overlooked, I'm in favor of running with the built-in `dict`, since 3.6 is about to hit end of development. Likewise, if implementing this core feature is critical to beartype, then development time becomes a factor - after looking at `OrderedDict`, there are a couple of things I'd want to continue to look into before designing and implementing a solution; its source is one Pythonic way to implement a doubly linked list. I'm still familiarizing myself with the `weakref` library, but I think I get the gist - why keep an object alive merely because it's cached? I'm still curious whether we should re-implement `OrderedDict` or use the built-in one - I'm going to work my way through the CPython source of the built-in `dict` and see what it's doing.
```python
class LRUCache(dict):
    ...

    def add_to_cache(self, key, value):
        # If key is cached:
        #     Remove the key/value pair from the dict.
        #     Add the key mapped to the passed value back to the cache.
        # If the cache is not at capacity:
        #     Add the key/value pair to the cache.
        # If the cache is at capacity:
        #     Remove the item at the front of the cache.
        #     Add the key/value pair to the cache.
        ...

    def get_item(self, key):
        # Return the value associated with the key - None if no such
        # association exists.
        ...

    def set_item(self, key, value):
        # Update the value for the associated key in the cache - remove
        # and reinsert.
        ...
```
> Ergo, by default `dict` meets the criteria of a `MutableMapping`.
You're right, of course. You usually are.
I wrote that before I realized why `OrderedDict` subclassed `dict`; I thought someone had a good reason, but it turns out it was just premature optimization. Which is the best possible reason! We obsessively love premature optimization around here.
It's almost unhealthy.
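For the curious, the "premature optimization" in question is `OrderedDict`'s habit of binding superclass methods as default parameters, so each call skips a per-call attribute lookup. A minimal sketch with illustrative class names (not from the thread):

```python
class PlainStore(dict):
    def store(self, key, value):
        # Resolves the "dict.__setitem__" attribute on every call.
        dict.__setitem__(self, key, value)

class BoundStore(dict):
    def store(self, key, value, dict_setitem=dict.__setitem__):
        # "dict.__setitem__" was resolved once, at function definition time.
        dict_setitem(self, key, value)

# Both behave identically; the second merely shaves off a lookup per call.
plain, bound = PlainStore(), BoundStore()
plain.store('a', 1)
bound.store('a', 1)
assert dict(plain) == dict(bound) == {'a': 1}
```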
> I'm in favor of running with the built-in `dict` since 3.6 is about to hit end of development.
You raise relevant, pertinent, and wise questions in keeping with your elevated stature as a professional code oracle. In fact... you're right about everything. I mean, of course you are. You're the code oracle.
I'd initially contemplated just using an insertion-order-preserving `dict`, too. I swear! This sweat on my forehead is not an admission of guilt, but merely of the bright desk lamp tanning my hairless scalp. I foolishly tripped myself up in wondering how to efficiently access outdated keys. Of course, they're already at the head of the `dict`. </sigh>
Exactly as you say, let us embrace the builtin `dict`. The days of `OrderedDict` are numbered.
Your `LRUCache(dict)` pseudo-code is an excellent first stab into the inveterate darkness, but I wonder if we shouldn't just comply with the standard `dict` protocol rather than defining our own non-orthogonal ad-hoc API? I am meditating on something like:
```python
class LRUCache(dict):

    def __init__(self, size: int) -> None:
        assert isinstance(size, int), (
            f'LRU cache size {repr(size)} not integer.')
        assert size > 0, f'LRU cache size {size} not positive.'

        super().__init__()
        self._size = size

    def __getitem__(
        self,
        key: 'Hashable',

        # Superclass methods efficiently bound as default parameters.
        dict_delitem=dict.__delitem__,
        dict_getitem=dict.__getitem__,
        dict_setitem=dict.__setitem__,
    ) -> object:
        # Value associated with this key if this key exists in this dictionary
        # *OR* raise the standard "KeyError" exception otherwise.
        value = dict_getitem(self, key)

        # Prioritize this key by removing and re-adding this key back to
        # the head of this cache.
        dict_delitem(self, key)
        dict_setitem(self, key, value)

        # Return this value.
        return value

    def __setitem__(
        self,
        key: 'Hashable',
        value: object,

        # Superclass methods efficiently bound as default parameters.
        dict_hasitem=dict.__contains__,
        dict_delitem=dict.__delitem__,
        dict_setitem=dict.__setitem__,
        dict_iter=dict.__iter__,
        dict_len=dict.__len__,
    ) -> None:
        # If this key exists in this dictionary, prioritize this key by
        # removing this key *BEFORE* re-adding this key back below.
        if dict_hasitem(self, key):
            dict_delitem(self, key)

        # (Re-)add this key to the head of this cache.
        dict_setitem(self, key, value)

        # If adding this key caused this cache to exceed its maximum capacity,
        # silently remove the first (and thus least recently used) key-value
        # pair from this cache.
        if dict_len(self) > self._size:
            dict_delitem(self, next(dict_iter(self)))
```
In theory, that should do us up. Maybe? I don't know. I'm just the grizzled Canadian whose cats routinely sleep on his head while he fitfully sleeps. No wonder I have nightmares all the time.
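As a quick sanity check, here's a condensed, self-contained restatement of that sketch together with a short exercise of its eviction and reprioritization behaviour (my illustration; untested against beartype itself):

```python
# Condensed restatement of the LRUCache sketch above, sans the
# default-parameter micro-optimization, plus a quick exercise.
class LRUCache(dict):
    def __init__(self, size: int) -> None:
        assert isinstance(size, int) and size > 0
        super().__init__()
        self._size = size

    def __getitem__(self, key):
        # Reprioritize this key by removing and re-adding it at the tail.
        value = dict.__getitem__(self, key)
        dict.__delitem__(self, key)
        dict.__setitem__(self, key, value)
        return value

    def __setitem__(self, key, value):
        # (Re-)add this key at the tail, evicting the head on overflow.
        if dict.__contains__(self, key):
            dict.__delitem__(self, key)
        dict.__setitem__(self, key, value)
        if dict.__len__(self) > self._size:
            dict.__delitem__(self, next(iter(self)))

cache = LRUCache(size=2)
cache['a'] = 1
cache['b'] = 2
assert cache['a'] == 1           # touching 'a' makes it most recently used
cache['c'] = 3                   # capacity exceeded: evict 'b', the LRU key
assert list(cache) == ['a', 'c']
```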
We could theoretically go one step further and subclass `LRUCache` from some dictionary-specific `weakref` data structure like `WeakValueDictionary`, which is what any bog-standard boring `LRUCache` implementation would do. But we are not boring! We are brave and foolish.
So, perhaps we'll just leave that up to the caller instead. I don't actually envision using `weakref` with `LRUCache` at the moment, because `weakref` dangerously exposes objects to immediate garbage collection – whereas we actually want to persist iterators over dictionaries and sets between calls to `@beartype`-decorated callables. If we stored weak rather than strong references to those iterators, they'd just immediately disappear on the next garbage collection run due to being born in the first generation (i.e., generation 0).
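The "immediate disappearance" concern can be demonstrated in isolation (assuming CPython, where reference counting makes reclamation deterministic):

```python
import weakref

class Iterish:
    """Stand-in for a cached iterator object (illustrative only)."""

obj = Iterish()
weak = weakref.ref(obj)
assert weak() is obj   # resolvable while a strong reference exists
del obj                # drop the only strong reference...
assert weak() is None  # ...and the referent is reclaimed immediately
```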
You are astoundingly clever. I am not. Therefore, I could yet again use your vital aid. I have `FIXME:` comments at lines 483 and 654 of `_pephint` detailing this absurd scheme. Can you gin up anything clever-er? My approach isn't clever, but it should work. Should. Or it will just blow up `beartype` and leak memory everywhere like a C++ app on `valgrind` and make a mockery of everything we stand for.
> I wrote that before I realized why `OrderedDict` subclassed `dict`; I thought someone had a good reason, but it turns out it was just premature optimization. Which is the best possible reason! We obsessively love premature optimization around here.

> "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." - Donald Knuth
While I would tend to agree, I'm inclined to think beartype is slightly unique. Why? We're aiming to keep as close to the metal with respect to Python as possible - i.e., the standard library. Thus our initial implementation relies heavily on understanding the STL, its underlying mechanics such as garbage collection and the implementation of data structures, and how to piece it all together. In short, a methodical approach to "Think twice, code once".
> ...I wonder if we shouldn't just comply with the standard `dict` protocol rather than defining our own non-orthogonal ad-hoc API? I am meditating on something like: `class LRUCache(dict): ...`
> In theory, that should do us up. Maybe?

I'm inclined to agree - beauty in simplicity. #ThisIsTheWay (I like the move-to-end-then-prune approach!)
> We could theoretically go one step further and subclass `LRUCache` from some dictionary-specific `weakref` data structure like `WeakValueDictionary`, which is what any bog-standard boring `LRUCache` implementation would do. But we are not boring! We are brave and foolish.

> I don't actually envision using `weakref` with `LRUCache` at the moment, because `weakref` dangerously exposes objects to immediate garbage collection – whereas we actually want to persist iterators over dictionaries and sets between calls to `@beartype`-decorated callables. If we stored weak rather than strong references to those iterators, they'd just immediately disappear on the next garbage collection run due to being born in the first generation (i.e., generation 0).
`beartype.beartype._decor._code._pep._pephint` contains notes detailing how to implement a thread-safe cache for checking mappings and discarding objects with no existing external references. `WeakValueDictionary` seems like a pre-built perfect fit:

> Mapping class that references values weakly. Entries in the dictionary will be discarded when no strong reference to the value exists any more.

Is there a reason, aside from intellectual masochism, to shy away from the STL's shiny toy with automatic pruning?
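The auto-pruning behaviour quoted above looks like this in practice (CPython assumed, for deterministic collection; the `Payload` class is just a placeholder, since many built-in types can't be weakly referenced):

```python
import weakref

class Payload:
    """Placeholder value type supporting weak references."""

wvd = weakref.WeakValueDictionary()
value = Payload()
wvd['key'] = value
assert 'key' in wvd        # entry lives while a strong reference exists
del value                  # drop the last strong reference...
assert 'key' not in wvd    # ...and the entry is silently discarded
```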
> You are astoundingly clever. I am not. Therefore, I could yet again use your vital aid. I have `FIXME:` comments at lines 483 and 654 of `_pephint` detailing this absurd scheme. Can you gin up anything clever-er? My approach isn't clever, but it should work. Should. Or it will just blow up `beartype` and leak memory everywhere like a C++ app on `valgrind` and make a mockery of everything we stand for.
I'm just a geek who likes to dive deep - I'll open tickets for these after I research.
> `beartype.beartype._decor._code._pep._pephint` contains notes detailing how to implement a thread-safe cache for checking mappings and discarding objects with no existing external references. `WeakValueDictionary` seems like a pre-built perfect fit:
>
> > Mapping class that references values weakly. Entries in the dictionary will be discarded when no strong reference to the value exists any more.
>
> Is there a reason, aside from intellectual masochism, to shy away from the STL's shiny toy with automatic pruning?
I think so. It's hard to be sure without actually rolling up the ice-encrusted jacket sleeves and getting dirty with some actual pragmatic code, but the underlying issue here is that we aren't caching references to `dict` or `set` objects as such; we're caching references to iterator objects over `dict` and `set` objects (e.g., the objects returned by the `dict.items()` and `set.__iter__()` methods).
If we only cache weak references to those iterators, Python will probably just immediately garbage collect those iterators, because those weak references will be the only references to those iterators, which rather defeats the purpose of caching those iterators in the first place.
Ideally, Python wouldn't distinguish between a `dict` object and the `ItemsView` object returned by the `dict.items()` method. If those objects were effectively just aliases or synonyms for one another, we wouldn't be having this discussion; we could just maintain a weak reference to that `ItemsView` object, secure in the low-level knowledge that that `ItemsView` object would share the exact same lifetime as the `dict` object that created it.
Naturally, this isn't the case:
```python
>>> muh_dict = {'ugh': 'gah'}
>>> muh_dict_iter = muh_dict.items()
>>> id(muh_dict)
139641120632416
>>> id(muh_dict_iter)
139641120575952
```
These two objects have different IDs, which means they are distinct objects with potentially distinct lifetimes. `ItemsView` objects almost certainly persist internal strong references to their underlying `dict` objects, because otherwise they wouldn't be reliable iterators. This means that persisting a strong reference to an `ItemsView` also persists a strong reference to its underlying `dict` object.
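That internal strong reference is easy to verify (a `dict` subclass is used here only because plain `dict`s don't support weak references; this is my illustration, not from the thread):

```python
import gc
import weakref

class TrackedDict(dict):
    """dict subclass solely so the instance can be weakly referenced."""

muh_dict = TrackedDict(ugh='gah')
muh_view = muh_dict.items()          # the view references its dict
dict_ref = weakref.ref(muh_dict)

del muh_dict
gc.collect()
assert dict_ref() is not None        # the ItemsView alone keeps the dict alive

del muh_view
gc.collect()
assert dict_ref() is None            # dropping the view releases the dict
```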
And here we are – up garbage collection creek without a paddle. :canoe:
...couldn't help myself. I added the completely untested `LRUCache` class above with voluminous documentation to a new `beartype._util.cache.utilcachelru` submodule. Feel free to poke at that with a debugger and possibly even define unit tests, if you like. Once one or both of us exercise this into the ground, we'll close this issue up.
Thanks again for all the heady conversation, Tyler! Having such an enthusiastic, inspired, and almost too clever codebro is a delightful change of pace from the oppressive chirping of winter crickets. :cricket:
Huzzah. It is done. You have my profound gratitude for forcefully pushing me from `OrderedDict` to `dict`. We sure dodged a bullet there! </wipes_forehead>
Formatted and edited version of @leycec's analysis of implementing an LRU cache. See `beartype.beartype._decor._code._pep._pephint` for the unabridged version.
## Problem

Implement a highly optimized LRU cache.

## Solution

Combine a doubly-linked list and a hash table into an efficient O(1)-time LRU cache with minimal space overhead.
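A minimal sketch of that classic design (illustrative only, not the beartype implementation): a hash map gives O(1) key lookup, while a circular doubly linked list gives O(1) reordering and eviction.

```python
class _Link:
    """Node in a circular doubly linked list."""
    __slots__ = ('prev', 'next', 'key', 'value')

class LinkedLRU:
    def __init__(self, size: int) -> None:
        self._size = size
        self._map = {}          # key -> _Link, for O(1) lookup
        root = _Link()          # sentinel: root.next is LRU, root.prev is MRU
        root.prev = root.next = root
        self._root = root

    def _unlink(self, link) -> None:
        link.prev.next = link.next
        link.next.prev = link.prev

    def _append(self, link) -> None:
        # Splice this link in just before the sentinel (the MRU slot).
        last = self._root.prev
        last.next = self._root.prev = link
        link.prev, link.next = last, self._root

    def get(self, key):
        link = self._map[key]   # KeyError if absent, like dict
        self._unlink(link)      # O(1) move to the MRU position
        self._append(link)
        return link.value

    def put(self, key, value) -> None:
        if key in self._map:
            self._unlink(self._map.pop(key))
        link = _Link()
        link.key, link.value = key, value
        self._map[key] = link
        self._append(link)
        if len(self._map) > self._size:
            lru = self._root.next      # evict the least recently used entry
            self._unlink(lru)
            del self._map[lru.key]

lru = LinkedLRU(2)
lru.put('a', 1)
lru.put('b', 2)
assert lru.get('a') == 1    # 'a' becomes most recently used
lru.put('c', 3)             # capacity exceeded: 'b' is evicted
assert sorted(lru._map) == ['a', 'c']
```

This is essentially what `OrderedDict` does internally, which is why it's tempting as a base class; the point of the plan below is that we can borrow the idea without depending on the class.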
## Plan
We're evidently not the first to trod this path - a clever Gist leveraging this solution exists. While it's an inspiring implementation, we want to avoid directly using `OrderedDict`. Why? The implementation of PEP 468 effectively puts an expiration date on `OrderedDict`. Which is a shame, since its source reveals it's implemented through a doubly linked list.

Lucky for us, we can avoid direct usage of `OrderedDict`. Specifically, we:

- Define a new `beartype._util.cache.utilcachelru` submodule.
- Within this submodule, define a new `LRUCache` class.
- Model that class on the `OrderedDict` class while avoiding any subclassing of `OrderedDict`.

In short, use the `OrderedDict` implementation as inspiration for our own LRU-specific implementation.
## Notes

### Subclassing `dict`

Subclassing `dict` appears to be purely an efficiency optimization - methods are able to efficiently call superclass `dict` methods WITHOUT dictionary lookups. This is achieved by leveraging default parameters via this clever calling convention (which we want to implement):

`def __setitem__(self, key, value, dict_setitem=dict.__setitem__, proxy=_proxy, Link=_Link)`

### LRU Cache vs. `OrderedDict` implementation

`LRUCache` should NOT satisfy the `collections.abc.MutableMapping` API - at least not initially. Why? We don't require most of the functionality of the `OrderedDict` class -- only the exact subset required to selectively implement an LRU cache. This means at least these standard dunder methods: