Open d1manson opened 1 year ago
I'm not really sure what you're suggesting or what the issue is..
What do you want to change to a weakref?
The issue is that _key_from_args
might map two different sets of args to the same key in the cache.
It would be better to either use weakref directly (rather than str
), or else store a mapping from weakref to a truly unique id for each object.
Right, I think I understand what you were trying to say.
The first argument of args
can be the self
argument, which could be reused.
However, this is unlikely, as it's unlikely you'd be throwing away and recreating new instances of something you wanted to cache results on. So, in practice, I doubt this is a real issue.
weakref is not a str
, so it can't be used as a cache key, so I don't see a better solution. We could add some information to docs that recommends ensuring that __str__
is implemented and provides a unique result for every instance.
In my case, I caught it due to a flaky test, but you could imagine an ORM type use case where you have a user object or something with a .get_secret()
method, and if the user id is reused by pthon that would be bad!
But as i say at the bottom of the inital message, it's not even just the self argument. you could also have a situation where you are passing a user object into a function (not even a class method), and if the user object has the same id it will map back to the same cached result, again that would be bad.
I'm not convince it's likely to cause issues in practice. But, it has just highlighted another more likely issue:
class Foo:
def __str__(self):
return "foo"
All instances would get the same cache key with this class. So, maybe we should combine str(self)
with id(self)
somehow (which still allows you to fix the original issue by making a unique __str__
).
i think weakref is the way to go if possible; caching is likely one of the main purposes for that to exist at all
user object into a function (not even a class method)
Yep, maybe that is a concern, but I don't see a better solution than combining str(o) + id(o)
or something similar.
i think weakref is the way to go if possible; caching is likely one of the main purposes for that to exist at all
Again, cache keys are str
, how are we going to send a weakref to Redis?
It may also be a desired result that you get the same (cached) result when doing something like:
get_thing(User("foo"))
get_thing(User("foo"))
Even though they are not the same instance. So, actually, I think sticking to str()
may be the best option, and just add a warning to the documentation explaining this behaviour (and how to tweak it with a custom __str__()
).
I was thinking of something like:
obj_unique_str_from_ref = weakref.WeakKeyDictionary()
def upsert_unique_str for_obj(ob):
if ob not in obj_unique_str_from_ref:
obj_unique_str_from_ref[ob] = uuid() # or whatever
return obj_unique_str_from_ref[ob]
...
def _key_from_args(...):
..."-".join(upsert_unique_str(a) for a in args) # very roughly
This would work for in-memory and would prevent accidental reuse of keys when using redis, but would not allow reuse of keys across server instances....which defeats the point of redis. so you'd need another way to allow users to say they really do know what they are doing. Maybe you check for the presence of a specific magic method (not sure if __str__
is safe enough), e.g. __cache_key__
(which i just made up).
..but i'm definitely not claiming to be an expert!
OK, that could work, but I think reusing from different instances might be useful in some cases.
A custom __cache_key__
might also be useful, but not sure it's really adding anything over str()
.
I think we could add some examples that demonstrate appropriate uses, like:
class GeoPos:
def __init__(self, lat, long):
self.lat = lat
self.long = long
def __str__(self):
return "Coord({}, {})".format(self.lat, self.long)
get_location(GeoPos(1, 2))
get_location(GeoPos(1, 2)) # Returns cached location
class Unique:
def __init__(self):
self.uuid = uuid()
def __str__(self):
return "Unique({})".format(self.uuid)
get_value(Unique())
get_value(Unique()) # Not cached
Note that Sqlalchemy produces a __str__
which look like "User(id=4, email='foo@bar.example')"
, so things should work as expected there and wouldn't cause the problem you mentioned above.
I'm not going to push this any further than this message, but I do think the current behaviour is dangerous and should be addressed in some form, at the very least with a prominent warning in the docs.
Yes in many real-world cases there will be a sensible __str__
method in place, but this is far from a guarantee. It might be worth comparing against the functools
lr_cache
behaviour, which is likely something people are familiar with.
from functools import lru_cache
import aiocache
import asyncio
import random
class Thing:
pass
@lru_cache(maxsize=10)
def functools_cached_thing_argument(a_thing):
return random.random()
@aiocache.cached()
async def aiocache_cached_thing_argument(a_thing):
return random.random()
async def main():
functools_res_strs = set()
aiocache_res_strs = set()
for idx in range(100):
thing = Thing()
functools_res = functools_cached_thing_argument(thing)
aiocache_res = await aiocache_cached_thing_argument(thing)
print(f"{idx}: functools_res={functools_res}, aiocache_res={aiocache_res}, str={str(thing)}")
assert functools_res not in functools_res_strs
functools_res_strs.add(functools_res)
# if you uncomment this assertion it will fail, but the assertion above never fails.
#assert aiocache_res not in aiocache_res_strs
aiocache_res_strs.add(aiocache_res)
asyncio.get_event_loop().run_until_complete(main())
As far as I can tell the functools lru_cache
holds a regular reference (i.e. not a weak reference) to cached objects, which seems suboptimal, but it does have the effect of preventing reuse of ids for any objects still live in the cache (because if they are in the cache it means they haven't been deleted yet and thus their id cannot be reused!).
Yes in many real-world cases there will be a sensible
__str__
method in place, but this is far from a guarantee. It might be worth comparing against thefunctools
lr_cache
behaviour, which is likely something people are familiar with.
Umm, it appears to suffer the same issue in regards to keys. The only reason you don't hit the issue in your example is because the objects remain referenced in the cache and are never garbage collected.
Here it returns a hashed tuple of the arguments as the key: https://github.com/python/cpython/blob/main/Lib/functools.py#L477 Which you can see is just using the hash() function: https://github.com/python/cpython/blob/1de4395f62bb140563761ef5cbdf46accef3c550/Lib/functools.py#L443
If you leave maxsize at the default value and uncomment the assert, you'll see the aiocache version passes as well, because no objects have been garbage collected.
Your idea of a WeakKeyDictionary mapping to uuids seems like the only potential solution currently. It also still seems useful more often than not to have this duplication on the str() value, when considering temporary ORM objects and multiple processes (which you're more likely to be wanting if you're using a redis cache or something).
So, it seems to me like something that should be disabled on a case-by-case basis, and that can already be done with a custom __str__
, but maybe there's room to do something with the WeakKeyDictionary too..
If you leave maxsize at the default value and uncomment the assert, you'll see the aiocache version passes as well, because no objects have been garbage collected.
Yes, but that misses the point. The functools version can never map two different objects to the same cache key* as a side effect of the _HashedSeq
storing a reference to the tuple self[:] = tup
.. because for the duration that a given object is used somewhere in a cache key its id is guaranteed to not be available to CPyrhon for reuse (as it is still in use). This is not the case with aiocache, where objects used as cache keys can in fact be GCd and have their id repurposed.
In other words, the test here is the exact opposite of what you suggest, namely comment out the call to the functools version (rather than increase its ability to prevent GC) and then the aiocache version is left to fend for itself, and things actually get worse.
*Unless we consider custom hash implementation or collisions.
In other words, the test here is the exact opposite of what you suggest, namely comment out the call to the functools version (rather than increase its ability to prevent GC) and then the aiocache version is left to fend for itself, and things actually get worse.
I just mean that we are solving a fundamentally different problem and there is nothing we can take from lru_cache's implementation.
Currently WeakKeyDictionary is the only proposal to fix this. But, if you're using an external cache, that's almost certainly because you want the cache to persist between executions and/or to be shared between multiple processes. WeakKeyDictionary would essentially reset the cache on every execution (but, leave stale data around). Another problem with WeakKeyDictionary, is that you can't use mutable values. If you pass a list to the function, it will error.
I can't see a solution currently (other than an opt-in for WeakKeyDictionary, or customising __str__
, which both require user knowledge) that meets these demands.
e.g. my_func(1, 2, 3)
should be able to use a cached value between executions and between different processes.
See this on SO and links within, or a simple test case as follows:
This prints:
This means that the cache is not scoped properly. It sounds like using weakref is the better option if possible.
This is where it is used currently. Note that this actually applies to all references in the args/kwargs, not just the
self
object.