weakref rather than str(class_instance) due to id reuse

d1manson commented 1 year ago

See this on SO and links within, or a simple test case as follows:

class Thing():
    pass

thing_strs = set()
for idx in range(100):
    thing = Thing()
    print(f"{idx}: {str(thing)}")
    assert str(thing) not in thing_strs, f"{str(thing)} already used (idx={idx})"
    thing_strs.add(str(thing))

This prints:

0: <__main__.Thing object at 0x7fdfe8f74f90>
1: <__main__.Thing object at 0x7fdfe8f74fd0>
2: <__main__.Thing object at 0x7fdfe8f74f90>
Traceback (most recent call last):
File "<string>", line 8, in <module>
AssertionError: <__main__.Thing object at 0x7fdfe8f74f90> already used (idx=2)

This means that the cache is not scoped properly. It sounds like using weakref is the better option if possible.

This is where it is used currently. Note that this actually applies to all references in the args/kwargs, not just the self object.

Dreamsorcerer commented 1 year ago

I'm not really sure what you're suggesting or what the issue is..

What do you want to change to a weakref?

d1manson commented 1 year ago

The issue is that _key_from_args might map two different sets of args to the same key in the cache.

It would be better to either use weakref directly (rather than str), or else store a mapping from weakref to a truly unique id for each object.

Dreamsorcerer commented 1 year ago

Right, I think I understand what you were trying to say.

The first argument of args can be the self argument, which could be reused.

However, this is unlikely, as it's unlikely you'd be throwing away and recreating new instances of something you wanted to cache results on. So, in practice, I doubt this is a real issue.

weakref is not a str, so it can't be used as a cache key, so I don't see a better solution. We could add some information to docs that recommends ensuring that __str__ is implemented and provides a unique result for every instance.

d1manson commented 1 year ago

In my case, I caught it due to a flaky test, but you could imagine an ORM type use case where you have a user object or something with a .get_secret() method, and if the user id is reused by pthon that would be bad!

d1manson commented 1 year ago

But as i say at the bottom of the inital message, it's not even just the self argument. you could also have a situation where you are passing a user object into a function (not even a class method), and if the user object has the same id it will map back to the same cached result, again that would be bad.

Dreamsorcerer commented 1 year ago

I'm not convince it's likely to cause issues in practice. But, it has just highlighted another more likely issue:

class Foo:
  def __str__(self):
   return "foo"

All instances would get the same cache key with this class. So, maybe we should combine str(self) with id(self) somehow (which still allows you to fix the original issue by making a unique __str__).

d1manson commented 1 year ago

i think weakref is the way to go if possible; caching is likely one of the main purposes for that to exist at all

Dreamsorcerer commented 1 year ago

user object into a function (not even a class method)

Yep, maybe that is a concern, but I don't see a better solution than combining str(o) + id(o) or something similar.

i think weakref is the way to go if possible; caching is likely one of the main purposes for that to exist at all

Again, cache keys are str, how are we going to send a weakref to Redis?

Dreamsorcerer commented 1 year ago

It may also be a desired result that you get the same (cached) result when doing something like:

get_thing(User("foo"))
get_thing(User("foo"))

Even though they are not the same instance. So, actually, I think sticking to str() may be the best option, and just add a warning to the documentation explaining this behaviour (and how to tweak it with a custom __str__()).

d1manson commented 1 year ago

I was thinking of something like:

obj_unique_str_from_ref = weakref.WeakKeyDictionary()

def upsert_unique_str for_obj(ob):
   if ob not in obj_unique_str_from_ref:
      obj_unique_str_from_ref[ob] = uuid() # or whatever
  return obj_unique_str_from_ref[ob]

...
def _key_from_args(...):
     ..."-".join(upsert_unique_str(a) for a in args) # very roughly

This would work for in-memory and would prevent accidental reuse of keys when using redis, but would not allow reuse of keys across server instances....which defeats the point of redis. so you'd need another way to allow users to say they really do know what they are doing. Maybe you check for the presence of a specific magic method (not sure if __str__ is safe enough), e.g. __cache_key__ (which i just made up).

..but i'm definitely not claiming to be an expert!

Dreamsorcerer commented 1 year ago

OK, that could work, but I think reusing from different instances might be useful in some cases.

A custom __cache_key__ might also be useful, but not sure it's really adding anything over str().

I think we could add some examples that demonstrate appropriate uses, like:

class GeoPos:
    def __init__(self, lat, long):
        self.lat = lat
        self.long = long

    def __str__(self):
        return "Coord({}, {})".format(self.lat, self.long)

get_location(GeoPos(1, 2))
get_location(GeoPos(1, 2))  # Returns cached location

class Unique:
    def __init__(self):
        self.uuid = uuid()

    def __str__(self):
        return "Unique({})".format(self.uuid)

get_value(Unique())
get_value(Unique())  # Not cached

Note that Sqlalchemy produces a __str__ which look like "User(id=4, email='foo@bar.example')", so things should work as expected there and wouldn't cause the problem you mentioned above.

d1manson commented 1 year ago

I'm not going to push this any further than this message, but I do think the current behaviour is dangerous and should be addressed in some form, at the very least with a prominent warning in the docs.

Yes in many real-world cases there will be a sensible __str__ method in place, but this is far from a guarantee. It might be worth comparing against the functools lr_cache behaviour, which is likely something people are familiar with.

from functools import lru_cache
import aiocache
import asyncio
import random

class Thing:
    pass

@lru_cache(maxsize=10)
def functools_cached_thing_argument(a_thing):
    return random.random()

@aiocache.cached()
async def aiocache_cached_thing_argument(a_thing):
    return random.random()

async def main():
    functools_res_strs = set()
    aiocache_res_strs = set()
    for idx in range(100):
        thing = Thing()
        functools_res = functools_cached_thing_argument(thing)
        aiocache_res = await aiocache_cached_thing_argument(thing)
        print(f"{idx}: functools_res={functools_res}, aiocache_res={aiocache_res}, str={str(thing)}")

        assert functools_res not in functools_res_strs
        functools_res_strs.add(functools_res)

        # if you uncomment this assertion it will fail, but the assertion above never fails.
        #assert aiocache_res not in aiocache_res_strs 
        aiocache_res_strs.add(aiocache_res)

asyncio.get_event_loop().run_until_complete(main())

As far as I can tell the functools lru_cache holds a regular reference (i.e. not a weak reference) to cached objects, which seems suboptimal, but it does have the effect of preventing reuse of ids for any objects still live in the cache (because if they are in the cache it means they haven't been deleted yet and thus their id cannot be reused!).

Dreamsorcerer commented 1 year ago

Yes in many real-world cases there will be a sensible __str__ method in place, but this is far from a guarantee. It might be worth comparing against the functools lr_cache behaviour, which is likely something people are familiar with.

Umm, it appears to suffer the same issue in regards to keys. The only reason you don't hit the issue in your example is because the objects remain referenced in the cache and are never garbage collected.

Here it returns a hashed tuple of the arguments as the key: https://github.com/python/cpython/blob/main/Lib/functools.py#L477 Which you can see is just using the hash() function: https://github.com/python/cpython/blob/1de4395f62bb140563761ef5cbdf46accef3c550/Lib/functools.py#L443

If you leave maxsize at the default value and uncomment the assert, you'll see the aiocache version passes as well, because no objects have been garbage collected.

Your idea of a WeakKeyDictionary mapping to uuids seems like the only potential solution currently. It also still seems useful more often than not to have this duplication on the str() value, when considering temporary ORM objects and multiple processes (which you're more likely to be wanting if you're using a redis cache or something).

So, it seems to me like something that should be disabled on a case-by-case basis, and that can already be done with a custom __str__, but maybe there's room to do something with the WeakKeyDictionary too..

d1manson commented 1 year ago

If you leave maxsize at the default value and uncomment the assert, you'll see the aiocache version passes as well, because no objects have been garbage collected.

Yes, but that misses the point. The functools version can never map two different objects to the same cache key* as a side effect of the _HashedSeq storing a reference to the tuple self[:] = tup.. because for the duration that a given object is used somewhere in a cache key its id is guaranteed to not be available to CPyrhon for reuse (as it is still in use). This is not the case with aiocache, where objects used as cache keys can in fact be GCd and have their id repurposed.

In other words, the test here is the exact opposite of what you suggest, namely comment out the call to the functools version (rather than increase its ability to prevent GC) and then the aiocache version is left to fend for itself, and things actually get worse.

*Unless we consider custom hash implementation or collisions.

Dreamsorcerer commented 1 year ago

In other words, the test here is the exact opposite of what you suggest, namely comment out the call to the functools version (rather than increase its ability to prevent GC) and then the aiocache version is left to fend for itself, and things actually get worse.

I just mean that we are solving a fundamentally different problem and there is nothing we can take from lru_cache's implementation.

Currently WeakKeyDictionary is the only proposal to fix this. But, if you're using an external cache, that's almost certainly because you want the cache to persist between executions and/or to be shared between multiple processes. WeakKeyDictionary would essentially reset the cache on every execution (but, leave stale data around). Another problem with WeakKeyDictionary, is that you can't use mutable values. If you pass a list to the function, it will error.

I can't see a solution currently (other than an opt-in for WeakKeyDictionary, or customising __str__, which both require user knowledge) that meets these demands. e.g. my_func(1, 2, 3) should be able to use a cached value between executions and between different processes.

d1manson commented 1 month ago

for reference - https://github.com/satwikkansal/wtfpython?tab=readme-ov-file#-deep-down-were-all-the-same

aio-libs / aiocache

weakref rather than str(class_instance) due to id reuse #641