aio-libs / async-lru

Simple LRU cache for asyncio
https://pypi.org/p/async-lru
MIT License

Race condition in case of multiple event loops #611

Open bcmills opened 1 month ago

bcmills commented 1 month ago

It appears that the alru_cache decorator uses an unsynchronized OrderedDict instance for its cache: https://github.com/aio-libs/async-lru/blob/dc601e2b288336f4511e3ec1afad22cf7b0ab04e/async_lru/__init__.py#L103

If I am not mistaken, that will lead to a data race if the decorated function is called simultaneously from different asyncio event loops running on different threads — which seems to be a realistic (if rare) situation (compare https://github.com/python/cpython/issues/93462#issuecomment-1189390583).
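
For concreteness, here is a minimal sketch of the situation I mean (the function and keys are made up for illustration):

```python
import asyncio
import threading

from async_lru import alru_cache


@alru_cache(maxsize=32)
async def fetch(key: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for real I/O
    return key * 2


async def main() -> None:
    await asyncio.gather(*(fetch(k % 4) for k in range(100)))


# Each thread runs its own event loop, but every loop shares the one
# module-level cache dict with no synchronization between threads;
# a future cached by one loop may also end up awaited from another.
threads = [threading.Thread(target=asyncio.run, args=(main(),)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```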

If I understand correctly, other asyncio wrappers generally bind the event loop during initialization and fail explicitly if called from a different event loop. Unfortunately, since the alru_cache decorator generally runs at module init time, there is no such loop on the thread to bind to.
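
The pattern I have in mind looks roughly like this (a hypothetical sketch; it binds lazily at first use, since no loop exists at decoration time):

```python
import asyncio


class LoopBoundResource:
    """Hypothetical illustration of the bind-and-check pattern."""

    def __init__(self) -> None:
        self._loop: asyncio.AbstractEventLoop | None = None

    async def use(self) -> None:
        loop = asyncio.get_running_loop()
        if self._loop is None:
            # Bind lazily: no loop exists at module init time, so bind
            # to whichever loop calls us first.
            self._loop = loop
        elif self._loop is not loop:
            # Fail loudly instead of racing on shared state.
            raise RuntimeError("LoopBoundResource is bound to a different event loop")
```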

I don't have a clear idea of how this problem could be fixed; I just wanted to point it out as a (significant and subtle) risk for users of this module.

bcmills commented 1 month ago

Note that even if the individual dict operations were protected by the GIL, there are read-modify-write sequences in the code that IIUC are not: https://github.com/aio-libs/async-lru/blob/dc601e2b288336f4511e3ec1afad22cf7b0ab04e/async_lru/__init__.py#L199-L214
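
Reduced to a self-contained toy (not the library's actual code), the hazard looks like this:

```python
cache: dict[str, int] = {}


def expensive(key: str) -> int:
    return len(key)  # stand-in for real work


def get_or_compute(key: str) -> int:
    # Each individual dict operation below is atomic under the GIL,
    # but the get/compute/store sequence as a whole is not: another
    # thread can run between the get() and the assignment, so two
    # threads may both miss, both compute, and one write silently
    # clobbers the other.
    value = cache.get(key)
    if value is None:
        value = expensive(key)
        cache[key] = value
    return value
```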

Dreamsorcerer commented 1 month ago

Not sure if multiple loops are something we really want to care about; it seems like a bad idea overall. But we could maybe partition the cache per loop or something? `{loop_id: {...}}`
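
Something like this, keyed on the running loop (names hypothetical):

```python
import asyncio
from collections import OrderedDict
from typing import Any

# Hypothetical sketch: one cache per event loop, keyed by the loop's id().
_caches: dict[int, OrderedDict[Any, Any]] = {}


def _cache_for_running_loop() -> OrderedDict[Any, Any]:
    loop = asyncio.get_running_loop()
    # dict.setdefault with an int key is a single C-level operation in
    # CPython, so the outer mapping avoids the check-then-act race.
    # Note that nothing here evicts a cache when its loop goes away.
    return _caches.setdefault(id(loop), OrderedDict())
```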

bcmills commented 1 month ago

Having multiple loops seems more-or-less inevitable if someone is trying to incrementally adopt async in a large existing codebase.

I would be reluctant to use a per-loop cache: it would depress the hit rate, and I'm not sure how you'd safely evict a loop from the cache to prevent a memory leak if it terminates.

I suppose you could use a threading.Lock instead? If there's only a single event loop, a non-blocking acquire will always succeed; and if you do end up needing to block, I think you could run the acquire in a separate thread (using loop.run_in_executor on the current loop).
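
Roughly this (a hypothetical helper, not actual library code):

```python
import asyncio
import threading

_lock = threading.Lock()


async def _acquire_cache_lock() -> None:
    # Single-loop case: the lock is uncontended, so the non-blocking
    # acquire succeeds immediately and the event loop never blocks.
    if _lock.acquire(blocking=False):
        return
    # Contended case (another loop's thread holds the lock): do the
    # blocking acquire in a worker thread so this loop keeps running.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, _lock.acquire)
```

Each read-modify-write sequence on the cache would then sit between this acquire and a plain _lock.release().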

tabrezm commented 4 days ago

I believe we're running into this problem, or something related, after recently adopting async-lru.

We use a ProcessPoolExecutor to spin up processes, each of which has its own event loop. The event loop gets created and destroyed multiple times over the lifetime of each process. We're seeing an issue where the cached result is sometimes empty for existing processes:

  File "/home/[redacted]/.local/lib/python3.12/site-packages/async_lru/__init__.py", line 208, in __call__
    return cache_item.fut.result()
           ^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/[redacted]/main.py", line 614, in <module>
    success, failed = future.result()
                      ^^^^^^^^^^^^^^^
  File "/opt/python3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/python3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
asyncio.exceptions.CancelledError
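
My best guess at the shape of what's happening (a hypothetical sketch; our real code path differs):

```python
import asyncio

from async_lru import alru_cache


@alru_cache()
async def lookup(key: str) -> str:
    await asyncio.sleep(0)  # stand-in for real I/O
    return key.upper()


def handle_request(key: str) -> str:
    # Inside a long-lived ProcessPoolExecutor worker, asyncio.run()
    # creates and destroys a fresh loop on every call, while the
    # module-level cache persists across calls. A future created (or
    # cancelled during teardown) on an earlier loop can still be in
    # the cache when the next loop looks the key up.
    return asyncio.run(lookup(key))
```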

Dreamsorcerer commented 4 days ago

Well, feel free to try the solution I mentioned and open a PR.