ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License
46 stars 27 forks source link

`lock_context` does not release lock when it no longer owns it #6668

Open GitHK opened 1 day ago

GitHK commented 1 day ago

What happened?

It was detected that the resource-usage-tracker service did not have a lock inside redis and it was reporting the following error: redis.exceptions.LockNotOwnedError: Cannot reacquire a lock that's no longer owned.

What triggered this situation?

The cause is currently unknown.

What could trigger this situation?

Why is this bad?

Immagine having multiple instances relying on this lock, if somehow it is freed (and the lock_context does nothing), a different instance will acquire it and use it as if the resource (protected by the lock) was free to be used.

How does it present itself?

Consider the following code:

async with lock_context(...):
    # some user defined
    # code here

Two situations compete to producing the issue:

Is there a way to reproduce it in a test?

Yes, apply the following changes in your repo changes.diff.zip

Then run one of the following tests to make the issue appear:


What was already tried?

To mitigate the issue it was proposed to try and stop the user defined code. The only possible solution for a context manger is using signals (which are only available on linux).

When running test tests/test_redis_utils.py::test_context_manager_timing_out the code hangs unexpectedly inside the signal's handler functions. This means the entire asyncio runtime will be halted.

This approach cannot be used.

tests/test_redis_utils.py::test_context_manager_timing_out which produces a very unexpected result.

    def handler(signum, frame):
>       raise RuntimeError(f"Operation timed out after {after} seconds")
E       RuntimeError: Operation timed out after 10 seconds

tests/test_redis_utils.py:190: RuntimeError
> /home/silenthk/work/pr-osparc-redis-lock-issues/packages/service-library/tests/test_redis_utils.py(190)handler()
-> raise RuntimeError(f"Operation timed out after {after} seconds")

Different proposals

Feel free to suggest more below.

matusdrobuliak66 commented 1 day ago

Additional notes/observations:

GitHK commented 1 day ago

Additional notes/observations:

  • During the same period, a similar issue was observed in storage — the same Cannot reacquire a lock error was logged. The storage background task was running properly, and the lock key was present in the Redis lock DB table. (This is strange, the Resource Tracker issue differed because the key was missing there. However, why this error was logged in storage is unknown)

@matusdrobuliak66 can we please make a different issue out of this one? It is something different and I don't want to mix them.

matusdrobuliak66 commented 1 day ago

Additional notes/observations:

  • During the same period, a similar issue was observed in storage — the same Cannot reacquire a lock error was logged. The storage background task was running properly, and the lock key was present in the Redis lock DB table. (This is strange, the Resource Tracker issue differed because the key was missing there. However, why this error was logged in storage is unknown)

@matusdrobuliak66 can we please make a different issue out of this one? It is something different and I don't want to mix them.

Yes, no problem. I just wanted to make a note of it, because since we don't fully understand the issue yet, we can't be certain that those two are not somehow interconnected.