ITISFoundation / osparc-simcore

🐼 osparc-simcore simulation framework
https://osparc.io
MIT License

`redis.exceptions.LockError` when cloning projects concurrently #5803

Open bisgaard-itis opened 2 months ago

bisgaard-itis commented 2 months ago

Is there an existing issue for this?

Which deploy/s?

No response

Current Behavior

After resolving performance issues in storage, I now see a lot of 502 status codes from the webserver when running https://github.com/wvangeit/osparc-pyapi-tests/tree/master/noninter1 against dalco-master. After digging into Graylog, I see that many (perhaps even all) of them arise from the same exception type in the wb-api-server:

Project [project_uuid='6dc3f228-06cb-11ef-bb37-02420a00f1d5'] already locked in state 'prj_states.locked.status='CLONING''. Please check with support.
Traceback (most recent call last):
  File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_webserver/projects/projects_api.py", line 1531, in lock_with_notification
    async with lock_project(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/scu/.venv/lib/python3.10/site-packages/simcore_service_webserver/projects/lock.py", line 60, in lock_project
    raise ProjectLockError(msg)
redis.exceptions.LockError: Lock for project '6dc3f228-06cb-11ef-bb37-02420a00f1d5' user 61349 could not be acquired

This exception makes sense because the test attempts to create 100 clones of the project '6dc3f228-06cb-11ef-bb37-02420a00f1d5' at the same time. If cloning requires a lock, then some of these requests will definitely fail. The question is how to solve it.
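The failure mode can be reproduced in isolation. The following is a minimal sketch, not the webserver code: it uses an in-process `asyncio.Lock` as a stand-in for the Redis lock, and `ProjectLockedError`, `clone_non_blocking` and `run_non_blocking_clones` are hypothetical names made up for the illustration. It assumes the lock is taken in a non-blocking fashion, so a task gives up as soon as it finds the lock held.

```python
import asyncio


class ProjectLockedError(Exception):
    """Hypothetical stand-in for the error raised when the project is locked."""


async def clone_non_blocking(lock: asyncio.Lock, hold_s: float) -> None:
    # Emulates a non-blocking acquire: fail at once if the lock is held.
    if lock.locked():
        raise ProjectLockedError("project already locked")
    async with lock:
        # Pretend to copy the project while holding the lock.
        await asyncio.sleep(hold_s)


async def run_non_blocking_clones(n: int) -> tuple[int, int]:
    """Start n clone tasks at once and return (succeeded, failed)."""
    lock = asyncio.Lock()
    results = await asyncio.gather(
        *(clone_non_blocking(lock, 0.01) for _ in range(n)),
        return_exceptions=True,
    )
    failed = sum(isinstance(r, ProjectLockedError) for r in results)
    return n - failed, failed


if __name__ == "__main__":
    print(asyncio.run(run_non_blocking_clones(100)))
```

In this sketch only the first task gets the lock; every other task that starts while it is held fails immediately, which matches the burst of errors seen when 100 clones are requested together.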

Expected Behavior

No response

Steps To Reproduce

No response

Anything else?

No response

bisgaard-itis commented 2 months ago

@sanderegg I can see you have been working on this; it would be great to discuss what a potential solution could be. My immediate intuition is that other tasks in the event loop that also need the lock should wait until it is released instead of raising an exception straight away. I guess that's how a mutex works with threads.

bisgaard-itis commented 2 months ago

One approach would be to remove the blocking=False here and instead introduce a blocking timeout.
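To illustrate the idea of waiting with a timeout, here is a minimal sketch. It does not touch Redis or the webserver code: an in-process `asyncio.Lock` plus `asyncio.wait_for` stands in for a lock acquired with a blocking timeout (redis-py's `Lock` accepts `blocking` and `blocking_timeout` arguments for this purpose). The names `ProjectLockedError`, `clone_with_timeout` and `run_waiting_clones` are hypothetical and exist only for this example.

```python
import asyncio


class ProjectLockedError(Exception):
    """Hypothetical stand-in for the error raised when the lock is unavailable."""


async def clone_with_timeout(
    lock: asyncio.Lock, hold_s: float, blocking_timeout_s: float
) -> None:
    try:
        # Wait for the lock, but give up after blocking_timeout_s seconds.
        await asyncio.wait_for(lock.acquire(), timeout=blocking_timeout_s)
    except asyncio.TimeoutError as err:
        raise ProjectLockedError("lock not acquired in time") from err
    try:
        # Pretend to copy the project while holding the lock.
        await asyncio.sleep(hold_s)
    finally:
        lock.release()


async def run_waiting_clones(
    n: int, hold_s: float, blocking_timeout_s: float
) -> tuple[int, int]:
    """Start n clone tasks at once and return (succeeded, failed)."""
    lock = asyncio.Lock()
    results = await asyncio.gather(
        *(clone_with_timeout(lock, hold_s, blocking_timeout_s) for _ in range(n)),
        return_exceptions=True,
    )
    failed = sum(isinstance(r, ProjectLockedError) for r in results)
    return n - failed, failed


if __name__ == "__main__":
    # Generous timeout: every clone eventually gets its turn.
    print(asyncio.run(run_waiting_clones(20, 0.005, 5.0)))
    # Timeout shorter than one clone: the waiting tasks still fail.
    print(asyncio.run(run_waiting_clones(3, 0.2, 0.05)))
```

One consequence of this design is that the clones run one after another, so the timeout has to cover the whole queue (roughly the number of waiting requests times the duration of one clone). If it is shorter than that, the later requests still fail, only more slowly.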

bisgaard-itis commented 2 months ago

Closing this due to this

bisgaard-itis commented 2 months ago

Reopening this due to a comment by @sanderegg. Potential solutions:

pcrespov commented 1 month ago

The workaround is to use templates and take advantage of a bug that you have right now there :-)