carrotquest / django-clickhouse

This project's goal is to integrate the Yandex ClickHouse database into Django projects.
MIT License

LockError: Cannot release an unlocked lock #49

Closed. MilanPecov closed this issue 1 year ago.

MilanPecov commented 1 year ago

This error occurs sporadically in production, and I'm not sure whether it is simply an exception that needs to be handled properly. It occurs in the Celery task that syncs PostgreSQL and ClickHouse; in particular, it happens when Redis attempts to release the lock.

What is the best way to deal with this error? As a result of it, we're seeing some data loss.

Here is the complete stack trace:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django_clickhouse/clickhouse_models.py", line 266, in sync_batch_from_storage
    storage.post_sync(import_key)
  File "/usr/local/lib/python3.8/site-packages/django_clickhouse/storages.py", line 251, in post_sync
    self.get_lock(import_key, **kwargs).release()
  File "/usr/local/lib/python3.8/site-packages/django_clickhouse/redis.py", line 39, in release
    self.lock.release()
  File "/usr/local/lib/python3.8/site-packages/redis/lock.py", line 232, in release
    self.do_release(expected_token)
  File "/usr/local/lib/python3.8/site-packages/redis/lock.py", line 238, in do_release
    raise LockNotOwnedError("Cannot release a lock"
redis.exceptions.LockNotOwnedError: Cannot release a lock that's no longer owned

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/celery.py", line 197, in _inner
    reraise(*exc_info)
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/_compat.py", line 54, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/sentry_sdk/integrations/celery.py", line 192, in _inner
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/django_clickhouse/tasks.py", line 25, in sync_clickhouse_model
    model_cls.sync_batch_from_storage()
  File "/code/apps/core/clickhouse/mixins.py", line 26, in sync_batch_from_storage
    super(DeletableClickHouseModelMixin, cls).sync_batch_from_storage()
  File "/usr/local/lib/python3.8/site-packages/django_clickhouse/clickhouse_models.py", line 271, in sync_batch_from_storage
    storage.post_sync_failed(import_key)
  File "/usr/local/lib/python3.8/site-packages/django_clickhouse/storages.py", line 259, in post_sync_failed
    self.get_lock(import_key, **kwargs).release()
  File "/usr/local/lib/python3.8/site-packages/django_clickhouse/redis.py", line 39, in release
    self.lock.release()
  File "/usr/local/lib/python3.8/site-packages/redis/lock.py", line 230, in release
    raise LockError("Cannot release an unlocked lock")
redis.exceptions.LockError: Cannot release an unlocked lock
```
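For reference, this pair of chained errors matches generic redis-py `Lock` behavior rather than anything project-specific: releasing a lock whose `timeout` has already elapsed raises `LockNotOwnedError`, and a second `release()` on the same (now token-less) lock raises `LockError`. A minimal standalone reproduction, assuming a local Redis instance:

```python
import time

import redis
from redis.exceptions import LockNotOwnedError

r = redis.Redis()                      # assumes Redis on localhost:6379
lock = r.lock("demo-lock", timeout=1)  # lock auto-expires after 1 second

lock.acquire()
time.sleep(2)                          # the "work" outlives the lock timeout

try:
    lock.release()                     # LockNotOwnedError: token expired in Redis
except LockNotOwnedError:
    pass

lock.release()                         # LockError: "Cannot release an unlocked lock"
```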
M1ha-Shvn commented 1 year ago

Hi. Each sync iteration is performed under a [RedisLock](https://redis-py.readthedocs.io/en/v4.1.2/lock.html) (a separate lock for each synced ClickHouseModel). This is done in order to reduce database load. The lock has two parameters limiting its maximum blocking time (redis-py's timeout and blocking_timeout).

The error you have provided tells us that the sync process takes more time than the configured timeout. The timeout can be set in several ways:

  1. The ClickHouseModel.sync_lock_timeout attribute
  2. sync_delay * 10 (the default). sync_delay can also be set as ClickHouseModel.sync_delay or via the global config.SYNC_DELAY setting; see the sketch below.
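For illustration, a sketch of setting these attributes on a model, following the ClickHouseModel pattern from this project's README (the Visit model and its fields are hypothetical):

```python
from infi.clickhouse_orm import fields
from django_clickhouse.clickhouse_models import ClickHouseModel
from django_clickhouse.engines import MergeTree

from my_app.models import Visit  # hypothetical Django model being synced


class ClickHouseVisit(ClickHouseModel):
    django_model = Visit
    sync_delay = 30            # seconds between sync batches for this model
    sync_lock_timeout = 600    # overrides the sync_delay * 10 default

    id = fields.UInt32Field()
    visited_at = fields.DateTimeField()

    engine = MergeTree('visited_at', ('id',))
```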

In order to solve the error, you can:

  1. Investigate statsd metrics, looking for a bottleneck in your sync process. Take a look here for more info. Or you can enable the DEBUG logging level and try digging into the logs; a sample LOGGING entry is sketched below.
  2. Increase sync_lock_timeout or sync_delay for the model you have problems with.
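For the DEBUG logging route, a minimal sketch of a Django LOGGING entry; the logger name follows the package name, which is the usual Python logging convention:

```python
# settings.py
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        # capture DEBUG output from the django_clickhouse sync machinery
        'django_clickhouse': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}
```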
M1ha-Shvn commented 1 year ago

> As a result, we're having some data loss issues.

I don't think it is connected. This error cannot cause data loss. It looks like some other error (not present in the stack trace) is the real cause, but you can't see it because it is what makes the sync exceed the lock timeout.

MilanPecov commented 1 year ago

Hi,

Thank you for the quick reply and for the helpful information.

Yes, our problem is that every Monday we perform a large partial update on one table/model in PostgreSQL, which then needs to be synced into ClickHouse, and this creates a lot of operations in Redis. We will try increasing sync_lock_timeout and monitor what happens.

For additional logging, we use Datadog, which has its own wrapper around statsd. I'm not sure if the current metrics implementation will work out of the box with it, but it looks simple enough, so maybe I'll try to make it work with Datadog, although it may require some minor tweaks.
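If it helps anyone else: DogStatsD (the Datadog agent's statsd listener) speaks the plain statsd protocol on UDP 8125, so pointing the statsd client at the agent may be enough. A sketch, assuming the standard settings read by the Python statsd package's Django integration (the settings names and prefix are assumptions, not confirmed for this project):

```python
# settings.py -- assumes a Datadog agent running alongside the app
# with DogStatsD enabled on its default, statsd-compatible port.
STATSD_HOST = '127.0.0.1'
STATSD_PORT = 8125
STATSD_PREFIX = 'clickhouse'  # hypothetical metric namespace
```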

I'll keep you posted if this solves the problem next Monday.

Thanks again for your help!