digitalfabrik / integreat-cms

Simplified content management back end for the Integreat App - a multilingual information platform for newcomers
https://digitalfabrik.github.io/integreat-cms/
Apache License 2.0
55 stars 33 forks source link

Race conditions in tree operations causing Internal Server Errors #1518

Open timobrembeck opened 2 years ago

timobrembeck commented 2 years ago

Describe the Bug

When content forms are submitted several times in succession too quickly, the same translation version can be created multiple times.

Steps to Reproduce

Send edit requests multiple times after another (for me, it didn't work to push the "save" button multiple times, but I could simulate this beaviour and we know that it already happened in the production system, so somehow this problem definitely exists)

Maybe, this is a combination of autosave and the real saving process?

Expected Behavior

There should only exist exactly one database object for each translation version.

Actual Behavior

In some cases, a version can be created multiple times, which breaks the CMS functionality in multiple places.

ulliholtgrave commented 2 years ago

As a first approach we could deactivate the form buttons after they got clicked. That won't really hurt and we can eliminate this possible source for the problem.

dkehne commented 1 year ago

Can you specify that a bit more that we can priorize it in our service team?

timobrembeck commented 1 year ago

Can you specify that a bit more that we can priorize it in our service team?

We already released a quick fix so no duplicate versions are created and an internal server error is shown instead. If you received no complaints from cities that they get errors when trying to save pages, this does not seem to be a big problem anymore. But we can also evaluate our log files to estimate how frequently this error occurs.

ulliholtgrave commented 1 year ago

However, this is probably the same problem that created the two Landshuts yesterday. As this error always requires some manual cleanup I would prioritize it rather high.

timobrembeck commented 1 year ago

For reference: This is the corresponding issue in our tree library which could have an impact on this: https://github.com/django-treebeard/django-treebeard/issues/53

timobrembeck commented 1 year ago

One example traceback during link replacement:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/views/generic/base.py", line 70, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/views/generic/base.py", line 98, in dispatch
    return handler(request, *args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/integreat_cms/cms/views/linkcheck/linkcheck_list_view.py", line 179, in post
    new_translation.save()
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/integreat_cms/cms/models/abstract_content_translation.py", line 470, in save
    super().save(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 739, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 776, in save_base
    updated = self._save_table(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 881, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 919, in _do_insert
    return manager._insert(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
    cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/cacheops/transaction.py", line 98, in execute
    result = self._no_monkey.execute(self, sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pagetranslation_unique_version"
DETAIL:  Key (page_id, language_id, version)=(9652, 1, 3) already exists.

Traceback in page form:

    super().save(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 739, in save
    self.save_base(using=using, force_insert=force_insert,
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 776, in save_base
    updated = self._save_table(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 881, in _save_table
    results = self._do_insert(cls._base_manager, using, fields, returning_fields, raw)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/base.py", line 919, in _do_insert
    return manager._insert(
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/query.py", line 1270, in _insert
    return query.get_compiler(using=using).execute_sql(returning_fields)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1416, in execute_sql
    cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/cacheops/transaction.py", line 98, in execute
    result = self._no_monkey.execute(self, sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
    return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
    return executor(sql, params, many, context)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/opt/integreat-cms/.venv/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
    return self.cursor.execute(sql, params)
django.db.utils.IntegrityError: duplicate key value violates unique constraint "pagetranslation_unique_version"
DETAIL:  Key (page_id, language_id, version)=(15753, 3, 18) already exists.
svenseeberg commented 1 year ago

As far as I understand there are two things that go wrong:

Is this correct?

timobrembeck commented 1 year ago

Yes, however converting the library to using the Django ORM is probably unrealistic (even for a large upstream contribution), and I don't know how a custom mutex would be implemented here... maybe the same way as django-linkcheck would suffice?

https://github.com/DjangoAdminHackers/django-linkcheck/blob/8cb97b165b1d26f498896c8cda61f26d98f7a8ec/linkcheck/__init__.py#L7

svenseeberg commented 1 year ago

maybe the same way as django-linkcheck would suffice?

True, this would probably solve most issues that do currently arise. There are some edge cases (crashing processes, full hard disk, ...) for which a real transaction would be even better. But these are probably extremely rare in comparison to those race conditions.

svenseeberg commented 1 year ago

I grepped the logs today. We have only 12 of these messages in 14 days. I think it is okay to remove the "urgent" label.

svenseeberg commented 1 year ago

Removing the urgent label gave us bad karma it seems: in the last week we had broken tree structures in the database. It seems that changes in the tree structure were aborted right in the middle of a change. While there was no error in the application log, the solution seems to be the same as this issue here: we need transactions in Tree Beard.

timobrembeck commented 1 year ago

As mentioned in https://github.com/digitalfabrik/integreat-cms/pull/2082#issuecomment-1438535513, I'm pretty sure this is not completely resolved yet and needs further attention.

svenseeberg commented 1 year ago

Okay, maybe I have to stop referencing the issue in commits that I force push ;-)

svenseeberg commented 1 year ago

As mentioned in #2082 (comment), I'm pretty sure this is not completely resolved yet and needs further attention.

To make sure I understand what you're getting at: While the move() method is called whenever a page is saved, the page object is additionally being saved by the Django ORM. That means it touches the lft, rgt & tree_id values, potentially at the same time when another request/process has the class lock acquired. That would basically mean we need to acquire the lock when calling the save() method?

timobrembeck commented 1 year ago

I think the culprit is the somewhat confusing MoveNodeForm which is also instantiated in the normal page form. This also calls e.g. the model methods add_root()/add_child()/add_sibling(), even sometimes when the position of a page is not changed. (Because it always uses sibling nodes to remember the position of a page, a move from "left sibling of second sub-page" to "first child of parent page" causes the database queries to be executed, even if they do not change the position. But still, they have the potential of breaking something when running concurrently.)

So there are places where raw SQL queries are executed besides save() and move(), e.g. here. Another argument why this should reside in the library rather than in our CMS :sweat_smile:

I'm not completely sure which place would make the most sense here, we could e.g. overwrite the save()-method of our CustomTreeNodeForm? This will however cause dead locks in some cases, because the move() method is called somewhere in the sub-steps of the form's save() method... :cry:

Is there a way of checking whether the lock is already acquired by the currently executing code? Probably not, right? :see_no_evil:

So maybe we have to rigorously duplicate the mutex code for every model method which touches the database?

svenseeberg commented 1 year ago

Right. I expect we should see at least a decrease in our error rate with the current solution. If that is the case I'd be confident enough in the solution to create an upstream PR that fixes the problem for all queries.

timobrembeck commented 1 year ago

Quoting from the other issue:

Now that we have the mutex, I think we can definitely say that we can barely see a difference to before. We currently have 18 duplicate key violations in our log files. AFAICT this is roughly the same amount we had before we introduced the locking mechanism. I think we should remove the mutex and need to rethink the cause of the duplicate key violations.

I'm actually not sure anymore, I could also imagine that page moving was never really the problem and most of the race conditions come from normal page form saving (as described above)... If that's the case, then the lack of reduction of errors does not necessarily mean that django-db-mutex is not working, but maybe only that we're using it in the wrong places. Before discarding that library completely, we could also try to wrap all the node model methods into the mutex as well.

timobrembeck commented 1 year ago

Another idea from @ulliholtgrave: We could also move the page tree into their separate db tables based on the region, this would also make sure that they're independent and prevent and cross-region-race-conditions... This would also require some refactoring, see e.g. https://stackoverflow.com/questions/5036357/single-django-model-multiple-tables

svenseeberg commented 9 months ago

https://github.com/django-treebeard/django-treebeard/issues/278 could be relevant to solve this issue.

svenseeberg commented 9 months ago

Another idea from @ulliholtgrave: We could also move the page tree into their separate db tables based on the region, this would also make sure that they're independent and prevent and cross-region-race-conditions... This would also require some refactoring, see e.g. https://stackoverflow.com/questions/5036357/single-django-model-multiple-tables

Personally, I dislike for the idea of creating tables per region. That happened in WP and always was tedious to work with.