smadbe commented 5 months ago

Motivations

Currently under investigation (3/04/24): removing a children from the 694914435881177216 chapter is timeout-ing on pre-prod while the propagation is supposed to take place asynchronously.

Tasks

Identify what is slow in this request
Fix it

GeoffreyHuck commented 4 months ago

Test report

Scenario: Add a child to 694914435881177216, then while the propagation is going, remove the child.

Result: DEADLOCK after a wait of ~15s

[2024-05-02 13:23:37]  Error 1213: Deadlock found when trying to get lock; try restarting transaction
INFO (/home/debian/go/pkg/mod/github.com/!france-ioi/gorm@v0.0.0-20200829222212-0373d329b88b/scope.go:1055)
[2024-05-02 13:23:37]  [14703.99ms]  DELETE FROM `items_items`  WHERE (parent_item_id = 694914435881177216) AND (child_item_id NOT IN(4,5,6,4000,4002,4006,4020,4021,4022,4025,4026,4027,4105,4107,4205,4207,4307,4405,4407,4507,4605,4607,4705,4707,4805,4807,4905,4907,5005,5007,5205,5207,5405,5407))
[0 rows affected or returned ]
INFO (context.Background)
[2024-05-02 13:23:37]  sql-conn-exec map[duration:15.496955568s err:<nil> query:INSERT INTO permissions_propagate (group_id, item_id, propagate_to) SELECT parents.group_id, items_items.child_item_id, 'self' as propagate_to FROM items_items JOIN permissions_generated AS parents ON parents.item_id = items_items.parent_item_id JOIN permissions_propagate AS parents_propagate ON parents_propagate.group_id = parents.group_id AND parents_propagate.item_id = parents.item_id WHERE parents_propagate.propagate_to = 'children' ON DUPLICATE KEY UPDATE propagate_to='self']
INFO (context.Background)
[2024-05-02 13:23:37]  sql-res-rowsAffected map[duration:1.3µs err:<nil>]
INFO (/home/debian/go/pkg/mod/github.com/!france-ioi/gorm@v0.0.0-20200829222212-0373d329b88b/scope.go:1055)
[2024-05-02 13:23:37]  [15497.45ms]
                INSERT INTO permissions_propagate (group_id, item_id, propagate_to)
                SELECT
                        parents.group_id,
                        items_items.child_item_id,
                        'self' as propagate_to
                FROM items_items
                JOIN permissions_generated AS parents
                        ON parents.item_id = items_items.parent_item_id
                JOIN permissions_propagate AS parents_propagate
                        ON parents_propagate.group_id = parents.group_id AND parents_propagate.item_id = parents.item_id
                WHERE parents_propagate.propagate_to = 'children'
                ON DUPLICATE KEY UPDATE propagate_to='self'
[5146 rows affected or returned ]

Note: Tested with sync propagation, but the behavior should be the same with async (we wouldn't have the propagation logs in this case).

The propagation query in question is the first query of the permission propagation transaction. We know it's the first query of a transaction because we now have step by step transactions for the permissions propagation.

The DELETE FROM... query comes from the updateItem service.

Before further analysis, I'll do a report on the locks acquired by the updateItem service.

GeoffreyHuck commented 4 months ago

Write locks acquisition of `updateItem` service.

Note: When a write lock is acquired on a row, it blocks all concurrent reads of the row.

Only the queries that acquire write locks.

Query 1

SELECT 
  items.participants_group_id, 
  items.type, 
  MAX(can_edit_generated_value) AS can_edit_generated_value, 
  items.duration, 
  items.requires_explicit_entry 
FROM 
  permissions_generated AS permissions 
  JOIN groups_ancestors_active AS ancestors ON ancestors.child_group_id = 931899637246949996 
  AND ancestors.ancestor_group_id = permissions.group_id 
  JOIN items ON items.id = item_id 
WHERE 
  (item_id = 694914435881177216) 
GROUP BY 
  item_id 
HAVING 
  (
    MAX(can_view_generated_value) >= 3
  ) 
  AND (
    MAX(can_edit_generated_value) >= 2
  ) FOR 
UPDATE

Here we lock for write the rows of permissions_generated for write. Those with item_id=694914435881177216. At first sight, it seems suspicious. Why do we want to block all reading of permissions while we update the item? It'll take some time to do the propagation anyway, and between the end of the service, and the propagation, the lock will be gone, and the rows in permissions_generated will stay the same.

Query 2

SELECT 
  permissions.*, 
  items.type 
FROM 
  `items` 
  JOIN LATERAL (
    SELECT 
      permissions.item_id, 
      MAX(can_view_generated_value) AS can_view_generated_value, 
      MAX(can_grant_view_generated_value) AS can_grant_view_generated_value, 
      MAX(can_watch_generated_value) AS can_watch_generated_value, 
      MAX(can_edit_generated_value) AS can_edit_generated_value, 
      MAX(is_owner_generated) AS is_owner_generated 
    FROM 
      permissions_generated AS permissions 
      JOIN groups_ancestors_active AS ancestors ON ancestors.ancestor_group_id = permissions.group_id 
    WHERE 
      (
        ancestors.child_group_id = 931899637246949996
      ) 
      AND (permissions.item_id = items.id) 
    GROUP BY 
      permissions.item_id 
    HAVING 
      (
        MAX(can_view_generated_value) >= 2
      )
  ) AS permissions ON permissions.item_id = items.id 
WHERE 
  (
    items.id IN (
      4, 5, 6, 4004, 4005, 4024, 4026, 4027, 
      4105, 4107, 4207, 4305, 4307, 4407, 
      4507, 4605, 4607, 4705, 4707, 4805, 
      4807, 4905, 4907, 5005, 5007, 5205, 
      5207, 5405, 5407, 5605, 5607, 5805, 
      5807, 5905, 5907, 6005, 6007, 1435350508173515102
    )
  ) FOR 
UPDATE

Here, we lock for write all the rows in items of the children of the item we're updating. Again it looks suspicious. It means that we block the display of the page of an item, even if it's being accessed from a different path (ex. If Chapter 4004 is also a root chapter, accessing it's page would be blocked because we are updating another of its parent).

Query 3

SELECT 
  child_item_id AS item_id, 
  category, 
  score_weight, 
  content_view_propagation_value, 
  upper_view_levels_propagation_value, 
  grant_view_propagation, 
  watch_propagation, 
  edit_propagation, 
  request_help_propagation 
FROM 
  `items_items` 
WHERE 
  (
    items_items.parent_item_id = 694914435881177216
  ) FOR 
UPDATE

Here, we get a write lock on items_items with parent_item_id of the item we're updating. This looks more reasonable, but yet... Why would we block the readings of say, grant_view_propagation, which is present in items_items of the ALL the children. It would block the propagation.

Query 4

This is the DELETE FROM... query with the previous deadlock, and also the one that was very slow in a previous step.

DELETE FROM 
  `items_items` 
WHERE 
  (parent_item_id = ?) 
  AND (
    child_item_id NOT IN(
      ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 
      ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, 
      ?, ?, ?, ?, ?, ?, ?
    )
  )

Here, we get write lock only on the exact rows that are deleted (a subset of the previous SELECT).

GeoffreyHuck commented 4 months ago

Test Report 2

Background: For now, I removed the write locks that were suspiscious

Scenario: Run updateItem to delete a child, and immediately after (while the propagation) is running, add the child again

The services completed. Then the two permissions propagation ran concurrently and were very slow, and the connection to the database did timeout. This shouldn't happen in production though. The async propagation is made so that propagation cannot run concurrently.

Problem: The display of the item's page on the server was inaccessible during the permission propagation. This means that the permission propagation locks for write things that we need to read to display the item's page. And the read is blocked until the end of the transaction. This is suspicious. And since one propagation transaction can run for more than 30s as of today, we shouldn't do that.

GeoffreyHuck commented 4 months ago

How it should be

Before going to deep about the current state of the locking, and before checking what's possible or not, here's a quick note on how I think it should be:

Propagation of all sort are background processes that shouldn't impact other processes. In particular: propagation MUST NOT lock anything that is accessed by other processes. It MUST NOT block any read. If for some reason it has to, then we should rewrite the processes so it's not the case.
Propagation tables must be the only ones that are locked and updated within a propagation, not other table rows must be locked. This is not the case
No other tables than those of propagation must be updated during the propagation (seems to already be the case).
Propagation tables must never be updated from outside a propagation (this already is already the case, I think but didn't verify).
The propagation will "eventually" be done. But we don't know when. No process should wait for the propagation to be done. We accept that people still have access to something even when the access has been removed, until the propagation has been executed. We have two counter examples to this that will need to be fixed (fortunately they are related to groups propagation, so it's not critical):
- The CreateRelation function, used in the groupAddChild and the accessTokenCreate and refreshToken (for the badges) services.
- The DeleteRelation function, used only in the groupRemoveChild service.

GeoffreyHuck commented 4 months ago

Current state of the locks

Write locks were added too lightly in the codebase (a write lock, once acquired, blocks all reads for the locked rows, until the end of the transaction which acquired the lock). We probably don't need most of them. InnoDB is capable of detecting most of the issues to not require manual locking (I cannot tell when yet, need more research). A minority should be justified, otherwise none would have been added in the first place.

Problem is, it's way easier to add a non needed lock, than to be sure that a lock is not needed.

For the future, we should write a detailed and complete justification for each lock we add. This will allow to keep track of them, and also remove or rewrite once we have a better knowledge of locks.

Things to look for:

Optimistic locking can be used when we know that collision can happen but is very unlikely: https://stackoverflow.com/questions/129329/optimistic-vs-pessimistic-locking (note: setting write locks manually is also called pessimistic locking, and used when a collision is likely)
From MySQL manual, https://dev.mysql.com/doc/refman/8.0/en/innodb-deadlocks-handling.html
The best: remove locks that are not needed

GeoffreyHuck commented 4 months ago

Fixed by https://github.com/France-ioi/AlgoreaBackend/pull/1066 (hopefully). I close this.

France-ioi / AlgoreaBackend

Explain why it is still taking time to update children #1051

Motivations

Tasks

Test report

Write locks acquisition of `updateItem` service.

Query 1

Query 2

Query 3

Query 4

Test Report 2

How it should be

Current state of the locks

France-ioi / AlgoreaBackend

Explain why it is still taking time to update children #1051

Motivations

Tasks

Test report

Write locks acquisition of updateItem service.

Query 1

Query 2

Query 3

Query 4

Test Report 2

How it should be

Current state of the locks

Write locks acquisition of `updateItem` service.