France-ioi / AlgoreaBackend

Backend for the new Algorea platform
MIT License
1 stars 2 forks source link

`start-result-path` request taking 250sec to resolve #1050

Closed smadbe closed 4 months ago

smadbe commented 5 months ago

Ongoing investigation:

curl 'https://dev.algorea.org/api/items/694914435881177216/5/4700/4707/4702/1625159049301502151/5207993233955027100/start-result-path' -X POST -H 'Authorization: Bearer ***'

is timeout after 15sec.

Geoffrey, launching it from his computer, has the request completing after 250sec.

Most of the time is spent on:

WITH visible_items AS (SELECT items.id AS id, requires_explicit_entry FROM permissions_generated AS permissions
                JOIN groups_ancestors_active AS ancestors
                        ON ancestors.child_group_id = 670968966872011405 AND ancestors.ancestor_group_id = permissions.group_id JOIN items ON items.id = permissions.item_id WHERE (IFNULL(can_view_generated_value, 1) >= 3) GROUP BY items.id) (SELECT attempts0.id AS attempt_id0, results0.started_at IS NOT NULL AS has_started_result0, attempts1.id AS attempt_id1, results1.started_at IS NOT NULL AS has_started_result1, attempts2.id AS attempt_id2, results2.started_at IS NOT NULL AS has_started_result2, attempts3.id AS attempt_id3, results3.started_at IS NOT NULL AS has_started_result3, attempts4.id AS attempt_id4, results4.started_at IS NOT NULL AS has_started_result4, attempts5.id AS attempt_id5, results5.started_at IS NOT NULL AS has_started_result5, attempts6.id AS attempt_id6, results6.started_at IS NOT NULL AS has_started_result6 FROM visible_items as items0
                                JOIN attempts AS attempts0 ON attempts0.participant_id = 670968966872011405 AND
                                        (NOT items0.requires_explicit_entry OR attempts0.root_item_id = items0.id)
                                        LEFT JOIN results AS results0 ON results0.participant_id = attempts0.participant_id AND
                                                attempts0.id = results0.attempt_id AND results0.item_id = items0.id JOIN items_items AS items_items1 ON items_items1.parent_item_id = items0.id AND items_items1.child_item_id = 5 JOIN visible_items AS items1 ON items1.id = items_items1.child_item_id
                                JOIN attempts AS attempts1 ON attempts1.participant_id = 670968966872011405 AND
                                        (NOT items1.requires_explicit_entry OR attempts1.root_item_id = items1.id) AND
                                        IF(attempts1.root_item_id = items1.id, attempts1.parent_attempt_id, attempts1.id) = attempts0.id
                                        LEFT JOIN results AS results1 ON results1.participant_id = attempts1.participant_id AND
                                                attempts1.id = results1.attempt_id AND results1.item_id = items1.id JOIN items_items AS items_items2 ON items_items2.parent_item_id = items1.id AND items_items2.child_item_id = 4700 JOIN visible_items AS items2 ON items2.id = items_items2.child_item_id
                                JOIN attempts AS attempts2 ON attempts2.participant_id = 670968966872011405 AND
                                        (NOT items2.requires_explicit_entry OR attempts2.root_item_id = items2.id) AND
                                        IF(attempts2.root_item_id = items2.id, attempts2.parent_attempt_id, attempts2.id) = attempts1.id
                                        LEFT JOIN results AS results2 ON results2.participant_id = attempts2.participant_id AND
                                                attempts2.id = results2.attempt_id AND results2.item_id = items2.id JOIN items_items AS items_items3 ON items_items3.parent_item_id = items2.id AND items_items3.child_item_id = 4707 JOIN visible_items AS items3 ON items3.id = items_items3.child_item_id
                                JOIN attempts AS attempts3 ON attempts3.participant_id = 670968966872011405 AND
                                        (NOT items3.requires_explicit_entry OR attempts3.root_item_id = items3.id) AND
                                        IF(attempts3.root_item_id = items3.id, attempts3.parent_attempt_id, attempts3.id) = attempts2.id
                                        LEFT JOIN results AS results3 ON results3.participant_id = attempts3.participant_id AND
                                                attempts3.id = results3.attempt_id AND results3.item_id = items3.id JOIN items_items AS items_items4 ON items_items4.parent_item_id = items3.id AND items_items4.child_item_id = 4702 JOIN visible_items AS items4 ON items4.id = items_items4.child_item_id
                                JOIN attempts AS attempts4 ON attempts4.participant_id = 670968966872011405 AND
                                        (NOT items4.requires_explicit_entry OR attempts4.root_item_id = items4.id) AND
                                        IF(attempts4.root_item_id = items4.id, attempts4.parent_attempt_id, attempts4.id) = attempts3.id
                                        LEFT JOIN results AS results4 ON results4.participant_id = attempts4.participant_id AND
                                                attempts4.id = results4.attempt_id AND results4.item_id = items4.id JOIN items_items AS items_items5 ON items_items5.parent_item_id = items4.id AND items_items5.child_item_id = 1625159049301502151 JOIN visible_items AS items5 ON items5.id = items_items5.child_item_id
                                JOIN attempts AS attempts5 ON attempts5.participant_id = 670968966872011405 AND
                                        (NOT items5.requires_explicit_entry OR attempts5.root_item_id = items5.id) AND
                                        IF(attempts5.root_item_id = items5.id, attempts5.parent_attempt_id, attempts5.id) = attempts4.id
                                        LEFT JOIN results AS results5 ON results5.participant_id = attempts5.participant_id AND
                                                attempts5.id = results5.attempt_id AND results5.item_id = items5.id JOIN items_items AS items_items6 ON items_items6.parent_item_id = items5.id AND items_items6.child_item_id = 5207993233955027100 JOIN visible_items AS items6 ON items6.id = items_items6.child_item_id
                                JOIN attempts AS attempts6 ON attempts6.participant_id = 670968966872011405 AND
                                        (NOT items6.requires_explicit_entry OR attempts6.root_item_id = items6.id) AND
                                        IF(attempts6.root_item_id = items6.id, attempts6.parent_attempt_id, attempts6.id) = attempts5.id
                                        LEFT JOIN results AS results6 ON results6.participant_id = attempts6.participant_id AND
                                                attempts6.id = results6.attempt_id AND results6.item_id = items6.id WHERE (items0.id = 694914435881177216) AND (items0.id IN ((SELECT groups.root_activity_id FROM `groups_ancestors_active` JOIN `groups` ON groups.id = groups_ancestors_active.ancestor_group_id WHERE (child_group_id = 670968966872011405)) UNION (SELECT groups.root_activity_id FROM `groups_ancestors_active`
                        JOIN group_managers
                                ON group_managers.group_id = `groups_ancestors_active`.ancestor_group_id
                        JOIN groups_ancestors_active AS group_ancestors
                                ON group_ancestors.ancestor_group_id = group_managers.manager_id AND
                                        group_ancestors.child_group_id = 670968966872011405 JOIN `groups` ON groups.id = groups_ancestors_active.child_group_id ) ) OR items0.id IN ((SELECT groups.root_skill_id FROM `groups_ancestors_active` JOIN `groups` ON groups.id = groups_ancestors_active.ancestor_group_id WHERE (child_group_id = 670968966872011405)) UNION (SELECT groups.root_skill_id FROM `groups_ancestors_active`
                        JOIN group_managers
                                ON group_managers.group_id = `groups_ancestors_active`.ancestor_group_id
                        JOIN groups_ancestors_active AS group_ancestors
                                ON group_ancestors.ancestor_group_id = group_managers.manager_id AND
                                        group_ancestors.child_group_id = 670968966872011405 JOIN `groups` ON groups.id = groups_ancestors_active.child_group_id ) )) AND ((NOT items0.requires_explicit_entry OR results0.attempt_id IS NOT NULL) AND (results0.started_at IS NOT NULL OR attempts0.ended_at IS NULL AND NOW() < attempts0.allows_submissions_until AND 1)) AND ((NOT items1.requires_explicit_entry OR results1.attempt_id IS NOT NULL) AND (results1.started_at IS NOT NULL OR attempts1.ended_at IS NULL AND NOW() < attempts1.allows_submissions_until AND attempts0.ended_at IS NULL AND NOW() < attempts0.allows_submissions_until AND 1)) AND ((NOT items2.requires_explicit_entry OR results2.attempt_id IS NOT NULL) AND (results2.started_at IS NOT NULL OR attempts2.ended_at IS NULL AND NOW() < attempts2.allows_submissions_until AND attempts1.ended_at IS NULL AND NOW() < attempts1.allows_submissions_until AND attempts0.ended_at IS NULL AND NOW() < attempts0.allows_submissions_until AND 1)) AND ((NOT items3.requires_explicit_entry OR results3.attempt_id IS NOT NULL) AND (results3.started_at IS NOT NULL OR attempts3.ended_at IS NULL AND NOW() < attempts3.allows_submissions_until AND attempts2.ended_at IS NULL AND NOW() < attempts2.allows_submissions_until AND attempts1.ended_at IS NULL AND NOW() < attempts1.allows_submissions_until AND attempts0.ended_at IS NULL AND NOW() < attempts0.allows_submissions_until AND 1)) AND ((NOT items4.requires_explicit_entry OR results4.attempt_id IS NOT NULL) AND (results4.started_at IS NOT NULL OR attempts4.ended_at IS NULL AND NOW() < attempts4.allows_submissions_until AND attempts3.ended_at IS NULL AND NOW() < attempts3.allows_submissions_until AND attempts2.ended_at IS NULL AND NOW() < attempts2.allows_submissions_until AND attempts1.ended_at IS NULL AND NOW() < attempts1.allows_submissions_until AND attempts0.ended_at IS NULL AND NOW() < attempts0.allows_submissions_until AND 1)) AND ((NOT items5.requires_explicit_entry OR results5.attempt_id IS NOT NULL) AND (results5.started_at IS NOT NULL OR attempts5.ended_at IS NULL AND NOW() < attempts5.allows_submissions_until AND attempts4.ended_at IS NULL AND NOW() < attempts4.allows_submissions_until AND attempts3.ended_at IS NULL AND NOW() < attempts3.allows_submissions_until AND attempts2.ended_at IS NULL AND NOW() < attempts2.allows_submissions_until AND attempts1.ended_at IS NULL AND NOW() < attempts1.allows_submissions_until AND attempts0.ended_at IS NULL AND NOW() < attempts0.allows_submissions_until AND 1)) AND ((NOT items6.requires_explicit_entry OR results6.attempt_id IS NOT NULL) AND (results6.started_at IS NOT NULL OR attempts6.ended_at IS NULL AND NOW() < attempts6.allows_submissions_until AND attempts5.ended_at IS NULL AND NOW() < attempts5.allows_submissions_until AND attempts4.ended_at IS NULL AND NOW() < attempts4.allows_submissions_until AND attempts3.ended_at IS NULL AND NOW() < attempts3.allows_submissions_until AND attempts2.ended_at IS NULL AND NOW() < attempts2.allows_submissions_until AND attempts1.ended_at IS NULL AND NOW() < attempts1.allows_submissions_until AND attempts0.ended_at IS NULL AND NOW() < attempts0.allows_submissions_until AND 1)) AND (results0.attempt_id IS NOT NULL OR attempts0.id = 0) ORDER BY ((results0.started_at IS NULL) << 6) + ((results1.started_at IS NULL) << 5) + ((results2.started_at IS NULL) << 4) + ((results3.started_at IS NULL) << 3) + ((results4.started_at IS NULL) << 2) + ((results5.started_at IS NULL) << 1) + ((results6.started_at IS NULL) << 0), attempts0.id DESC, attempts1.id DESC, attempts2.id DESC, attempts3.id DESC, attempts4.id DESC, attempts5.id DESC, attempts6.id DESC LIMIT 1 FOR UPDATE)

but that request is taking 0.2sec to run in the MySQL CLI. 🤷‍♂️

GeoffreyHuck commented 5 months ago

Here's the current investigations. For now, still no clues.

Note: The lock FOR UPDATE is questionable. Do we really need to lock the rows of all items in the path for ALL SELECT statements on items of other queries in the app while this query is running? It's possible to lock only for UPDATE with FOR SHARE. This would make other queries selecting items in the path return their result, and only block queries that want to update those items.

smadbe commented 5 months ago

Thanks for this analysis.

How do you explain the deadlock with the transaction? Anyway we should avoid having deadlocks at all cost (even if that's probably not the cause of the issue as the CPU is fully used during the 250s). So having some extra waiting because of that lock is normal but the deadlock is a bug.

The lock FOR UPDATE is questionable. Do we really need to lock the rows of all items in the path for ALL SELECT statements on items of other queries in the app while this query is running?

The INSERT (to create the attempt path) is not supposed to be a long query and it is important that no other attempt path is created at the same time, so letting the parallel start-result-path working on the same rows (so for the same user and item) wait looks like a good idea!

It's possible to lock only for UPDATE with FOR SHARE. This would make other queries selecting items in the path return their result, and only block queries that want to update those items.

If I understand well you proposal, it would mean that you may have this scenario if 2 service calls are done "at the same time": the service-call-1 would do the select, detect what is missing, then the service-call-2 would do the select, detect what is missing... and then service-call-1 would do the INSERT... followed by the service-call-2 doing the same insert, so we might duplicate and so issues.

So that seems important to understand the deadlock as this should not happen according to my understanding (so that means my understanding is wrong). I am not sure that is related with the 250s... but that's possible.

smadbe commented 5 months ago

For the deadlock again (and maybe for the timing): Probably you could try to remove the latest queries from your service to see if it changes something:

GeoffreyHuck commented 5 months ago

For now I put the lock on the side because it's not the cause, we can come back to it later.

I continued the investigations, and found something interesting:

Result when the query is constructed in the service (slow):

result: []map[string]interface {}{map[string]interface {}{"attempt_id0":0, "attempt_id1":0, "attempt_id2":0, "attempt_id3":0, "attempt_id4":0, "attempt_id5":0, "attempt_id6":0, "has_started_result0":1, "has_started_result1":1, "has_started_result2":1, "has_started_result3":1, "has_started_result4":1, "has_started_result5":1, "has_started_result6":1}}

We have numbers as a result.

Result when the query is pasted and executed in the service (fast):

result: []map[string]interface {}{map[string]interface {}{"attempt_id0":"0", "attempt_id1":"0", "attempt_id2":"0", "attempt_id3":"0", "attempt_id4":"0", "attempt_id5":"0", "attempt_id6":"0", "has_started_result0":"1", "has_started_result1":"1", "has_started_result2":"1", "has_started_result3":"1", "has_started_result4":"1", "has_started_result5":"1", "has_started_result6":"1"}}

We have strings as a result.

My best bet now is that it's due to the way the parameters are binded in the sql package, and that somehow the int64 are converted into strings, and that's what makes it very slow. But it's just a guess, I haven't investigated the idea further yet. I'll try next week.

Other things I tried: