smadbe commented 2 weeks ago

Preamble

I will not quote all the discussions we have had over the last few months, but of course there is a lot of context behind this one
"propagation" used in general includes all group ancestors, item ancestors, permissions and result propagations
the list of propagations applied by service

Need for immediate change vs duration

some propagations may be quite long (let's say >1sec), some may not be long at all (e.g., a user just updating their own score does not cause thousands of entries to be updated)
some propagations have an effect on the current user, some do not: the user is more willing to wait for the update in the first case

=> hopefully these 2 are not independent... typically those affecting other users are more likely to be long, but for those it is not a big deal if they are applied 20sec later
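The two criteria above (expected duration, and whether the current user needs to see the outcome) can be combined into a simple decision rule. The following is only a minimal sketch in Go: the names `PropagationProfile`, `ChooseMode` and `Mode`, as well as the 1-second threshold, are illustrative assumptions and do not exist in the code base.

```go
package main

import "time"

// Mode says when a propagation is applied relative to the service response.
// (Hypothetical type, for illustration only.)
type Mode int

const (
	Sync  Mode = iota // applied before the service returns
	Async             // scheduled and applied later
	Mixed             // current user's part in sync, the rest later
)

// PropagationProfile is a hypothetical description of one propagation.
type PropagationProfile struct {
	ExpectedDuration   time.Duration // rough upper bound on how long it may take
	AffectsCurrentUser bool          // does the caller need to see the outcome right away?
}

// slowThreshold is an assumed cut-off, taken from the ">1sec" example above.
const slowThreshold = time.Second

// ChooseMode applies the rule of thumb discussed above:
// short propagations run in sync, long ones that only affect other users
// run async, and long ones that also affect the caller are split.
func ChooseMode(p PropagationProfile) Mode {
	if p.ExpectedDuration <= slowThreshold {
		return Sync
	}
	if p.AffectsCurrentUser {
		return Mixed
	}
	return Async
}
```

In this sketch the "hard" case is the third one: a long propagation whose outcome the current user still expects to see immediately.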

Not all services have the same impact

Some of the services are called a lot (>100x/sec), while some are called <10x per day by a moderator, or even <1x month by a super administrator.

group and item ancestor propagation

Requirements:

Internally: inconsistency in the ancestors may have a severe effect on the consistency of the data
UX: better if applied immediately, as a delay would probably create inconsistency in what the services return

Actual speed:

They should always be fast (??? correct ? or did we have slow item propagation?)

perm and result propagation

Permissions

UX requirements:

changes affecting other users may always (?) be postponed to a second later without problems
perm change on oneself on the item we are working on (for instance when creating a new item) should be done immediately

Results

Actual delay:

some changes (such as those affecting the item hierarchy) may affect a large number of users and a large number of items... so a huge number of results
a user updating their own score on a single item should have a limited impact, so it should not take too long

UX requirements:

a user updating their own score on a single item may unlock content... ideally the UI should know immediately about that
update on other users can always (?) wait

Per services

Services with low/medium frequency and limited effect on propagation (so the propagation caused should be very short)

Those affecting the group membership of 1 user. Result propagation is run as it may give the user visibility of a new item subtree... and so require computing the results for this new subtree.

groupInvitationAccept
groupJoinRequestCreate
groupsJoinByCode
groupJoinRequestsAccept
groupLeaveRequestsAccept
groupInvitationsCreate
groupLeave
userDataRefresh
accessTokenCreate
itemEnter

Those affecting the result of 1 item and 1 participant, so that may require propagation to "ancestor results" but no unlock:

itemGetAnswerToken
itemGetHintToken
attemptCreate

Those affecting the result of 1 item and 1 participant but that may cause unlock:

saveGrade (note that we will probably need soon to know immediately in response if it has caused unlock)

Those which remove group for users... which should not change anything for results / perm:

groupMembersRemove
groupRemoveChild
groupDelete
groupUpdate

All these should probably do their propagation SYNC (provided they are not applying propagation from other services, of course)

Services with (very) high call frequency and limited impact

itemTaskTokenGenerate: affects 1 participant on 1 item and does not trigger unlock

resultStart/resultStartPath: affects 1 participant on 1 item and does not trigger unlock

These should probably do their propagation SYNC (provided they are not applying propagation from other services, of course). But in addition they might need a "customized" (lightweight) propagation algorithm that knows their needs (for instance, we know they cannot trigger unlock and only make simple changes to result ancestors)
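To illustrate what such a lightweight propagation could be limited to: for 1 participant on 1 item with no unlock, the only results to revisit are those of the item's ancestors. The sketch below is an in-memory stand-in written in Go. The function `AncestorsOf` and the child -> parents map are assumptions made for illustration; the real service would rely on the stored item ancestors instead.

```go
package main

import "sort"

// AncestorsOf returns every ancestor of itemID, given a child -> parents map.
// These are the only items whose results (for the single participant
// concerned) a lightweight propagation would need to recompute.
// Hypothetical helper: not part of the existing code base.
func AncestorsOf(parents map[int64][]int64, itemID int64) []int64 {
	seen := map[int64]bool{}
	queue := []int64{itemID}
	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]
		for _, parent := range parents[current] {
			if !seen[parent] {
				seen[parent] = true
				queue = append(queue, parent)
			}
		}
	}
	ancestors := make([]int64, 0, len(seen))
	for id := range seen {
		ancestors = append(ancestors, id)
	}
	// Sort only to make the output deterministic.
	sort.Slice(ancestors, func(i, j int) bool { return ancestors[i] < ancestors[j] })
	return ancestors
}
```

The amount of work is bounded by the number of ancestors of one item, regardless of how many users or results exist overall.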

Services with a (possibly) larger propagation radius

Those adding multiple users to a group

contestSetAdditionalTime
groupAddChild

For these 2 services:

The result propagation comes from the fact that adding a participant to a group may give them visibility of a new subtree of items, which may require computing results on those items. The permission propagation would come from a possible unlock caused by that previous operation, which is very unlikely. So even if it probably will not be, the propagation may be long enough to cause a service timeout.
Typically the current user is giving perm to other users. In such a case, seeing the outcome of the change immediately does not matter.

=> so the result/permission propagation should probably be async here

Those affecting permissions

updatePermissions
itemDependencyApply

Affecting permissions means many permissions may need to be recomputed. Per se, it probably cannot be that long, but it may retrigger a result propagation (for the same reason as for the previous services about groups), which may take time. So it probably requires async propagation.

Those affecting item structure

itemCreate
itemDelete
itemUpdate

These services affect the results of many (possibly all) users, possibly over a deep item hierarchy. This may cause unlocking, which may trigger permission propagation as well. These services are the main (only?) source of problems. So they need to run their propagation async.

But but... these are also affecting the current user:

When creating an item, the user gets owner permission on the item (which requires propagation to be effective) and typically the first thing the user wants to do is to edit the item. But currently, they may not be able to see the title of the element they have just created because the propagation hasn't run yet. This is a problem we will need to fix very soon.
The score of the parent chapter may be immediately impacted; it is probably clearer for the user if that score is updated immediately, in sync. (Lower priority)

=> so the propagation impacting the current user probably has to be applied in sync, the rest asynchronously.
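One way to picture this sync/async split is to partition the pending work by participant. This is only a sketch in Go: the type `ResultKey` and the function `SplitPropagation` are hypothetical, and it assumes the pending work can be listed as (participant, item) pairs.

```go
package main

// ResultKey identifies one result to recompute.
// (Hypothetical type, for illustration only.)
type ResultKey struct {
	ParticipantID int64
	ItemID        int64
}

// SplitPropagation partitions the pending results into the part that
// concerns the current user (to be applied in sync, before the service
// returns) and the part that concerns everybody else (to be applied async).
func SplitPropagation(pending []ResultKey, currentUserID int64) (syncPart, asyncPart []ResultKey) {
	for _, key := range pending {
		if key.ParticipantID == currentUserID {
			syncPart = append(syncPart, key)
		} else {
			asyncPart = append(asyncPart, key)
		}
	}
	return syncPart, asyncPart
}
```

With such a split, the service would process `syncPart` within the request and hand `asyncPart` over to whatever mechanism runs the async propagation.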

Conclusions

most services could probably go back to full sync propagation
a few services (3, mainly 2 actually) may require a specific, optimized sync result propagation as they are called very often and only need a small subset of the propagation
there is still async result propagation needed for some services
the services running their propagation synchronously must have a way to just run their part of the propagation
a few services need a mix of sync+async result propagation
the saveGrade will probably need to return what has been unlocked as an effect of its propagation
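For the last point, one possible approach (an assumption, not something decided in this discussion) is to compare what the participant could see before and after the sync propagation. A minimal sketch in Go, where `UnlockedItems` and the "visible item" sets are hypothetical:

```go
package main

import "sort"

// UnlockedItems returns the IDs of the items that are visible after the
// propagation but were not visible before it.
// Hypothetical helper: the sets would have to be obtained from the
// permissions of the participant before and after saveGrade's propagation.
func UnlockedItems(before, after map[int64]bool) []int64 {
	unlocked := []int64{}
	for id, visible := range after {
		if visible && !before[id] {
			unlocked = append(unlocked, id)
		}
	}
	// Sort only to make the output deterministic.
	sort.Slice(unlocked, func(i, j int) bool { return unlocked[i] < unlocked[j] })
	return unlocked
}
```

This only works if the propagation that may unlock content has actually run in sync before the response is built.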

zenovich commented 6 days ago

They should always be fast (??? correct ? or did we have slow item propagation?)

It's fast. I've tried this test on the anonymized DB:

delete from items_ancestors;
    Query OK, 26898 rows affected (2.38 sec)
insert into items_propagate select id, 'todo' from items on duplicate key update ancestors_computation_state='todo';
    Query OK, 8808 rows affected (0.41 sec) Records: 4404 Duplicates: 4404 Warnings: 0
Comment out all the propagations from the db-recompute command except for the one related to items ancestors and build the app.
Run time ./bin/AlgoreaBackend db-recompute <env_name>
    The result is 4.371s.

Note that the test simulates an unrealistic situation where we need to insert all the items_ancestors from scratch. It's the absolute worst case, impossible in prod. Still, it takes only 4 seconds (including loading the environment and running two transactions, the first of which is a dummy). So we can be sure it's always possible to run the items ancestors recalculation synchronously and inside the main transaction, as we used to do and as we do for the groups ancestors (the same algorithm and even the same method in the code).

After moving it back into a single transaction and adding some optimizations, it takes only 2 seconds:

time ./bin/AlgoreaBackend db-recompute full
Loading environment: full
Running ItemItemStore.CreateNewAncestors()
DONE
./bin/AlgoreaBackend db-recompute full  0.01s user 0.01s system 1% cpu 2.132 total

smadbe commented 5 days ago

They should always be fast

It's fast. I've tried this test on the anonymized DB:

From our discussion on Slack: this answer only applies to the item case of the question; group ancestor propagation is not necessarily fast.

France-ioi / AlgoreaBackend

(re)Organize sync vs async propagation: actual delay vs requirements analysis #1181