France-ioi / AlgoreaBackend

Backend for the new Algorea platform
MIT License

(re)Organize sync vs async propagation: actual delay vs requirements analysis #1181

Open smadbe opened 2 weeks ago

smadbe commented 2 weeks ago

Preamble

Need for immediate change vs duration

=> hopefully these two are not independent: changes affecting other users are typically the ones more likely to take long, but for those it is not a big deal if they are applied 20 sec later

Not all services have the same impact

Some of the services are called a lot (>100x/sec), while others are called <10x per day by a moderator, or even <1x per month by a super administrator.

group and item ancestor propagation

Requirements:

perm and result propagation

Permissions

UX requirements:

Results

Actual delay:

UX requirements:

Per services

Services with low/medium frequency and limited effect on propagation (so the propagation caused should be very short)

Those affecting the group membership of 1 user. Result propagation is run as well, since the change may give the user visibility on a new item subtree, which then requires computing the results for this new subtree.

Those affecting the result of 1 item and 1 participant, so that may require propagation to "ancestor results" but no unlock:

Those affecting the result of 1 item and 1 participant but that may cause unlock:

Those which remove group for users... which should not change anything for results / perm:

All these should probably do their propagation SYNC (provided they are not applying propagation from other services, of course)

Service with (very) high call frequency and limited impact

itemTaskTokenGenerate: affect 1 participant on 1 item and does not trigger unlock

resultStart/resultStartPath: affect 1 participant on 1 item and does not trigger unlock

These should probably do their propagation SYNC (provided they are not applying propagation from other services, of course). But in addition, they might need a "customized" (lightweight) propagation algorithm tailored to their needs (for instance, we know they cannot trigger unlock and only make simple changes to result ancestors)
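To illustrate what such a lightweight propagation could look like, here is a minimal sketch in Go. It assumes the only work needed is to find the ancestor items whose results must be recomputed for one participant, with no unlock handling at all. The `itemGraph` type and `ancestorsToRecompute` function are hypothetical names for illustration, not existing AlgoreaBackend code, and the in-memory map stands in for the item parent-child relation stored in the DB.

```go
package main

import "fmt"

// itemGraph maps an item ID to the IDs of its direct parents.
// Hypothetical in-memory stand-in for the item parent-child relation.
type itemGraph map[int64][]int64

// ancestorsToRecompute returns every ancestor of itemID exactly once,
// in breadth-first order (closest ancestors first). These are the items
// whose result for the participant may need to be recomputed.
// It deliberately ignores unlocking, which these services cannot trigger.
func ancestorsToRecompute(parents itemGraph, itemID int64) []int64 {
	seen := map[int64]bool{itemID: true}
	queue := []int64{itemID}
	var out []int64
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		for _, p := range parents[cur] {
			if !seen[p] {
				seen[p] = true
				out = append(out, p)
				queue = append(queue, p)
			}
		}
	}
	return out
}

func main() {
	// Item 3 is a child of item 2, which is a child of item 1.
	parents := itemGraph{3: {2}, 2: {1}, 4: {2}}
	fmt.Println(ancestorsToRecompute(parents, 3)) // prints [2 1]
}
```

The point of the sketch is that the work is bounded by the depth of the item hierarchy above one item, for one participant, which is why running it synchronously looks acceptable even at high call frequency.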

Service with a (possibly) larger propagation radius

Those adding multiple users to a group

For these 2 services:

=> so the result/permission propagation should probably be async here

Those affecting permissions

Affecting permissions means many permissions may need to be recomputed. In itself, this probably cannot take that long, but it may retrigger a result propagation (for the same reason as for the previous services about groups), which may take time. So it probably requires async propagation.

Those affecting item structure

These services affect the results of many (possibly all) users, possibly across a deep item hierarchy. This may cause unlocking, which may trigger permission propagation as well. These services are the main (only?) source of problems. So they need to run their propagation async.

But these services also affect the current user:

=> so the propagation impacting the current user probably has to be applied synchronously, and the rest asynchronously.
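One possible way to express this split, as a minimal Go sketch: partition the affected participants into the current user (handled in the request) and everyone else (deferred). `splitParticipants` is a hypothetical helper, and the sketch assumes the set of affected participants is already known, which in practice may itself be part of the expensive work.

```go
package main

import "fmt"

// splitParticipants partitions the participants affected by a change into
// those to propagate synchronously (the current user) and those that can
// be propagated asynchronously (everyone else). Input order is preserved.
func splitParticipants(affected []int64, currentUser int64) (syncIDs, asyncIDs []int64) {
	for _, id := range affected {
		if id == currentUser {
			syncIDs = append(syncIDs, id)
		} else {
			asyncIDs = append(asyncIDs, id)
		}
	}
	return syncIDs, asyncIDs
}

func main() {
	syncIDs, asyncIDs := splitParticipants([]int64{10, 11, 12}, 11)
	fmt.Println(syncIDs, asyncIDs) // prints [11] [10 12]
}
```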

Conclusions

zenovich commented 6 days ago

They should always be fast (??? correct ? or did we have slow item propagation?)

It's fast. I've tried this test on the anonymized DB:

  1. delete from items_ancestors;
     Query OK, 26898 rows affected (2.38 sec)
  2. insert into items_propagate select id, 'todo' from items on duplicate key update ancestors_computation_state='todo';
     Query OK, 8808 rows affected (0.41 sec)
     Records: 4404 Duplicates: 4404 Warnings: 0
  3. Comment out all the propagations from the db-recompute command except for the one related to items ancestors and build the app.
  4. Run: time ./bin/AlgoreaBackend db-recompute <env_name>
     The result is 4.371s.

Note that the test simulates an unrealistic situation where we need to insert all the items_ancestors from scratch. It is the absolute worst case, impossible in prod. Still, it takes only 4 seconds (including loading the environment and running two transactions, the first of which is a dummy). So we can be sure it is always possible to run the items ancestors recalculation synchronously and inside the main transaction, as we used to do and as we do for the groups ancestors (the same algorithm and even the same method in the code).
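For readers unfamiliar with what recomputing all ancestors from scratch involves, here is a minimal in-memory sketch in Go of a transitive closure over parent-child edges. It is not the actual implementation used by db-recompute (which works on DB tables); the `edge` type and `allAncestors` function are hypothetical and only illustrate the idea of iterating until no new (ancestor, child) pair is found.

```go
package main

import "fmt"

// edge is one direct parent-child link between two items (hypothetical type).
type edge struct {
	Parent, Child int64
}

// allAncestors computes, for every item, the set of all its ancestors by
// repeating a pass over the edges until a pass adds nothing new.
func allAncestors(edges []edge) map[int64]map[int64]bool {
	anc := map[int64]map[int64]bool{}
	// add records that a is an ancestor of child; it reports whether
	// this pair was new.
	add := func(child, a int64) bool {
		if anc[child] == nil {
			anc[child] = map[int64]bool{}
		}
		if anc[child][a] {
			return false
		}
		anc[child][a] = true
		return true
	}
	for changed := true; changed; {
		changed = false
		for _, e := range edges {
			if add(e.Child, e.Parent) {
				changed = true
			}
			// Everything above the parent is also above the child.
			for a := range anc[e.Parent] {
				if add(e.Child, a) {
					changed = true
				}
			}
		}
	}
	return anc
}

func main() {
	anc := allAncestors([]edge{{Parent: 1, Child: 2}, {Parent: 2, Child: 3}})
	fmt.Println(len(anc[3]), anc[3][1], anc[3][2]) // prints 2 true true
}
```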

After moving it back into a single transaction and adding some optimizations it takes only 2 seconds:

time ./bin/AlgoreaBackend db-recompute full
Loading environment: full
Running ItemItemStore.CreateNewAncestors()
DONE
./bin/AlgoreaBackend db-recompute full  0.01s user 0.01s system 1% cpu 2.132 total

smadbe commented 5 days ago

They should always be fast

It's fast. I've tried this test on the anonymized DB:

From our discussion on Slack: this answer only applies to the item case of the question. Group ancestor propagation is not necessarily fast.