BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0
1.99k stars 443 forks source link

Server: validator - assimilator race condition - should validator set assimilate_state? #5706

Open bema-aei opened 1 month ago

bema-aei commented 1 month ago

Occasionally Einstein@Home encounters the following situation:

Also, when setting assimilate_state, the validator doesn't care about results of the workunit that are still "in progress". Triggering assimilation (and subsequently file deletion) of a workunit immediately after finding a canonical result means that these "late" results can not be (successfully) validated, assimilated and credited even if these are valid and arrive on time, i.e. before their deadline.

I was about to modify the validator to not set assimilate_state directly, but just let the transitioner know to take a look at that workunit (immediately). It does handle such late results correctly and, and the transitioner doesn't have such a delay between reading the workuits and writing modifications as the validator has (at least occasionally).

But then I came across that comment in the validator:

https://github.com/BOINC/boinc/blob/master/sched/validator.cpp#L677-L679

                // if we found a canonical result,
                // trigger the assimilator, but do NOT trigger
                // the transitioner - doing so creates a race condition

What race condition is this comment referring to, and how would we resolve the problems above then?

brevilo commented 1 month ago

Note, you want to use permalinks when referring to code, so the link stays valid over time (i.e. in context with the issue). See the ellipsis in the upper-right corner of the file view.

brevilo commented 1 month ago

What race condition is this comment referring to

Does the actual commit help? Maybe even try and tap into Bruce's memory?

bema-aei commented 1 month ago

Thanks it does.

The interesting thing here is that the commit actually doesn't change the wu.assimilate_state = ASSIMILATE_READY; line, it is just left untouched. This means the code before that commit did both, set assimilate_state to trigger assimilation and trigger immediate transition. This will surely create a race condition.

Bruce was correct to avoid this, but I think he fixed it at the wrong end. IMHO immediate transition should be triggered here, and not assimilation directly.