danielealbano closed this pull request 1 year ago.
Patch coverage: 90.00% and project coverage change: +0.08 :tada:
Comparison is base (f54d217) 78.03% compared to head (e588afa) 78.12%.
This PR changes how cachegrand determines whether a snapshot is being prepared for generation. Instead of using the status field, it now uses a dedicated boolean field.
The reason for this change is that, with the current implementation, when many workers run in parallel the preparation code is triggered multiple times because of how the status is swapped from its current value to IN PREPARATION. Fixing this would require not reading the current status and then changing it to IN PREPARATION, but instead "expecting" one specific, precise value. At that stage, however, the status can be NONE, COMPLETED or FAILED, which makes it hard to reliably identify the expected value.
Here is an example of the bug in action:
As can be seen, the "Snapshot started" message is reported 3 times.
To solve the issue, a new boolean flag called `in_preparation` is introduced. It complements the `running` boolean flag and is used to determine whether a worker is preparing the snapshot to be generated. With this approach, the expected value for the CAS operation is always `false`, and if the CAS operation fails, it is because another worker has already set the flag to `true`.

At the end of the preparation, the `in_preparation` flag is set back to `false` only after the `running` flag has been set to `true`. This ensures that the threads safely start to see the `running` flag, which takes precedence over the status check.

This bug only affects deployments that use a large number of workers.