Rely on a bool flag to indicate the in preparation status of the snapshots

danielealbano / cachegrand

cachegrand - a modern data ingestion, processing and serving platform built for today's hardware

BSD 3-Clause "New" or "Revised" License

975 stars 34 forks source link

This PR changes how cachegrand determines if a snapshot is being prepared to be generated, it switches from using the status field to an ad-hoc boolean field.

The reason behind this change is the current implementation, when a lot of worker runs in paralle, trigger the preparation code multiple time as consequence of how the status is swapped from the current one to the IN PREPARATION. To solve that issue would be necessary to do not read the current state and then change it to IN PREPARATION, instead it would be necessary to "expect" a very specific and precise value, but the status at that stage can potentially be NONE, COMPLETED or FAILED, which makes it challenging to reliabily identify the expected status.

Here an example of the bug in action

[2023-04-17T09:30:04Z][INFO       ][worker][id: 28][cpu: 28][storage_db_snapshot] Snapshot started
[2023-04-17T09:30:04Z][INFO       ][worker][id: 44][cpu: 44][storage_db_snapshot] Snapshot progress <100.00%>, keys processed <0>, data written <0.00 MB>, eta: <0 seconds>
[2023-04-17T09:30:04Z][INFO       ][worker][id: 44][cpu: 44][storage_db_snapshot] Snapshot completed in <49 ms>
[2023-04-17T09:30:09Z][INFO       ][worker][id: 31][cpu: 31][storage_db_snapshot] Snapshot progress <0.00%>, keys processed <0>, data written <0.00 MB>, eta: <0 seconds>
[2023-04-17T09:30:09Z][INFO       ][worker][id: 29][cpu: 29][storage_db_snapshot] Snapshot started
[2023-04-17T09:30:09Z][INFO       ][worker][id: 14][cpu: 14][storage_db_snapshot] Snapshot started

It's possible to see that the Snapshot started is reported 3 times.

To solve the issue a new boolean flag is introduced, called in_preparation, which complements the running boolean flag, and it's used to determine if a worker is preparing the snapshot to be generated. This approach guarantees that the expected status for the CAS operation has to be false and if the CAS operation it's because another worker set it to true.

The fleg is set back to false, after the flag running is set to true, at the end of the preparation which ensures that the threads will start to safely see the running flag, which has precedence over the status check.

This bug is only affecting deployments which are using plenty of workers/

Codecov Report

Patch coverage: 90.00% and project coverage change: +0.08 :tada:

Comparison is base (f54d217) 78.03% compared to head (e588afa) 78.12%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #405 +/- ## ========================================== + Coverage 78.03% 78.12% +0.08% ========================================== Files 194 194 Lines 13275 13262 -13 ========================================== + Hits 10359 10360 +1 + Misses 2916 2902 -14 ``` | Flag | Coverage Δ | | |---|---|---| | unittests | `78.12% <90.00%> (+0.08%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Daniele+Salvatore+Albano#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://app.codecov.io/gh/danielealbano/cachegrand/pull/405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Daniele+Salvatore+Albano) | Coverage Δ | | |---|---|---| | [src/storage/db/storage\_db.h](https://app.codecov.io/gh/danielealbano/cachegrand/pull/405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Daniele+Salvatore+Albano#diff-c3JjL3N0b3JhZ2UvZGIvc3RvcmFnZV9kYi5o) | `27.27% <ø> (ø)` | | | [src/storage/db/storage\_db\_snapshot.c](https://app.codecov.io/gh/danielealbano/cachegrand/pull/405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Daniele+Salvatore+Albano#diff-c3JjL3N0b3JhZ2UvZGIvc3RvcmFnZV9kYl9zbmFwc2hvdC5j) | `42.76% <88.89%> (+0.69%)` | :arrow_up: | | [src/storage/db/storage\_db.c](https://app.codecov.io/gh/danielealbano/cachegrand/pull/405?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Daniele+Salvatore+Albano#diff-c3JjL3N0b3JhZ2UvZGIvc3RvcmFnZV9kYi5j) | `61.93% <100.00%> (+0.05%)` | :arrow_up: | ... and [4 files with indirect coverage changes](https://app.codecov.io/gh/danielealbano/cachegrand/pull/405/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Daniele+Salvatore+Albano)

danielealbano / cachegrand

Rely on a bool flag to indicate the in preparation status of the snapshots #405

Codecov Report