eBay / HomeStore

Storage Engine for block and key/value stores.
Apache License 2.0
23 stars 21 forks source link

Implement GC_REPL_REQ Based on DSN to Prevent Resource Leaks #576

Closed xiaoxichen closed 3 weeks ago

xiaoxichen commented 1 month ago

This commit introduces a mechanism to garbage collect (GC) replication requests (rreqs) that may hang indefinitely, thereby consuming memory and disk resources unnecessarily. These rreqs can enter a hanging state under several circumstances, as outlined below:

  1. Scenario with Delayed Commit:

    • Follower F1 receives LSN 100 and DSN 104 from Leader L1 and takes longer than the raft timeout to precommit/commit it.
    • L1 resends LSN 100, causing F1 to fetch the data again. Since LSN 100 was committed in a previous attempt, this log entry is skipped, leaving the rreq hanging indefinitely.
  2. Scenario with Leader Failure Before Data Completion:

    • Follower F1 receives LSN 100 from L1, but before all data is fetched/pushed, L1 fails and L2 becomes the new leader.
    • L2 resends LSN 100 with L2 as the new originator. F1 proceeds with the new rreq and commits it, but the initial rreq from L1 hangs indefinitely as it cannot fetch data from the new leader L2.
  3. Scenario with Leader Failure After Data Write:

    • Follower F1 receives data (DSN 104) from L1 and writes it. Before the log of LSN 100 reaches F1, L1 fails and L2 becomes the new leader.
    • L2 resends LSN 100 to F1, and F1 fetches DSN 104 from L2, leaving the original rreq hanging.

This garbage collection process cleans up based on DSN. Any rreqs in m_repl_key_req_map, whose DSN is already committed (rreq->dsn < repl_dev->m_next_dsn), will be GC'd. This is safe on the follower side, as the follower updates m_next_dsn during commit. Any DSN below cur_dsn should already be committed, implying that the rreq should already be removed from m_repl_key_req_map.

On the leader side, since m_next_dsn is updated when sending out the proposal, it is not safe to clean up based on m_next_dsn. Therefore, we explicitly skip the leader in this GC process.

codecov-commenter commented 1 month ago

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 51.35135% with 18 lines in your changes missing coverage. Please review.

Project coverage is 67.28%. Comparing base (1a0cef8) to head (8ccd8ee). Report is 85 commits behind head on master.

Files with missing lines Patch % Lines
src/lib/replication/repl_dev/raft_repl_dev.cpp 45.45% 17 Missing and 1 partial :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #576 +/- ## =========================================== + Coverage 56.51% 67.28% +10.77% =========================================== Files 108 109 +1 Lines 10300 10677 +377 Branches 1402 1459 +57 =========================================== + Hits 5821 7184 +1363 + Misses 3894 2801 -1093 - Partials 585 692 +107 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.