eBay / HomeStore

Storage Engine for block and key/value stores.
Apache License 2.0
23 stars 21 forks source link

Fix grpc crash #595

Closed Besroy closed 3 days ago

Besroy commented 1 week ago

If data channel and raft channel release the same data at the same time, there is a chance that send_response may be called twice, then grpc will crash with error TOO_MANY_OPERATIONS. This pr make sure data channel release data at first.

Signed-off-by: Yawen Zhang yawzhang@ebay.com

codecov-commenter commented 1 week ago

:warning: Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 66.65%. Comparing base (1a0cef8) to head (ee90d97). Report is 95 commits behind head on master.

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #595 +/- ## =========================================== + Coverage 56.51% 66.65% +10.13% =========================================== Files 108 109 +1 Lines 10300 10771 +471 Branches 1402 1470 +68 =========================================== + Hits 5821 7179 +1358 + Misses 3894 2887 -1007 - Partials 585 705 +120 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Besroy commented 1 week ago

now, before we append log, we will make sure that the data corresponding to this log is flushed and the rreq state is changed to DATA_WRITTEN.

so I am think can we remove the release data op in commit phrase?

I think remove release data in req->clear() is a strength way to avoid double release --- actually my fix cannot guarantee the sequence 'data channel release -> append log->commit release', because I found another conflict:

  1. data channel set DATA_WRITTEN at https://github.com/eBay/HomeStore/blob/master/src/lib/replication/repl_dev/raft_repl_dev.cpp#L456 at first
  2. check data received or not before append logs: if state==DATA_WRITTEM, will return directly at https://github.com/eBay/HomeStore/blob/master/src/lib/replication/repl_dev/raft_repl_dev.cpp#L545
  3. data channel release_data && handle_commit release_data at the same time

And other way to guarantee the sequence is to add lock on release_data. cc: @JacksonYao287 @xiaoxichen @raakella1

xiaoxichen commented 1 week ago

@Besroy i think you mean here https://github.com/eBay/HomeStore/blob/f83679af8ce71e71fb5b6f57df6d54047cd7d940/src/lib/replication/repl_dev/raft_repl_dev.cpp#L875

There are 3 places we do release_data()

  1. on_push_data_reveive()
  2. handle_fetch_data_response()
  3. on_commit()

conflicts between 1/3 and 2/3 can be resolved by either remove 3, or finish release_data in data path before setting the DATA_WRITTEN flag as well as the promise.

Conflict between 1/2 will not happen as add_state_if_not_already is atomic, the latter one will return and not proceed to write phase. However there seems an issue in push_data path, we dont send response in this case, @raakella1 pls take a look as well.