fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License
536 stars 209 forks source link

fix: recovery waiting for the last session to close (mostly in tests) #4148

Closed dpc closed 2 months ago

dpc commented 3 months ago

Use https://github.com/fedimint/fedimint/pull/4150 to avoid waiting for the final session to finish.

Fix #4043

dpc commented 3 months ago

@elsirion @joschisan will keep it as a draft, so pinging you as you might not see it otherwise

codecov[bot] commented 3 months ago

Codecov Report

Attention: 81 lines in your changes are missing coverage. Please review.

Comparison is base (52f0c96) 58.06% compared to head (5a34de9) 57.82%. Report is 102 commits behind head on master.

:exclamation: Current head 5a34de9 differs from pull request most recent head 53f883b. Consider uploading reports for the commit 53f883b to get more accurate results

Files Patch % Lines
fedimint-core/src/api.rs 6.77% 55 Missing :warning:
fedimint-client/src/module/init/recovery.rs 0.00% 26 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #4148 +/- ## ========================================== - Coverage 58.06% 57.82% -0.24% ========================================== Files 197 193 -4 Lines 43755 43446 -309 ========================================== - Hits 25405 25123 -282 + Misses 18350 18323 -27 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

elsirion commented 3 months ago

Since old federations don't have this API endpoint this seems like a minor API version change? We should detect this, only use it if available and mark the code path in case it's unavailable as deprecated for future removal.

elsirion commented 3 months ago

If get_session_outcome_or_pending_items returned a tuple (current_session_count, pending_items), there would be no need to do a double session-count call before and after.

Good idea, we can mess around with the API on master till we cut a release (we should bump the API version anyway, I forgot that in the original PR).

joschisan commented 3 months ago

The problem is that under load pending_accepted_items may never return via request_current_consensus, the endpoint was not well thought out on my part. I would propose the following for recovery. instead of the two endpoints get_session_outcome and pending_accepted_items I would create an endpoint called current_accepted_items that takes a session index as an argument and returns a Vec for the given session_index. The endpoint does the following: check if the index matches the current session index, if so return the session outcomes items as a vector; then check if the index matches the current session index, if so return the pending accepted items as a vector. Otherwise, the requested index is higher than the current sessions index at which point we either return and Error, None or an empty vector.

We can now use this endpoint for recovery by checking the current session index first and then using current_accepted_items to obtain the history. For indices smaller then the current session index we will obtain the same result as await_session_outcome. In the current session we will either obtain the outcome directly if it has been completed while we downloaded the history, or we try to obtain the pending accepted items via request_current_consensus - however if we are under load we the query strategy will retry the current_accepted_items endpoint until the session completes and we can establish consensus.

Please find the endpoints implementation here: https://github.com/fedimint/fedimint/pull/4150

Benefit of this is that the recovery uses both endpoints await_session_outcome and current_accepted_items so the code can just default to old endpoint if the new is not available with out any change in logic, it may just be slower.

Theoretically we could also just change the old endpoints behaviour and be done with it would it would be very confusingly named.

joschisan commented 3 months ago

I was just thinking maybe we should postpone this to 0.4? Since sessions are at most a minute and will stay that way for 0.3 this I only a little annoying in testing if I understand correctly @dpc? Seems to only really be a problem if sessions get longer… I was considering longer sessions for 0.4 with a synchronised update then we could still improve this. Would give us more time to think about this since endpoints are a little annoying to update.

dpc commented 3 months ago

Please find the endpoints implementation here: https://github.com/fedimint/fedimint/pull/4150

Commented, but generally :+1: approach.

Since old federations don't have this API endpoint this seems like a minor API version change?

That's a good point.

I was just thinking maybe we should postpone this to 0.4?

I don't want to dwell too long and postpone things. We have gazillion other things to work on, so IMO we should do the best we can think of and move on. #4150 with that "closed vs not" flag would be perfect from all angles I can see.

Let's land the improved #4150 if seems right, I will add the code to make that code conditional based on negotiated API version (if easy enough), or I'll just make the code fall back to await_session_outcome if the current_accepted_items fails for any reason, which will make the test go brrrr, simplify all the complications here and work even on old Federations.

dpc commented 3 months ago

@elsirion @joschisan

Depending on how we go about https://github.com/fedimint/fedimint/pull/4150 it I'll have to change this PR in different ways.

I'd like to go with #4150, but make it return (Vec<AcceptedItem>, bool) to tell the client if the session being returned is closed/open, to make caching possible. Then I can just use that for everything, and get rid of bunch of complexity and workarounds I added here, without loosing anything.

joschisan commented 3 months ago

I'd like to go with #4150, but make it return (Vec<AcceptedItem>, bool) to tell the client if the session being returned is closed/open, to make caching possible.

How about we use a SessionStatus enum to return three variants, Complete, Pending and Initial? That would be more expressive. With the bool we can not cover the third case..

dpc commented 3 months ago

What's an Initial? :D

joschisan commented 3 months ago

What's an Initial? :D

Means the session is not yet started.

dpc commented 3 months ago

@joschisan Oh, right, that's useful. In that case it's a bit of a confusing name. Nothing initial about something that doesn't exist yet... ? A None, Missing, NotFound or something maybe? But a detail anyway, can be tweaked later.

Now that I thought about it... even more of a detail... maybe enum variants should be reversed? :D So in the encoding the session status beings with 0 (not opened yet), moves to 1 (in progress), 2 (sealed). :D

Happy to alter my PR on top of it tomorrow one way or the other.

dpc commented 3 months ago

So much better now. :)

justinmoon commented 3 months ago

dev call: merge conflict and we need another review

justinmoon commented 3 months ago

re-approving on @elsirion's behalf

justinmoon commented 2 months ago

Has approvals but CI failing .. re-running

dpc commented 2 months ago

Merge conflict, rebased.