fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License
536 stars 210 forks source link

Rearchitect modularized backup recovery #2977

Closed elsirion closed 3 months ago

elsirion commented 9 months ago

Edit: TODO checklist:

Previous discussion

In the design described below a badly implemented module could stall the recovery of all other modules and we decided against it. I leave it here for historic context. For the actual proposal see https://github.com/fedimint/fedimint/issues/2977#issuecomment-1819626967

OUTDATED:

The current recovery interface was built in a hurry to bring the new client library to feature-parity with the old one so that a switch could happen. Unfortunately this lead to a design with multiple shortcomings mostly stemming from the fact that recovery is modeled as a module state machine and the different modules cannot easily interact.

Proposed Architecture

I already talked to @dpc about the following new architecture:

This design solves the problems mentioned above:

justinmoon commented 9 months ago
elsirion commented 9 months ago

What happens if client is restarted during recovery

Good that you bring that up, maybe the default build method would return some enum:

enum ClientOrRecovering {
    Initialized(Client),
    Recovering(RecoveringClient),
}

while the recovering build fn always returns a RecoveringClient?

EDIT: and yes, you'd resume from the last checkpoint.

justinmoon commented 6 months ago

Dev call: this description is a little outdated

dpc commented 5 months ago

New design:

elsirion commented 5 months ago

Some more ideas:

(Old notes: https://github.com/fedimint/fedimint/pull/3008#issuecomment-1791188772)

dpc commented 5 months ago
  • Separate init fn on ClientModuleInit trait that return client module and recovery SM to be spawned (idk if there's a more elegant way now)

We can just pass snapshot: Option<&[u8]> to init I guess...

dpc commented 5 months ago

I'm tempted to have async fn recover(&self, args: &ClientModuleInitArgs<Self>) -> anyhow::Result<Self::Recovery> and a whole new trait trait ClientModuleRecovery with something like;

trait ClientModuleRecovery {

  async fn make_progress(self) -> Option<Self>;

  fn progress(&self) -> f32;

}

and track them separately. Otherwise it will just pollute the normal client with an extra global mode everywhere.

@elsirion

elsirion commented 5 months ago

We can just pass snapshot: Option<&[u8]> to init I guess...

How would we distinguish "recover without backup" and "no recovery"?

elsirion commented 5 months ago

By having a make_progress fn, do you want to implement an entirely new mechanism of driving recovery forward? As I understood it we'd let the SM executor do most of the heavy lifting and just add methods to track progress.

dpc commented 5 months ago

By having a make_progress fn, do you want to implement an entirely new mechanism of driving recovery forward? As I understood it we'd let the SM executor do most of the heavy lifting and just add methods to track progress.

I'm thinking: Recoveries don't have conditionals and diverging states, so they don't need fancy state machines. We keep pulling make_progress and savekeep the progress in the db until None is returned and recovery is complete. After that the module can be ClientModuleInit::inited.

elsirion commented 5 months ago

But we don't want a separate high-level struct, i.e. RecoveringClient, right (that's what I first intended, but you had good arguments against that)? So Client would have a map of initialized modules and another with recovering ones?

dpc commented 5 months ago

So Client would have a map of initialized modules and another with recovering ones?

Yeah.

Recovering client module is really not client module. It can't do anything, so all methods would require a conditional returning "I'm not ready" etc. @elsirion

dpc commented 4 months ago

Pasting from a post I made earlier:

I think I'm going to have have the modules use their own database to preserve the recovery state as they please. They can use own one key to store the state and encode/decode it (like in the existing mint recovery code does) if they want, but also store it in multiple keys or do whatever else they want. Passing encodable state back and forth is a drag introducing extra type-system conversions etc and overhead of passing things back and forth. The original mint client recovery was written in the current style because it had to fit the state machines, but this is no longer a constrain. In certain situations it might not even be desirable/possible to fit all the recovery state in the the memory and pass it around.

Then I realized that having make_progress is also kind of limiting and pointless, and I'll just use tokio::sync::watch or something so that recovering module can send updates whenever they feel like it, and don't have to interrupt execution to return "progress". This makes things like pre-fetching, buffering, using streams much easier as the recovery code is just unconstrained future running to completion.

So all in all instead of ClientRecovery::make_progress, the global recovery code will be callingClientModuleInit::recoverwith someRecoveryArgson each start until that future completes without error (completed), which will be recorded in persisted store so on next startClientModulInit::init` can be called instead.

and then @elsirion voicing concerns:

I feel like FM should at least provide a common way/framework to build resilient backup recovery. Maybe we give full control to the module authors to do it themselves, but supply a generic function that lets modules define a recovery state machine and runs it, persisting and announcing progress regularly.

to which I would say that structurally module implementations messing their implementations is not a problem (doesn't block or break anything else in their Client), but indeed I'd love to have helpers for that.

Currently I'm planning to give the fn ModuleInit::recovery some RecoveryArgs very much like InitArgs and other than the normal accessors, it can have one or more "given this single step logic, returning this Encodable state, make the recovery happen correctly"

justinmoon commented 3 months ago

🎉