fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License
571 stars 223 forks source link

Client state machine failure and retry abstraction #3530

Open elsirion opened 10 months ago

elsirion commented 10 months ago

While working on #3516 and after talking with @joschisan about #3524 I come to the conclusion that we want an abstraction for retry behavior of failures in state machines so that not every one has to implement it in a custom way. I see three options:

Out of these three options Option 2 appears to be the best to @joschisan and me, but more input is welcome.

joschisan commented 10 months ago

One thing to add to option 1 and 2 is that methods that use the transaction update stream, like await_tx_accepted, would need a way to notice that the tx submission state machine has entered the failed state. When an update stream is used in a state machines state transition trigger future the state machine would have to enter the failed itself. We could close the stream to signal this but this seems easy to mess up, so we could return a Result instead of TxSubmissionStates to signal that the tx submission state machine has failed here.

dpc commented 10 months ago

Re https://github.com/fedimint/fedimint/issues/3472

dpc commented 10 months ago

I've been thinking about #3472 since it was created, so my PoV is skewed with addressing it, but I think they are parts of the same issue.

I'd start with the fact that since we paid the async tax, we can reap the benefits. Trigger futures are explicitly idempotent and cancellable, so the executor can easily and naturally take more active role in scheduling, as it has a better view of the overall state of system.

The executor can and should (maybe does already) track things like State Machine / State creation time, time it was last attempted, total time it was actually running, etc., so it can make decisions like: "this is newly created, so probably should be driven immediately; while these were running for 30 minutes already, so let's try them in 30 minutes when we have some slots free".

Executor should even actively cancel some futures that were running for too long to make room for other ones, etc. trying to keep a lid on resources used, taking into account which things are urgent, which are not. Basically - we should turn it into a proper scheduler. We can start simple of course, and optimize as get a better idea of what we need. I could imagine even some kind of "scheduling hints" that Futures could set to help the Scheduler handle each State Machine in a use-case optimized ways, priorities, and whole thing. Of course - if it turns out ever necessary. For start we probably need only some ability to sort by some combination of "last tried" "overall age" to prioritize new things and things that probably should get a chance to retry, and then take N of them, and run only that subset.

If the State Machines require permanent failure, they should handle it themselves. Scheduler should treat all returned errors as retriable, and a scheduling hint: "this thing couldn't make progress right now, put it on a backburner, try later, etc. This will allow simple implementations that just try, fail and get restarted in a adaptive way. Futures that want to, can have their own implementation of retries.

Scheduler should expose "runtime stats" (real time duration, total runtime, number of other futures, etc.) in some context struct, and pass them to trigger functions. These context structs will be a required argument to pass to query strategies (more on it later), but also let implementations that require more custom retries to make more informed decisions, by consulting their stats.

To address the #3472 specifically, the "runtime stats" context struct would be passed to each query and the query implementation will use it to avoid subscribing and adjusting timeouts in smarter ways. To recap - "if a LN payment did not work out in the first 30 minutes, there's no point in wasting resources subscribing to notifications; just check and exit immediately". Would be nice if we didn't need to pass this context manually, but I doubt there's a good way to have non-global implicit future-specific context managed by... tokio executor or whatever else. Passing it manually doesn't seem like a big deal.

@joschisan @elsirion

dpc commented 10 months ago

@elsirion Making sure you've seen it.

elsirion commented 10 months ago

Sounds like a cool optimization that can be added later on by making the stats available via the GlobalContext?

dpc commented 10 months ago

Initial minimal implementation would just mean letting trigger futures return Result, treat Err as "retry me later", and implement some form of "exponential sleep & retry".

Next small step would be to maybe to put some medium-long timeout on every future, just to prevent them all from holding to jsonrpc slots for too long.

Next small step would be this context struct and passing it to query strategies to auto-cap (turn off) waiting and turn it into polling. But now that I wrote it - maybe it's unnecessary. Maybe deadline on the whole future runtime is enough at all times anyway.

Then in distant future the basic "exponential sleep & retry" could become smarter.