Client state machine failure and retry abstraction

elsirion commented 1 year ago

While working on #3516 and after talking with @joschisan about #3524 I come to the conclusion that we want an abstraction for retry behavior of failures in state machines so that not every one has to implement it in a custom way. I see three options:

Option 1: Trigger futures return Result<T, impl ToString> and any errors returned lead to the SM being marked as failed without an explicit Failed state and the error is saved with it.
- Problems:
- More complexity in how we save states
- Accidentally returning an error can happen easily with ? and it ends the state machine
- Benefits:
- No Failed state in every single state machine anymore
Option 2: Extension of 1, but trigger futures return Result<T, MaybeRetrayble> and the SM executor itself drives the retry behavior. MaybeRetryable is a struct containing an anyhow::Error and a bool flag indicating its retryability. From implementations for common error types can be provided so the succinct ? syntax can be used in most places. For all other cases there should be a retryable and a non-retryable error constructor.
- Problems:
- More complexity in how we save states
- Cleaner abstraction for retryable errors needed for this to work well?
- Only one way of handling retrying -> custom retry behavior ends up in trigger futures again (if ever needed)
- Benefits:
- No Failed state in every single state machine anymore
- No retry implementations in trigger futures
- Executor could halt retries if the client is offline #3496
Option 3: No handling of failures by the executor, but we provide a query strategy or function for infinite retrying, in case of the fn the signature would roughly look like this:
```
fn retry_if_possible<T, E>(
query: impl Fn() -> impl Future<Output = Result<T, E>>,
retry_filter: impl Fn(E) -> bool,
delay: Duration, // Or even some strategy for e.g. exponential backoff
) -> Result<T, E>
```
- Problems:
- Still need a failed state for each SM
- Benefits:
- No custom retry implementations in trigger futures

Out of these three options Option 2 appears to be the best to @joschisan and me, but more input is welcome.

joschisan commented 1 year ago

One thing to add to option 1 and 2 is that methods that use the transaction update stream, like await_tx_accepted, would need a way to notice that the tx submission state machine has entered the failed state. When an update stream is used in a state machines state transition trigger future the state machine would have to enter the failed itself. We could close the stream to signal this but this seems easy to mess up, so we could return a Result instead of TxSubmissionStates to signal that the tx submission state machine has failed here.

dpc commented 1 year ago

Re https://github.com/fedimint/fedimint/issues/3472

dpc commented 1 year ago

I've been thinking about #3472 since it was created, so my PoV is skewed with addressing it, but I think they are parts of the same issue.

I'd start with the fact that since we paid the async tax, we can reap the benefits. Trigger futures are explicitly idempotent and cancellable, so the executor can easily and naturally take more active role in scheduling, as it has a better view of the overall state of system.

The executor can and should (maybe does already) track things like State Machine / State creation time, time it was last attempted, total time it was actually running, etc., so it can make decisions like: "this is newly created, so probably should be driven immediately; while these were running for 30 minutes already, so let's try them in 30 minutes when we have some slots free".

Executor should even actively cancel some futures that were running for too long to make room for other ones, etc. trying to keep a lid on resources used, taking into account which things are urgent, which are not. Basically - we should turn it into a proper scheduler. We can start simple of course, and optimize as get a better idea of what we need. I could imagine even some kind of "scheduling hints" that Futures could set to help the Scheduler handle each State Machine in a use-case optimized ways, priorities, and whole thing. Of course - if it turns out ever necessary. For start we probably need only some ability to sort by some combination of "last tried" "overall age" to prioritize new things and things that probably should get a chance to retry, and then take N of them, and run only that subset.

If the State Machines require permanent failure, they should handle it themselves. Scheduler should treat all returned errors as retriable, and a scheduling hint: "this thing couldn't make progress right now, put it on a backburner, try later, etc. This will allow simple implementations that just try, fail and get restarted in a adaptive way. Futures that want to, can have their own implementation of retries.

Scheduler should expose "runtime stats" (real time duration, total runtime, number of other futures, etc.) in some context struct, and pass them to trigger functions. These context structs will be a required argument to pass to query strategies (more on it later), but also let implementations that require more custom retries to make more informed decisions, by consulting their stats.

To address the #3472 specifically, the "runtime stats" context struct would be passed to each query and the query implementation will use it to avoid subscribing and adjusting timeouts in smarter ways. To recap - "if a LN payment did not work out in the first 30 minutes, there's no point in wasting resources subscribing to notifications; just check and exit immediately". Would be nice if we didn't need to pass this context manually, but I doubt there's a good way to have non-global implicit future-specific context managed by... tokio executor or whatever else. Passing it manually doesn't seem like a big deal.

@joschisan @elsirion

dpc commented 1 year ago

@elsirion Making sure you've seen it.

elsirion commented 1 year ago

Sounds like a cool optimization that can be added later on by making the stats available via the GlobalContext?

dpc commented 1 year ago

Initial minimal implementation would just mean letting trigger futures return Result, treat Err as "retry me later", and implement some form of "exponential sleep & retry".

Next small step would be to maybe to put some medium-long timeout on every future, just to prevent them all from holding to jsonrpc slots for too long.

Next small step would be this context struct and passing it to query strategies to auto-cap (turn off) waiting and turn it into polling. But now that I wrote it - maybe it's unnecessary. Maybe deadline on the whole future runtime is enough at all times anyway.

Then in distant future the basic "exponential sleep & retry" could become smarter.

fedimint / fedimint

Client state machine failure and retry abstraction #3530