Rework the agent's state machine

fredo commented 2 years ago

I have been bothered for a long time by the way the state machine in the agent is written. The recent mainnet tests have shown that this is a fundamental problem: errors pop up when a large set of events is processed on sync.

Currently we are patching those issues one by one, but this is like patching a flat tire again and again instead of solving the problem fundamentally. I'm convinced that we will see problems in event processing again in the future, which we will then have to patch on a case-by-case basis. From a developer's point of view, that is a horrible thing to see.

I'd like to propose rewriting the agent's state machine so that we fix it fundamentally. I see several downsides in the current design:

Downsides

Overlapping of state machines

The agent actually serves two different roles in the system: liquidity provider (LP) and watcher (W). Some of their actions are similar, but the general concepts differ: providing liquidity is fundamentally different from watching out for invalid claims. Currently we have one state machine to rule them all, which causes several problems. For each state change or output generation we may have to determine which role the agent plays for that event. It might even be impossible to tell from the event alone whether the agent is acting as LP or W. Which brings me to the next problem.

Lack of state

As if that wasn't enough, we also merge the contract states of different sources (chains) and keep only what we think we will need. This results in two major problems. First, we may have to wait for a slow source, which blocks the processing of other events, even though those events might generate useful contract state for other objects (example: invalidation events are of interest for multiple claims. What if one claim has not been seen yet due to a slow source chain?). Second, combining contract state and state machines into one system leads to a nearly infinite number of combinations which we have to take care of when processing state. Especially on sync, as we have seen, combinations of events have popped up which the agent cannot handle.

Reordering of events

Because we hold back events that cannot be processed yet, we effectively reorder them, which the agent's state machine does not always anticipate.

State changes

Another thing I realized is that state changes are triggered from multiple places in the code base. Besides events triggering state changes, we occasionally transition state during output generation (process_requests and process_claims). Although in theory this shouldn't be a problem, transitioning state from arbitrary places seems to me a potential risk and adds complexity. A more consistent way would be to create internal events that transition the state in the next iteration.
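To illustrate the internal-event idea, here is a minimal Python sketch (Python 3.9+ assumed). It is not the agent's actual code: RequestState, FillConfirmed and this Agent class are hypothetical names, and only process_requests is borrowed from the existing code base. Output generation only queues an event, and a single method applies transitions in the next iteration.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum, auto


class RequestState(Enum):
    # Hypothetical, heavily reduced set of states.
    PENDING = auto()
    FILLED = auto()


@dataclass(frozen=True)
class FillConfirmed:
    """Hypothetical internal event queued by output generation."""
    request_id: int


class Agent:
    def __init__(self) -> None:
        self.states: dict[int, RequestState] = {}
        self.internal_events: deque[FillConfirmed] = deque()

    def process_requests(self) -> None:
        # Output generation never mutates state directly;
        # it only records what should happen next.
        for request_id, state in self.states.items():
            if state is RequestState.PENDING:
                self.internal_events.append(FillConfirmed(request_id))

    def process_events(self) -> None:
        # The single place where state transitions are applied.
        while self.internal_events:
            event = self.internal_events.popleft()
            self.states[event.request_id] = RequestState.FILLED
```

With this split, a call to process_requests leaves all states untouched, and the transition only becomes visible after the following process_events call.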

Potential solutions

Separating contract state from roles (state machines)

For one, we could easily keep the contract state separate. The internal contract state would just be a representation of the contract state up to the last processed event. By not combining the states we can consume any event as it arrives (events typically arrive in order). Thus we reconstruct the state of the contracts, and the state machines can read from it elsewhere to decide what to do with this information. The advantage is that we no longer need to hold back events in order to synchronize multiple chains.
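A rough sketch of what a separated contract state could look like, in Python 3.9+. Everything here is an assumption made for illustration: the ChainEvent fields, the event names ("request_created", "claim_made", "invalidation") and the class names do not come from the existing code. Each chain gets its own mirror, so an event from one chain is applied immediately, no matter how far behind another chain is.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChainEvent:
    """Hypothetical event shape; field names are assumptions."""
    chain_id: int
    block_number: int
    name: str
    object_id: int


@dataclass
class ContractState:
    """Mirror of one chain's contracts up to the last processed event."""
    chain_id: int
    last_block: int = -1
    requests: set[int] = field(default_factory=set)
    claims: set[int] = field(default_factory=set)
    invalidations: set[int] = field(default_factory=set)

    def apply(self, event: ChainEvent) -> None:
        # Ordering is only enforced per chain, never across chains.
        if event.block_number < self.last_block:
            raise ValueError("events of one chain must arrive in order")
        targets = {
            "request_created": self.requests,
            "claim_made": self.claims,
            "invalidation": self.invalidations,
        }
        if event.name not in targets:
            raise ValueError(f"unknown event: {event.name}")
        targets[event.name].add(event.object_id)
        self.last_block = event.block_number


class ContractStates:
    """One independent ContractState per chain."""

    def __init__(self) -> None:
        self.by_chain: dict[int, ContractState] = {}

    def apply(self, event: ChainEvent) -> None:
        state = self.by_chain.setdefault(
            event.chain_id, ContractState(event.chain_id)
        )
        state.apply(event)
```

In this sketch an invalidation seen on one chain is recorded even if the matching request or claim on another chain has not been seen yet; deciding what to do about it is left to the role-specific state machines.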

Separate state machines for different roles

In the current version we implement state machines for request and claim objects. But thinking about it, request and claim objects are only partial state of a greater state machine, one that would semantically fit better to the question: what does the role do?

Liquidity provider

As I see it, the liquidity provider is serving transfers (I think we unintentionally tailored a transfer's state machine onto the request object). Each interaction with the chain has something to do with a transfer from the LP's perspective. In contrast to the current state machines, requests and claims are only partial information about a transfer. I have not spec'd it out yet, but I imagine a transfer state machine pointing to the corresponding request information and the information of one or more claims.
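Since this is not spec'd out, the following Python 3.9+ sketch is only one possible shape. The state names, the allowed transitions and all class and field names are assumptions. The point it tries to show is that the transfer owns the state machine, while request and claims are plain information it points to.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class TransferState(Enum):
    # Hypothetical states; the real set still has to be specified.
    OPEN = auto()
    FILLED = auto()
    CLAIMED = auto()
    WITHDRAWN = auto()


# Hypothetical transition table: current state -> allowed next states.
ALLOWED = {
    TransferState.OPEN: {TransferState.FILLED},
    TransferState.FILLED: {TransferState.CLAIMED},
    TransferState.CLAIMED: {TransferState.WITHDRAWN},
    TransferState.WITHDRAWN: set(),
}


@dataclass(frozen=True)
class RequestInfo:
    request_id: int
    amount: int


@dataclass(frozen=True)
class ClaimInfo:
    claim_id: int
    request_id: int


@dataclass
class Transfer:
    """LP-side state machine pointing to request and claim information."""
    request: RequestInfo
    claims: list[ClaimInfo] = field(default_factory=list)
    state: TransferState = TransferState.OPEN

    def add_claim(self, claim: ClaimInfo) -> None:
        if claim.request_id != self.request.request_id:
            raise ValueError("claim belongs to a different request")
        self.claims.append(claim)

    def transition(self, target: TransferState) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.name} -> {target.name} not allowed")
        self.state = target
```

Keeping the transition table in one place also gives a single spot to review when the state machine is specified properly.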

Watcher

The watcher only cares about invalid claims, and its goal is to win a challenge. So what the watcher is actually concerned with are challenges. A challenge contains the information of the corresponding claim, but I would separate the claim state from the challenge state machine, in a similar way as for the LP and transfers.
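A comparable sketch for the watcher, again in Python and again with invented names: ChallengeState, its members and the `valid` flag are assumptions (how validity is actually determined is out of scope here). The claim is immutable information, and only the challenge carries watcher-specific state.

```python
from dataclasses import dataclass
from enum import Enum, auto


@dataclass(frozen=True)
class ClaimInfo:
    """Plain claim information, e.g. read from the contract state."""
    claim_id: int
    claimer: str
    valid: bool  # assumed to be determined elsewhere


class ChallengeState(Enum):
    # Hypothetical states of the watcher's state machine.
    PENDING = auto()
    IGNORED = auto()
    CHALLENGED = auto()
    WON = auto()
    LOST = auto()


@dataclass
class Challenge:
    """Watcher-side state machine that points to a claim."""
    claim: ClaimInfo
    state: ChallengeState = ChallengeState.PENDING

    def evaluate(self) -> None:
        # Only invalid claims are worth challenging.
        if self.state is not ChallengeState.PENDING:
            return
        if self.claim.valid:
            self.state = ChallengeState.IGNORED
        else:
            self.state = ChallengeState.CHALLENGED

    def settle(self, watcher_won: bool) -> None:
        if self.state is not ChallengeState.CHALLENGED:
            raise ValueError("only an open challenge can be settled")
        self.state = ChallengeState.WON if watcher_won else ChallengeState.LOST
```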

Open questions

It remains to be seen how we actually interact with the state machines if events only change state in the internal contract state representation. We could also pass the events to the corresponding state machines, but then we would again need to take care of the problems mentioned above, such as the ordering of events due to slow chains. Another idea could be to have some internal event passing between the contract state and the state machines.
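One way the internal event passing could look is a small publish/subscribe layer between the contract state and the role state machines. This is only a sketch of that idea in Python 3.9+; ContractStateBus, the topic strings and the handler signature are all made up for illustration.

```python
from collections import defaultdict
from typing import Callable

# A handler receives the id of the object that changed (an assumption).
Handler = Callable[[int], None]


class ContractStateBus:
    """Hypothetical notifier sitting between contract state and roles."""

    def __init__(self) -> None:
        self._handlers: defaultdict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, object_id: int) -> None:
        # Called by the contract state after it has applied an event.
        for handler in self._handlers.get(topic, []):
            handler(object_id)


# Usage: each role subscribes only to what it cares about.
lp_seen: list[int] = []
watcher_seen: list[int] = []

bus = ContractStateBus()
bus.subscribe("request_created", lp_seen.append)
bus.subscribe("claim_made", lp_seen.append)
bus.subscribe("claim_made", watcher_seen.append)

bus.publish("request_created", 1)
bus.publish("claim_made", 5)
```

Here the LP is notified about both the request and the claim, while the watcher only hears about the claim. Whether such notifications should be delivered immediately or queued for the next iteration is exactly the kind of question that still needs to be answered.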

Additionally, it would be nice to have a clearly defined way in which state changes are created and processed.

It also remains to be seen how event processing in the state machines of the different roles happens, and whether the problems mentioned above still occur and need to be addressed.

Overall I think we now have much more knowledge than 8 months ago and can use it to write a better state machine.

istankovic commented 2 years ago

It also remains to be seen how event processing in the state machine of the different roles happen and if the mentioned problems above still occur and need to be addressed.

Yeah, I too think this will probably be one of the bigger concerns here. One thing we could (should?) do is sketch state machines for both the watcher and the liquidity provider, assuming we already have CSR(-layer) working.

One thing I think we should keep in mind is #654: we should design this so that adding support for multiple chain pairs is trivial.

fredo commented 2 years ago

This seems to be a bigger issue, with preparation work (specifying the design and the state machines) needed before implementation. Taking the forecast for the next month and the vacation schedule into account, I think this would be a nice piece of work to do at the retreat. I imagine a workshop like the ones that took place in Mainz in the past. @czepluch what do you think about this?

fredo commented 2 years ago

We will incorporate #1068 here.

czepluch commented 2 years ago

I think the idea of a workshop is a very good idea. It might be interesting to do it on the retreat? Preferably sooner if you can find the time, but I doubt it since there is a lot of vacation going on until then.

fredo commented 2 years ago

I think the idea of a workshop is a very good idea. It might be interesting to do it on the retreat? Preferably sooner if you can find the time, but I doubt it since there is a lot of vacation going on until then.

I added it to the retreat agenda.

beamer-bridge / beamer