cowprotocol / services

Off-chain services for CoW Protocol
https://cow.fi/
Other
142 stars 61 forks source link

feat: Implement basic circuit breaker in autopilot #2667

Open fleupold opened 2 months ago

fleupold commented 2 months ago

Problem

Solvers that are running their own driver, could theoretically submit solutions outside of the competition (simply calling settle with a bunch of signed orders). In this case the duration for which a malicious solver can continue this behavior is critical for the damage that gets accrued (the more orders settled at limit price, the higher the foregone surplus).

It's therefore important for the protocol to run an automated and diverse set of "watch dogs" that monitor misbehaviour and "circuit break" a malicious solvers from continuing damage.

Suggested solution

Implement basic competition checks in the autopilot upon indexing settlements and allow the autopilot to be configured with an account that can removeSolver from the current allow list implementation.

Alternatives considered

Fully rely on external systems (e.g. the EBBO check, a new standalone circuit breaker tool) for this task. However, we believe that having multiple independent implementations of this crucial logic is important to ensure robust workings of the system. Moreover, the autopilot is still taking a central role in arbitrating auctions and thus is in a good position to enforce and potentially halt the issuance of new auctions in case something goes wrong.

We also already have reliable and reorg-resistant indexing of all settlements, which should™️ make adding this logic a fairly contained task (compared to building a new standalone system from scratch). This is not to say that we shouldn't have multiple independent services that can perform this task in the future.

Additional context

Since the barn and production environment operate on separate auction but share the same settlement contract instance, we probably cannot simply audit all settlement events that are emitted and ban if a settlement is emitted "out of competition". If we did, the prod autopilot would have to check the barn autopilot if a settlement satisfied its competition and vice versa. Moreover, there are some legit out of competition settlements (e.g. for periodic fee withdrawals). I'd suggest configuring each driver with a set of allow-listed solver addresses it tracks settlements for (and ignores all other settlements). It's then up to the autopilot configuration to make sure all allow-listed solver addresses are covered. Monitoring all settlements and querying both the production and barn orderbook API can be left for the EBBO circuit breaker.

The current allow list contract is located at https://etherscan.io/address/0x2c4c28DDBdAc9C5E7055b4C863b72eA0149D8aFE with https://app.safe.global/balances?safe=eth:0xA03be496e67Ec29bC62F01a428683D7F9c204930 being the owner.

With the help of Zodiac, we can enable a Roles Modifier Module on the Safe that allows an arbitrary set of accounts to perform specific operations on behalf of the Safe without requiring additional signatures for them signers. This tx creates a role allowing a single EOA to remove solvers from the allow list and this tx executes a removal on behalf of the multi-sig on Sepolia. The latter is what we would expect the autopilot to perform in case wrong-doing is detected.

Acceptance criteria

Every driver that is available in the autopilot should be registered with their public address for solution submission. For those addresses, whenever a settlement is observed the autopilot checks:

  1. The settlement references an auction ID
  2. The address that submitted the settlement won this auction
  3. Exactly the orders that were referenced in the winning solution got settled and matched at the prices indicated in the solution
  4. Pre and post interactions found in the settlement are a subset of the pre/post interactions of the matched orders

If any of those checks fails, the autopilot should send a transaction to remove the affected solver from the allow list and send an alert.

sunce86 commented 1 month ago

I'd suggest configuring each driver with a set of allow-listed solver addresses it tracks settlements for (and ignores all other settlements)

Instead I would do the following:

  1. During competition time, when we have current auction available, consider only solutions with orders that either
    • are part of the auction
    • not part of the auction but also not existent in the db (JIT)
  2. During competition, for winning solution, save Quality to db so it can be checked in postprocessing. Quality means EncodedSettlement metadata saved to db (tokens, clearing prices, trades). Instead of saving everything we could hash it and compare hashes later (I think we can assume solvers won't deliver better solution than promised because why would they do that instead of pocketing the diff themselves).

This way we make sure both checks are true: solver did not settle orders outside competition but also with promised quality.

MartinquaXD commented 1 month ago

Instead of saving everything we could hash it and compare hashes later (I think we can assume solvers won't deliver better solution than promised because why would they do that instead of pocketing the diff themselves).

I'm worried that using just hashes is too unstable. I think given that settlements are usually not very big we should start by storing all necessary data unhashed and only later optimize for space if this actually becomes a concern. Also when there is a discrepancy between the executed and expected value it would already be easier to analyse what went wrong which should help with recovering damages.

sunce86 commented 1 month ago

In that case, if we don't want to hash but instead save full metadata, I would suggest saving dto representation of domain settlement object: https://github.com/cowprotocol/services/blob/main/crates/autopilot/src/domain/settlement/mod.rs#L22. This object contains subset of encoded settlement data that autopilot is interested in: it can derive surplus/fee/score based on it. If we ever add a new field autopilot is interested in, it will be part of this struct.

This object is also easily recoverable from plain calldata that is provided by /reveal/ driver endpoint.

Edit: we have a settlement_call_data database table where we save calldata of the winning solution, so we might as well fetch it from there (and convert to Settlement) instead of saving individual fields.

fleupold commented 1 month ago

I'd suggest configuring each driver with a set of allow-listed solver addresses it tracks settlements for (and ignores all other settlements)

Instead I would do the following: [...] This way we make sure both checks are true: solver did not settle orders outside competition but also with promised quality.

I still think we need to configure each autopilot with a list of solver addresses it needs to "observe" as otherwise a solver could just return an empty solution and still settle out of competition (without specifying an auction ID). By only adding solver addresses to the on-chain allow list that are tracked by at least one autopilot (barn vs. prod) we make sure every solver is being observed.

All of the promised solution data is already stored in the competitions table (we may have to refactor this one to be more "stable" as it was considered a pure debugging table in the past).

Wrt /reveal I think we can even consider deleting this endpoint altogether. It was used to ensure we have the uninternalised calldata of the winning solution (to make sure buffer trading is "ok"), but since recently we treat buffer trading as slippage and thus don't require proof that the trading was indeed possible. The only advantage I see from /reveal is that it could simplify implementing https://github.com/cowprotocol/contracts/issues/84 by simply signing the entire calldata (instead of creating a new struct), but I'm not sure the extra round trip is worth it.

sunce86 commented 1 month ago

I still think we need to configure each autopilot with a list of solver addresses it needs to "observe" as otherwise a solver could just return an empty solution and still settle out of competition (without specifying an auction ID). By only adding solver addresses to the on-chain allow list that are tracked by at least one autopilot (barn vs. prod) we make sure every solver is being observed.

I still don't understand why we need this configuration. Can you elaborate this case further as I don't see how we would be unable to detect it?

without specifying an auction ID

This is a slashable violation itself. We would call removeSolver on this violation.

fleupold commented 1 month ago

I still don't understand why we need this configuration. Can you elaborate this case further as I don't see how we would be unable to detect it?

We are running two autopilots, barn and prod. How would the barn autopilot be able to "check" that a settlement of the prod autopilot is ok (it requires access to the prod db or API)? Therefore it must either ignore settlements from a certain list of addresses (knowing that some other component checks it) or only focus on settlements coming from a certain list of addresses. As long as we make sure all addresses that are allow listed to solve are covered by at least one autopilot this is fine (we can have another more generalized backup circuit breaker that the solver team is working on which checks all settlements based on API access).

Note, that there are also fee withdrawals which create Settlement events and shouldn't lead to removals, e.g.: https://etherscan.io/tx/0x54f65d217e31cc07d759f3c23b70c564bd906bad942f4b813ce56971cc6854d7/advanced#eventlog

sunce86 commented 1 month ago
  1. Pre and post interactions found in the settlement are a subset of the pre/post interactions of the matched orders

Pre/post interactions are not a part of promised solution (solution returned on /solve during competition). Does this mean we need to add interactions to this response? Is that what we want to compare here? Or do we want to compare with pre/post interactions in the order database table?

fleupold commented 1 month ago

Correct, pre and post interactions are available to the autopilot via the user order. In the first phase (where we are simply circuit breaking whenever we see a bad settlement) we should be able to assert that those interactions were part of the final settlement.

Later, when we want to sign off on the solution things may be a bit more complicated (the autopilot can sign off on a subset of interactions as solvers are allowed to add additional pre/post interactions to their settlements).

MartinquaXD commented 1 month ago

Later, when we want to sign off on the solution things may be a bit more complicated

This basically brings us back to needing the /reveal endpoint, no? Otherwise solvers either have to produce the complete executable solution on /solve which MMs don't like because they only want to hedge risks if they win or we blindly trust that solvers actually end up executing the pre- and post-interactions they promised in /solve.

Until the autopilot actually signs the winning calldata I don't think we need to have the pre- and post-interactions in the /solve endpoint. Whatever is returned there is not be trusted anyway and it's reasonable to expect solvers to include these custom interactions in their solution. Since we are able to recover custom interactions for orders only with the calldata (by checking our DB with the settled order uids) I don't think we need to add the info in the /solve endpoint. Or am I missing something here?

fleupold commented 1 month ago

This basically brings us back to needing the /reveal endpoint, no?

I don't think so. The autopilot can still sign off on the list of pre and post interactions it knows from the orders (without knowing anything from the driver), as long as the driver makes sure they encode that list first (or last). This way in the smart contract checking the settlement transaction, we can just look at the first n pre/post interactions of the solution and use that subset to re-compute the commitment and verify the autopilot signature.

This might be a bit too technical for our current issues, so I wouldn't discuss this in detail on this issue.