Blockstream / greenlight

Build apps using self-custodial lightning nodes in the cloud
https://blockstream.github.io/greenlight/getting-started/
MIT License

track: Concurrent signer state updates cause collisions #494

Open cdecker opened 1 month ago

cdecker commented 1 month ago

While testing the upcoming 0.3 release, which includes VLS 0.11.1 to address #431 and #476, we noticed that some signer responses aren't being accepted because they appear to cause concurrent updates to shared state.

The state is split into smaller domains, stored as key-value pairs that can be updated individually. This allows us to attach the signer state to requests early, queue them up, and avoid dependencies between requests. The goal is to make each request and its associated state as self-contained as possible; how well this works depends on how we split the signer state into domains.
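As an illustration, here is a minimal sketch of that scheme: state split into independently versioned key-value domains, with a request carrying a snapshot of only the domains it touches. The names (`SignerState`, `Entry`, `snapshot`, the domain keys) are hypothetical stand-ins, not the actual greenlight/VLS types.

```rust
use std::collections::HashMap;

/// One versioned entry in the signer state. Hypothetical shape, for
/// illustration only; the real greenlight/VLS types differ.
#[derive(Clone, Debug, PartialEq)]
struct Entry {
    version: u64,
    value: Vec<u8>,
}

/// The signer state as independent key-value domains, each versioned on
/// its own, so disjoint domains can be updated without coordinating.
#[derive(Clone, Default)]
struct SignerState {
    domains: HashMap<String, Entry>,
}

impl SignerState {
    /// Snapshot only the domains a request touches, so the request and its
    /// state travel together and requests on disjoint domains stay
    /// independent of each other.
    fn snapshot(&self, keys: &[&str]) -> SignerState {
        SignerState {
            domains: keys
                .iter()
                .filter_map(|k| self.domains.get(*k).map(|e| (k.to_string(), e.clone())))
                .collect(),
        }
    }
}

fn main() {
    let mut state = SignerState::default();
    state.domains.insert("channel/1".into(), Entry { version: 100, value: b"a".to_vec() });
    state.domains.insert("node/secret".into(), Entry { version: 7, value: b"s".to_vec() });

    // A request that only touches "channel/1" carries just that domain.
    let snap = state.snapshot(&["channel/1"]);
    assert_eq!(snap.domains.len(), 1);
}
```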

Our current theory is that the upgrade to VLS 0.11.1 causes some global state to also be updated in a shared domain. A second in-flight request then returns a response that was computed against a stale version of that domain, and the resulting stale state update is rejected.
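A small sketch of that failure mode, assuming a simple optimistic-concurrency rule where an update is accepted only if it was computed against the current version of its domain (`apply_update` and the tuple layout are hypothetical, not greenlight code):

```rust
/// Result of applying a state update produced by a signer response.
#[derive(Debug, PartialEq)]
enum Apply {
    Accepted,
    StaleRejected,
}

/// `current` is the stored (version, value) of a domain; `update` is what a
/// signer computed. Assumes the signer bumps the version of the state it was
/// handed, so an update based on a superseded version fails the check.
fn apply_update(current: &mut (u64, Vec<u8>), update: (u64, Vec<u8>)) -> Apply {
    if update.0 == current.0 + 1 {
        *current = update;
        Apply::Accepted
    } else {
        Apply::StaleRejected
    }
}

fn main() {
    let mut domain = (100, b"a".to_vec());
    // Two requests were prepared concurrently against version 100.
    assert_eq!(apply_update(&mut domain, (101, b"b".to_vec())), Apply::Accepted);
    // The second response is based on the now-stale version and is rejected.
    assert_eq!(apply_update(&mut domain, (101, b"c".to_vec())), Apply::StaleRejected);
}
```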

Roadmap (WIP)

What to do for now?

The issue manifests randomly, but it also resolves itself via a restart. Either let the node be preempted (closing the app and not issuing new commands lets it be preempted after at most 15 minutes), or call stop() to stop the node; the next RPC call will schedule it again.
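A sketch of the workaround sequence, where `NodeClient` and its methods are hypothetical stand-ins rather than the gl-client API; only the order of operations (stop, then any RPC to reschedule) comes from the issue:

```rust
// Hypothetical client handle; not the actual gl-client type.
struct NodeClient;

impl NodeClient {
    fn stop(&self) {
        // Stops the running node. Its state is re-read when it is next
        // scheduled, which clears the stale-state condition.
        println!("node stopped");
    }
    fn get_info(&self) {
        // Any RPC call after stop() schedules the node again.
        println!("node scheduled and responding");
    }
}

fn main() {
    let node = NodeClient;
    node.stop();     // force a restart instead of waiting for preemption
    node.get_info(); // the next call reschedules the node
}
```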

cdecker commented 1 month ago

Instances

roeierez commented 1 month ago

@cdecker when that happened to me the stop command didn't help. Only after waiting some time did the signer go back to normal again

cdecker commented 1 month ago

Yep, that is to be expected: since the gRPC connection is shared, the signer disconnecting may affect the RPC connection.

We have since stopped forwarding the collision errors to the signer, which stops the disconnect loop. It does not fix the underlying issue, in that we cannot return a signer response to CLN if the associated state update failed. However, from what I can tell most (if not all) occurrences are actually no-op collisions (i.e., signer0 sends back state version=100, value=a, and signer1 also sends back version=100, value=a, the same values), in which case we can forward the response to CLN.
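A minimal sketch of that no-op check, assuming the version-plus-value entries from above: if the colliding update carries exactly the same version and value as the one already applied, the collision is harmless and the response can still be forwarded (`Entry` and `is_noop_collision` are illustrative names, not the actual implementation):

```rust
/// A stored state entry: version plus value. Hypothetical shape.
#[derive(Clone, PartialEq, Debug)]
struct Entry {
    version: u64,
    value: Vec<u8>,
}

/// Classify an incoming update that collides with an already-applied one.
/// If both signers computed the same (version, value), the collision is a
/// no-op and the corresponding response can still be forwarded to CLN.
fn is_noop_collision(applied: &Entry, incoming: &Entry) -> bool {
    applied.version == incoming.version && applied.value == incoming.value
}

fn main() {
    let applied = Entry { version: 100, value: b"a".to_vec() };
    let same = Entry { version: 100, value: b"a".to_vec() };
    let diff = Entry { version: 100, value: b"b".to_vec() };
    assert!(is_noop_collision(&applied, &same));  // forward response to CLN
    assert!(!is_noop_collision(&applied, &diff)); // log and investigate
}
```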

We are logging the occurrences and monitoring for any non-no-op collisions; if one shows up, we should also have an idea of where the collision occurs and how to slice the state to avoid it.

Closing this one in a month if no such collisions present themselves.