Network resilience to disconnects

ch1bo commented 2 years ago

Why

The Hydra Head becomes stuck very easily which is bad for user experience. The state machine in the HeadLogic does not make progress as it is waiting from some "signal" from peers, or from the chain. This can happen from a wide variety of reasons: When the connection between two hydra-nodes breaks down, or when one node crashes and restarts. When the Head stalls because of missing responses, it currently needs to be closed & re-opened to continue operation.

This issue specifically wants to address one source of "stalling", namely the transient network partitioning between peers.

What

What kind of resilience do we expect:

When a node fails and never restart, one of the peers must close the head "manually": You can't recover from permanent failures 😛
1. ✅ If a node is disconnected from the other peers and reconnects (network partition failure mode), the head should still be live
2. 🔴 When a node fails and then restart (crash-recovery failure mode) the head should still be live
3. 🔴 When a node starts to misbehave (eg. Byzantine failure mode), the head should be closed manually: We don't provide any BFT guarantee, as we assume it's always possible to close the head with latest known snapshot.

How

This is a large feature and therefore we want to split it in several deliveries:

[x] #1082
[x] #1074

Non goals

Prevent submitting new transactions while not connected to all (configured?) peers
- This is requiring the (much harder) understanding of "being connected"
- Is this a problem or not? A head can only progress through snapshots, but in the case of a peer being disconnected, a snapshot can be pending for a while and we can accumulate unconfirmed txs
- It's probably an application/client concern: A client can know that a snapshot hasn't been produced for a while and even not use the PeerConnected/Disconnected messages we send them
What about the HeadLogic, will it loop on old message requests? How long to re-enqueue messages (also applies to Wait outcomes)?
- We realised that we can handle this at the Network layer without touching the HeadLogic. In the case of crash-recovery, the HeadLogic will come back at the same state it was before, and the only concern is about "in-flight" network messages that might have been lost
- We also realised that the original protocol from the paper was not using broadcast but unicast (or multicast for some messages) and had forms of acknowledgement for both tx and snapshots. Using broadcast allowed to simplify the protocol, assuming everyone received every message and therefore could 1/ not need tx confirmation and 2/ compute snapshot locally, but it introduced the inability to acknowledge messages to peers
Can we use model tests with fuzziness to test this? (chaos monkeys) -> Yes

ch1bo commented 2 years ago

We have not experienced this much (might be more relevant if we would have UDP transport) and have no concrete user requests for this -> prioritize lower, still aiming 1.0.0 but maybe not.

ch1bo commented 1 year ago

See https://github.com/input-output-hk/hydra/issues/612 for an investigation of this loss of liveness.

uhbif19 commented 1 year ago

Longer down times (depending on the contestation period Hydra protocol parameter) are not covered!

Why?

Cannot you just establish timeout policy on the matter, and Abort Head after consensus unreachable long enough?

ch1bo commented 1 year ago

Cannot you just establish timeout policy on the matter, and Abort Head after consensus unreachable long enough

I think we don't count closing the head as "covering network failures".

uhbif19 commented 1 year ago

@ch1bo

I think we don't count closing the head as "covering network failures".

But that seems like the only option to react to network failure long enogh.

Could that matter be covered by other issue when?

ch1bo commented 1 year ago

Of courses could this be covered by another issue, it's just that this issue was not intended to be a out this case. I don't think we have a current item for this though.. maybe also because that time out of non-progress of a head could be detected & handled by the application running on top of Hydra. Network issues are not the only source of non-progress

uhbif19 commented 1 year ago

@ch1bo

Network issues are not the only source of non-progress

Yes, and that possibility is integral part of consensus.

maybe also because that time out of non-progress of a head could be detected & handled by the application running on top of Hydra.

I am not sure that this will be good approach for API. I think such timeout is dependent on internal details on consensus and Hydra may implement different strategies on that depending on some internal information.

That requires Hydra client to have some state & timers, while it can be completely stateless otherwise. Also in such case you should somehow state it in docs, for clients to know that they should handle such cases.
Timeouts before calling Halting for such client will depend tx time for on L1 and their finality criteria used by Hydra. And how many tx on L1 is performed too, which may change with Hydra releases. So client may need to change such timeouts at least for different networks and Hydra releases.
You handle L1 rollbacks. I think that affects timeouts as well, while client does not know if rollbacks happened.
Hydra consensus may change in future changing this behavior as well.

The upside client-side Halting is that one could easily change such behavior. But you can achieve same with option to disable Hydra-side halting and/or to provide message if Hydra server thinks consensus reached timeout (and thus recommends client to perform Halting).

ch1bo commented 1 year ago

@uhbif19 Bullets 2 and 3 indicate you are thinking about this not only on the "Hydra network" level. Which is fine itself, the things you mention all have an influence in liveness one way or another.

However, within this feature, we want to take one concrete step into improving the situation by re-submitting Hydra network messages - that is, the L2 protocol for reaching consensus in a Hydra head. This already requires grooming & planning and some open questions still remain (see "to be discussed" in original post)

uhbif19 commented 1 year ago

@ch1bo Yes, of course, restricting scope is important, you right. I just wanted to record my thoughts on this, not necessarily as part of this issue.

ch1bo commented 1 year ago

We have drawn up a pull-based workflow in a sequence diagram today (@pgrange please provide more context from your write-up)

sequenceDiagram
    Alice->>A: broadcast msg1
    Alice->>Alice: msg1
    Alice->>A: broadcast msg2
    Alice->>Alice: msg2

    Note over B: start B network stack
    B-->>A: connect
    note left of A: after seeing any message
    A->>Alice: PeerConnected

    note over A: concurrently
    A-->>B: connect
    note over A: readIndex B == 1
    A->>B: Send msg1
    A->>B: Send msg2

    B->>Bob: callback msg1
    B->>A: Ack msg1
    A->>A: readIndex B = 2
    Bob->>Bob: protocol logic
    note over B: crashes

    note over A: detects connection down (how?)

    A-->>B: connect
    note over A: readIndex B == 2
    A->>+B: Send msg2
    B->>Bob: callback msg2
    B->>-A: Ack msg2
    A->>A: readIndex B = 3
    Bob->>Bob: protocol logic

abailly-iohk commented 1 year ago

https://hackmd.io/c/tutorials/%2Fs%2FMathJax-and-UML#UML-Diagrams provides UML diagrams, include sequence diagrams. I propose we use such a document to collaborate on the design of this networking protocol.

pgrange commented 1 year ago

Here is a draft PR with a protocol specification proposal to comment: https://github.com/input-output-hk/hydra/pull/1050

ch1bo commented 11 months ago

IMO this feature should cover this scenario to be of value to our users: https://github.com/input-output-hk/hydra/pull/1074#pullrequestreview-1646120692 That is:

Start the demo devnet with 3 participants and open a head
Submit a transaction via carol -> See SnapshotConfirmed #1
Stop node of alice
Submit another transaction via carol -> See only TxValid
Start node of alice -> expect it to catch up and result in SnapshotConfirmed #2
Submitting yet another transaction by carol -> Should see SnapshotConfirmed #3

abailly-iohk commented 11 months ago

Removing #1080 as part of this issue

cardano-scaling / hydra

Network resilience to disconnects #188

Why

What

How

Non goals