We have not experienced this much (it might be more relevant if we had a UDP transport) and have no concrete user requests for it -> prioritizing this lower; still aiming for 1.0.0, but maybe not.
See https://github.com/input-output-hk/hydra/issues/612 for an investigation of this loss of liveness.
Longer downtimes (depending on the contestation period protocol parameter) are not covered!
Why?
Can't you just establish a timeout policy on the matter, and Abort the Head after consensus has been unreachable for long enough?
> Can't you just establish a timeout policy on the matter, and Abort the Head after consensus has been unreachable for long enough?
I think we don't count closing the head as "covering network failures".
@ch1bo
> I think we don't count closing the head as "covering network failures".
But that seems like the only option for reacting to a network failure that lasts long enough.
Could that matter be covered by another issue then?
Of course this could be covered by another issue; it's just that this issue was not intended to be about this case. I don't think we have a current item for this though.. maybe also because a timeout on non-progress of a head could be detected & handled by the application running on top of Hydra. Network issues are not the only source of non-progress.
@ch1bo
> Network issues are not the only source of non-progress
Yes, and that possibility is an integral part of consensus.
> maybe also because a timeout on non-progress of a head could be detected & handled by the application running on top of Hydra.
1. I am not sure that this would be a good approach for the API. Such a timeout depends on internal details of the consensus, and Hydra may implement different strategies based on internal information.
2. The upside of client-side halting is that one could easily change such behavior.
3. But you can achieve the same with an option to disable Hydra-side halting and/or a message from the Hydra server when it thinks consensus has timed out (and thus recommends that the client halt).
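To make these options concrete, here is a minimal sketch of such a policy as a configuration type. Everything in it is hypothetical (no `StallPolicy` exists in hydra-node today); it only illustrates the design space described above:

```haskell
import Data.Time.Clock (DiffTime)

-- Hypothetical sketch, not a hydra-node API: the choice between
-- Hydra-side and client-side halting discussed above.
data StallPolicy
  = HaltAfter DiffTime    -- node aborts/closes the head itself once
                          -- consensus has been unreachable this long
  | NotifyAfter DiffTime  -- node only emits a "consensus stalled" output
                          -- and leaves halting to the client application
  | NeverTimeout          -- current behavior: wait indefinitely
  deriving (Eq, Show)
```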
@uhbif19 Bullets 2 and 3 indicate you are thinking about this not only on the "Hydra network" level. Which is fine in itself; the things you mention all have an influence on liveness one way or another.
However, within this feature we want to take one concrete step towards improving the situation by re-submitting Hydra network messages - that is, the L2 protocol for reaching consensus in a Hydra head. This already requires grooming & planning, and some open questions still remain (see "to be discussed" in the original post).
@ch1bo Yes, of course, restricting scope is important, you're right. I just wanted to record my thoughts on this, not necessarily as part of this issue.
We have drawn up a pull-based workflow in a sequence diagram today (@pgrange, please provide more context from your write-up):
```mermaid
sequenceDiagram
    Alice->>A: broadcast msg1
    Alice->>Alice: msg1
    Alice->>A: broadcast msg2
    Alice->>Alice: msg2
    Note over B: start B network stack
    B-->>A: connect
    note left of A: after seeing any message
    A->>Alice: PeerConnected
    note over A: concurrently
    A-->>B: connect
    note over A: readIndex B == 1
    A->>B: Send msg1
    A->>B: Send msg2
    B->>Bob: callback msg1
    B->>A: Ack msg1
    A->>A: readIndex B = 2
    Bob->>Bob: protocol logic
    note over B: crashes
    note over A: detects connection down (how?)
    A-->>B: connect
    note over A: readIndex B == 2
    A->>+B: Send msg2
    B->>Bob: callback msg2
    B->>-A: Ack msg2
    A->>A: readIndex B = 3
    Bob->>Bob: protocol logic
```
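For illustration, here is a minimal Haskell sketch of the resend logic in the diagram above: an append-only log of broadcast messages plus a per-peer read index that only advances on acknowledgement. All names (`Outbound`, `broadcast`, `onConnect`, `onAck`) are hypothetical and not the actual hydra-node networking API; the diagram's 1-based indices become 0-based here.

```haskell
{-# LANGUAGE NamedFieldPuns #-}
module ResendSketch where

import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

type Peer = String
type Msg = String

-- Sender-side state: everything we ever broadcast, plus how far each
-- peer has acknowledged.
data Outbound = Outbound
  { sentLog :: IORef [Msg]              -- append-only log of broadcast messages
  , readIndexes :: IORef (Map Peer Int) -- next message index each peer needs
  }

newOutbound :: IO Outbound
newOutbound = Outbound <$> newIORef [] <*> newIORef Map.empty

-- Record a broadcast message; actual delivery to currently connected
-- peers is elided here.
broadcast :: Outbound -> Msg -> IO ()
broadcast Outbound{sentLog} msg = modifyIORef' sentLog (<> [msg])

-- On (re)connect: resend everything from the peer's read index onwards
-- ("readIndex B == 1" in the diagram).
onConnect :: Outbound -> Peer -> (Msg -> IO ()) -> IO ()
onConnect Outbound{sentLog, readIndexes} peer sendTo = do
  idx <- Map.findWithDefault 0 peer <$> readIORef readIndexes
  msgs <- readIORef sentLog
  mapM_ sendTo (drop idx msgs)

-- On 'Ack': advance the peer's read index ("readIndex B = 2").
onAck :: Outbound -> Peer -> IO ()
onAck Outbound{readIndexes} peer =
  modifyIORef' readIndexes (Map.insertWith (+) peer 1)
```

Note that this gives at-least-once delivery: in the crash branch of the diagram, `msg2` reaches Bob twice, so the receiving side's callback needs to deduplicate (e.g. by message index) or be idempotent.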
https://hackmd.io/c/tutorials/%2Fs%2FMathJax-and-UML#UML-Diagrams supports UML diagrams, including sequence diagrams. I propose we use such a document to collaborate on the design of this networking protocol.
Here is a draft PR with a protocol specification proposal to comment: https://github.com/input-output-hk/hydra/pull/1050
IMO this feature should cover this scenario to be of value to our users: https://github.com/input-output-hk/hydra/pull/1074#pullrequestreview-1646120692 That is:

- `carol` -> See SnapshotConfirmed #1
- `alice` -> TxValid
- `alice` -> expect it to catch up and result in SnapshotConfirmed #2
- `carol` -> Should see SnapshotConfirmed #3
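One possible reading of that scenario as a test sketch. The helpers below are loudly hypothetical stubs (not the hydra-cluster API), and which node goes down between the steps is an assumption:

```haskell
module ScenarioSketch where

type Node = String

-- Stubs only, so the sketch is self-contained and runnable.
restartNode :: Node -> IO ()
restartNode n = putStrLn ("restarting " <> n)

submitTx :: Node -> String -> IO ()
submitTx n tx = putStrLn (n <> ": TxValid " <> tx)

waitForSnapshotConfirmed :: Node -> Int -> IO ()
waitForSnapshotConfirmed n i =
  putStrLn (n <> ": SnapshotConfirmed #" <> show i)

-- A node misses traffic while it is down and must catch up purely
-- from resent network messages.
scenario :: IO ()
scenario = do
  waitForSnapshotConfirmed "carol" 1  -- carol -> See SnapshotConfirmed #1
  submitTx "alice" "tx1"              -- alice -> TxValid
  restartNode "alice"                 -- assumed failure/restart event
  waitForSnapshotConfirmed "alice" 2  -- alice -> catches up to SnapshotConfirmed #2
  waitForSnapshotConfirmed "carol" 3  -- carol -> Should see SnapshotConfirmed #3
```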
Removing #1080 as part of this issue
Why
The Hydra Head becomes stuck very easily, which is bad for user experience. The state machine in the `HeadLogic` does not make progress as it is waiting for some "signal" from peers, or from the chain. This can happen for a wide variety of reasons: when the connection between two `hydra-node`s breaks down, or when one node crashes and restarts. When the Head stalls because of missing responses, it currently needs to be closed & re-opened to continue operation.

This issue specifically wants to address one source of "stalling", namely transient network partitioning between peers.
What
What kind of resilience do we expect:
How
This is a large feature and therefore we want to split it into several deliveries:
Non goals
- Changing the `PeerConnected`/`Disconnected` messages or when we send them
- Handling non-progress originating in the head protocol itself (e.g. `Wait` outcomes)?
- We want to address this in the `Network` layer without touching the `HeadLogic`. In the case of crash-recovery, the `HeadLogic` will come back at the same state it was before, and the only concern is about "in-flight" network messages that might have been lost.
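A minimal sketch of that crash-recovery concern, assuming (hypothetically, this is not the real hydra-node persistence format) that every broadcast message is persisted before it is handed to the network; on restart the node reloads the log and resends whatever peers have not yet acknowledged:

```haskell
module PersistSketch where

import System.Directory (doesFileExist)

-- Hypothetical on-disk log of broadcast messages.
logFile :: FilePath
logFile = "sent-messages.log"

-- Persist every message before handing it to the network layer, so a
-- crash cannot lose it.
persistBeforeSend :: String -> IO ()
persistBeforeSend msg = appendFile logFile (msg <> "\n")

-- After a restart, reload the log and return everything beyond what
-- peers have acknowledged, so "in-flight" messages can be resent.
recoverUnacked :: Int -> IO [String]
recoverUnacked ackedUpTo = do
  exists <- doesFileExist logFile
  if exists
    then drop ackedUpTo . lines <$> readFile logFile
    else pure []
```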