Here is my proposal that I tried to implement at some point but did not follow up on:

1. Record a timestamp whenever a protocol completes successfully. Initially we can do this for all protocols, for simplicity.
2. If no protocol has succeeded within the last 30 minutes (this should be good enough for making sure rollovers are picked up again), the taker actively drops the connection.

In case we actually run into problems with a maker not allowing a taker to connect twice under this logic, we should notice. But given that the taker actively drops the connection first, this should really not happen.
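A minimal sketch of the bookkeeping the proposal above would need (type and method names are invented for illustration, not the actual itchysats types):

```rust
use std::time::{Duration, Instant};

/// Illustrative only: tracks when the last protocol succeeded so the
/// taker can decide to actively drop a stale connection.
struct ConnectionHealth {
    last_success: Instant,
    max_idle: Duration,
}

impl ConnectionHealth {
    fn new() -> Self {
        Self {
            last_success: Instant::now(),
            // 30 minutes, as proposed above.
            max_idle: Duration::from_secs(30 * 60),
        }
    }

    /// Called whenever any protocol completes successfully.
    fn record_success(&mut self) {
        self.last_success = Instant::now();
    }

    /// Polled periodically by the taker; `true` means drop the
    /// connection and reconnect.
    fn should_drop(&self) -> bool {
        self.last_success.elapsed() > self.max_idle
    }
}
```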
After working on the proposal, it is unfortunately not as simple as expected.

Recording success is hard because we cannot just do it in the endpoint; each protocol actor would have to record it individually. I was working on a `protocol_health` actor that each protocol actor could notify upon success, but that is very intrusive.
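To make the intrusiveness concrete: every protocol actor would have to carry a handle to the `protocol_health` actor and remember to report at the right spot. Roughly like this (a sketch using plain tokio channels rather than the actual actor framework):

```rust
use tokio::sync::mpsc;

/// Event sent to the hypothetical protocol_health actor.
enum HealthEvent {
    ProtocolSucceeded { protocol: &'static str },
}

/// Every protocol actor needs to own one of these and call
/// `report_success` in the right place, which is what makes the
/// approach so intrusive.
#[derive(Clone)]
struct HealthHandle {
    tx: mpsc::UnboundedSender<HealthEvent>,
}

impl HealthHandle {
    fn report_success(&self, protocol: &'static str) {
        // If the health actor is gone we are shutting down anyway.
        let _ = self.tx.send(HealthEvent::ProtocolSucceeded { protocol });
    }
}
```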
It might be better to again consider a more pragmatic solution and just drop the connection if we fail to open substreams.
I think it would be much easier to implement something based on the particular error that we get. Some errors should not lead to dropping a connection, no matter how frequently they occur, whereas others might be able to tell us immediately that the connection is broken and needs to be re-established.

It would be really interesting to know exactly what error we get from yamux. In https://github.com/itchysats/itchysats/pull/2566#discussion_r931691524 I suggested a couple of candidates that could warrant dropping the connection.
Does this line provide the relevant information?
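To sketch what such error-based handling could look like: the enum below is a made-up placeholder, and the real decision table would have to be written against the actual `yamux` error variants once we know which ones we observe.

```rust
use std::io;

/// Placeholder for errors observed when opening a substream; the real
/// type to match on would be yamux's connection error.
enum SubstreamError {
    /// The socket reported an I/O error: the connection is almost
    /// certainly broken, drop and reconnect immediately.
    Io(io::Error),
    /// The remote is at its stream limit: transient, never a reason to
    /// drop by itself, no matter how often it occurs.
    TooManyStreams,
    /// The connection was closed: clean up and reconnect.
    Closed,
}

fn should_drop_connection(err: &SubstreamError) -> bool {
    match err {
        SubstreamError::Io(_) | SubstreamError::Closed => true,
        SubstreamError::TooManyStreams => false,
    }
}
```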
@holzeis and I were able to reproduce a problem: when the taker becomes unavailable ungracefully, there are scenarios where the maker's endpoint is not notified with `ConnectionDropped`. We were able to reproduce this by killing the `taker` process on the raspi; the `maker` did not receive a connection drop in such a case.

This behavior results in the endpoint and protocol actors assuming that there is still a connection, and the maker then fails to establish substreams when a protocol execution is triggered.
At this point we are not totally sure if this problem can be solved by configuring how the TCP connection is managed by `yamux`, or if we are simply not reacting to an event from the underlying stack that would notify us. This can be investigated, but it might be hard, and it is unclear whether we can always rely on it.

So it might still be better to make the `Endpoint` resilient to "missed" connection drops, where we keep trying to open a substream but fail.
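For reference, the TCP-level angle mentioned above would roughly amount to enabling keepalive probes on the socket, e.g. via the `socket2` crate (untested sketch; whether `yamux` then surfaces an error we can react to is exactly the open question):

```rust
use std::time::Duration;
use socket2::{SockRef, TcpKeepalive};

/// Turn on TCP keepalive probes so a dead peer is eventually detected
/// at the socket level. Whether yamux then reports an error we can
/// react to is exactly what needs investigating.
fn enable_keepalive(stream: &std::net::TcpStream) -> std::io::Result<()> {
    let keepalive = TcpKeepalive::new().with_time(Duration::from_secs(60));
    SockRef::from(stream).set_tcp_keepalive(&keepalive)
}
```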
Some notes on solution design:

We did make an effort to overcome this problem before by introducing a failure counter (see #2175), but removed that feature because it had unwanted side effects (see #2207). I think the overall idea was still good; we just need to design it in a better way.
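A possible shape for a better-behaved counter: only consecutive failures count, and any success resets it (sketch only; the threshold is invented):

```rust
/// Counts consecutive failures to open a substream towards a peer.
/// Sketch only; the threshold is invented.
struct FailureCounter {
    consecutive_failures: u32,
    threshold: u32,
}

impl FailureCounter {
    fn new(threshold: u32) -> Self {
        Self { consecutive_failures: 0, threshold }
    }

    /// Any successful substream proves the connection is alive.
    fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }

    /// Returns `true` once the Endpoint should treat the connection as
    /// dead and re-establish it, even though no ConnectionDropped was
    /// ever received.
    fn record_failure(&mut self) -> bool {
        self.consecutive_failures += 1;
        self.consecutive_failures >= self.threshold
    }
}
```

Resetting on success should make the counter robust against occasional transient failures, though whether that avoids the side effects seen in #2207 would need to be verified.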