Here is my proposal that I tried to implement at some point but did not follow up on:

1. Record a timestamp whenever a protocol completes successfully. Initially we can do this for all protocols, for simplicity.
2. If no protocol has succeeded within the last 30 minutes (this should be good enough for making sure rollovers are picked up again), the taker actively drops the connection.

In case we actually run into problems with a maker not allowing a taker to connect twice under this logic, we should notice. But given that the taker actively drops the connection first, this should really not happen.
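A minimal sketch of the bookkeeping the proposal above would need (type and method names are invented for illustration, not the actual itchysats types):

```rust
use std::time::{Duration, Instant};

/// Illustrative only: tracks when the last protocol succeeded so the
/// taker can decide to actively drop a stale connection.
struct ConnectionHealth {
    last_success: Instant,
    max_idle: Duration,
}

impl ConnectionHealth {
    fn new() -> Self {
        Self {
            last_success: Instant::now(),
            // 30 minutes, as proposed above.
            max_idle: Duration::from_secs(30 * 60),
        }
    }

    /// Called whenever any protocol completes successfully.
    fn record_success(&mut self) {
        self.last_success = Instant::now();
    }

    /// Polled periodically by the taker; `true` means drop the
    /// connection and reconnect.
    fn should_drop(&self) -> bool {
        self.last_success.elapsed() > self.max_idle
    }
}
```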
After working on the proposal, it is unfortunately not as simple as expected.

Recording success is hard because we cannot just do it in the endpoint; each protocol actor would have to record it individually. I was working on a `protocol_health` actor that each protocol actor could notify upon success, but that is very intrusive.
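To make the intrusiveness concrete: every protocol actor would have to carry a handle to the `protocol_health` actor and remember to report at the right spot. Roughly like this (a sketch using plain tokio channels rather than the actual actor framework):

```rust
use tokio::sync::mpsc;

/// Event sent to the hypothetical protocol_health actor.
enum HealthEvent {
    ProtocolSucceeded { protocol: &'static str },
}

/// Every protocol actor needs to own one of these and call
/// `report_success` in the right place, which is what makes the
/// approach so intrusive.
#[derive(Clone)]
struct HealthHandle {
    tx: mpsc::UnboundedSender<HealthEvent>,
}

impl HealthHandle {
    fn report_success(&self, protocol: &'static str) {
        // If the health actor is gone we are shutting down anyway.
        let _ = self.tx.send(HealthEvent::ProtocolSucceeded { protocol });
    }
}
```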
It might be better to again consider a more pragmatic solution and just drop the connection if we fail to open substreams.
I think it would be much easier to implement something based on the particular error that we get. Some errors should not lead to dropping a connection, no matter how frequently they occur, whereas others might be able to tell us immediately that the connection is broken and needs to be re-established.

It would be really interesting to know exactly what error we get from yamux. In https://github.com/itchysats/itchysats/pull/2566#discussion_r931691524 I suggested a couple of candidates that could warrant dropping the connection.
Does this line provide the relevant information?
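To sketch what such error-based handling could look like: the enum below is a made-up placeholder, and the real decision table would have to be written against the actual `yamux` error variants once we know which ones we observe.

```rust
use std::io;

/// Placeholder for errors observed when opening a substream; the real
/// type to match on would be yamux's connection error.
enum SubstreamError {
    /// The socket reported an I/O error: the connection is almost
    /// certainly broken, drop and reconnect immediately.
    Io(io::Error),
    /// The remote is at its stream limit: transient, never a reason to
    /// drop by itself, no matter how often it occurs.
    TooManyStreams,
    /// The connection was closed: clean up and reconnect.
    Closed,
}

fn should_drop_connection(err: &SubstreamError) -> bool {
    match err {
        SubstreamError::Io(_) | SubstreamError::Closed => true,
        SubstreamError::TooManyStreams => false,
    }
}
```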
@holzeis and I were able to reproduce a problem: when the taker becomes unavailable ungracefully, there are scenarios where the maker's endpoint is not notified with `ConnectionDropped`. We were able to reproduce this by killing the `taker` process on the raspi; the `maker` did not receive a connection drop in such a case.

This behavior results in the endpoint and protocol actors assuming that there is still a connection, and the maker then fails to establish substreams when a protocol execution is triggered.
At this point we are not totally sure if this problem can be solved by configuring how the TCP connection is managed by `yamux`, or if we are simply not reacting to an event from the underlying stack that would notify us. This can be investigated, but it might be hard, and it is unclear whether we can always rely on it.

So it might still be better to make the `Endpoint` resilient to "missed" connection drops, where we keep trying to open a substream but fail.
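For reference, the TCP-level angle mentioned above would roughly amount to enabling keepalive probes on the socket, e.g. via the `socket2` crate (untested sketch; whether `yamux` then surfaces an error we can react to is exactly the open question):

```rust
use std::time::Duration;
use socket2::{SockRef, TcpKeepalive};

/// Turn on TCP keepalive probes so a dead peer is eventually detected
/// at the socket level. Whether yamux then reports an error we can
/// react to is exactly what needs investigating.
fn enable_keepalive(stream: &std::net::TcpStream) -> std::io::Result<()> {
    let keepalive = TcpKeepalive::new().with_time(Duration::from_secs(60));
    SockRef::from(stream).set_tcp_keepalive(&keepalive)
}
```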
Some notes on solution design:

We did make an effort to overcome this problem before by introducing a failure counter (see #2175), but removed that feature because it had unwanted side effects (see #2207). I think the overall idea was still good; we just need to design it in a better way.
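A possible shape for a better-behaved counter: only consecutive failures count, and any success resets it (sketch only; the threshold is invented):

```rust
/// Counts consecutive failures to open a substream towards a peer.
/// Sketch only; the threshold is invented.
struct FailureCounter {
    consecutive_failures: u32,
    threshold: u32,
}

impl FailureCounter {
    fn new(threshold: u32) -> Self {
        Self { consecutive_failures: 0, threshold }
    }

    /// Any successful substream proves the connection is alive.
    fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }

    /// Returns `true` once the Endpoint should treat the connection as
    /// dead and re-establish it, even though no ConnectionDropped was
    /// ever received.
    fn record_failure(&mut self) -> bool {
        self.consecutive_failures += 1;
        self.consecutive_failures >= self.threshold
    }
}
```

Resetting on success should make the counter robust against occasional transient failures, though whether that avoids the side effects seen in #2207 would need to be verified.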