get10101 / itchysats

CFDs on Bitcoin.
https://itchysats.network
MIT License
61 stars 17 forks source link

Improve `Endpoint` resilience to drop "stale" connections on the `maker` #2554

Closed da-kami closed 2 years ago

da-kami commented 2 years ago

@holzeis and me were able to reproduce a problem where the taker becomes unavailable ungracefully there are scenarios where the maker's endpoint is not notified with ConnectionDropped. We were able to reproduce this by killing the taker process on the raspi. The maker did not receive a connection drop in such a case.

This behavior results in the endpoint and protocol actors assuming that there is still a connection and the maker then fails to establish substreams when a protocol execution is triggered.

At this point we are not totally sure if this problem can be solved by configuring how the TCP connection is managed by yamux, or if we are not reacting to an event from the underlying stack that would notify us. This can be investigated, but it might be hard, and it is unclear if we can always rely on this.

So, it might still be better, to make the Endpoint resilient to "missed" connection drops, where we keep trying to open a substream but fail.


Some notes on solution design:

We did make an effort to overcome this problem before, by introducing an failure counter, see #2175, but removed this feature because it had unwanted side effects (see #2207). I think the overall idea of this feature was still good, but we just need to design it in a better way.

da-kami commented 2 years ago

Here is my proposal that I tried to implement at some point but did not follow up:

[1] In case we actually run into problems of a maker not allowing a taker to connect twice with this logic we should notice. But given that the taker actively drops the connection this should really not happen.


After working on the proposal, unfortunately it is not as simple as expected.

Recording success is hard, because we cannot just do it in the endpoint, but each protocol actor would have to record it individually. I was working on a protocol_health actor that each protocol actor could notify upon success, but that is very intrusive.

It might be better to, again, consider a more pragmatic solution and just drop the connection if we fail to open substreams.

luckysori commented 2 years ago

This behavior results in the endpoint and protocol actors assuming that there is still a connection and the maker then fails to establish substreams when a protocol execution is triggered.

It would be really interesting to know exactly what error we get on yamux. In this comment I suggested a couple of candidates that could warrant dropping the connection.

So, it might still be better, to make the Endpoint resilient to "missed" connection drops, where we keep trying to open a substream but fail.

I think it would be much easier to implement something based on the particular error that we get. Some errors should not lead to dropping a connection, no matter how frequent. Whereas others might be able to tell us immediately that the connection is broken and needs to be remade.

Restioson commented 2 years ago

It would be really interesting to know exactly what error we get on yamux. In https://github.com/itchysats/itchysats/pull/2566#discussion_r931691524 I suggested a couple of candidates that could warrant dropping the connection.

Does this line provide the relevant information?

luckysori commented 2 years ago

Does this line provide the relevant information?

Exactly, that's why I added it 😛