Stop closing transport channels on benign deserialisation failures

DaveCTurner commented 3 years ago

Today if we fail to deserialize an incoming message we typically don't consume the right number of bytes, which is treated as a fatal problem with the TCP channel. We close the channel, and therefore all associated channels, cancelling any other in-flight requests between the same pair of nodes and generally causing a good deal of collateral damage.

This seems overly dramatic in many cases. The message header should tell us enough about the message to be able to skip over it and notify any waiting listeners of the problem, so as long as we got the header ok we should be able to keep the channel open and carry on processing messages.

TBC we should be asserting that this doesn't happen, it indicates a bug, but sometimes these bugs slip through tests particularly if the incompatibility was introduced by a dependency and only occurs in an unusual config. It might be that tearing down all the channels on each bad message actively prevents working around the bug in a production environment (e.g. by completing an ongoing upgrade).

elasticmachine commented 3 years ago

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner commented 3 years ago

We (the @elastic/es-distributed team) discussed this today and concluded that yes once we have read the message header we can and should handle failures reading the message body more gracefully.

M-Shash commented 2 years ago

how can I work around it if possible ??

DaveCTurner commented 2 years ago

@M-Shash of you hit this situation then it's a definite bug, most likely fixed in a more recent version so the first thing to try is to upgrade. If it's not fixed, please report the bug.

elastic / elasticsearch

Stop closing transport channels on benign deserialisation failures #75334