elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Serialization bugs are painful to track down #38939

Open DaveCTurner opened 5 years ago

DaveCTurner commented 5 years ago

Occasionally we come across a serialization bug, particularly when nodes of multiple versions are involved. Here is a report of an issue in a cross-cluster search scenario involving 6.5.1 nodes, 5.6.2 nodes, and indices dating all the way back to 2.x. The exception we get is not very helpful:

[2019-02-14T23:53:52,630][WARN ][o.e.t.n.Netty4Transport  ] [IK-PRD-M3] exception caught on transport layer [[id: 0xd97a9d8c, L:/10.10.1.184:51594 - R:10.10.1.166/10.10.1.166:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [7719647], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.transport.TransportActionProxy$ProxyResponseHandler@7f2fcd88], error [false]; resetting

Today our two best options for diagnosing this are to reproduce it (often tricky without the user's exact setup) or to grab a packet capture and find a problematic message (which only works if they are not using TLS). It'd be awesome if we could capture and log the whole content of the problematic message so as to avoid messing around with tcpdump and so we can deal with this even if TLS is enabled.

This kind of issue tends to be easy for the user to reproduce, so this capture-and-log mechanism would not need to be running all the time: we could instead make it something that can be enabled dynamically.

elasticmachine commented 5 years ago

Pinging @elastic/es-distributed

jasontedor commented 5 years ago

We have a built-in way to dump packets here: set logger.org.elasticsearch.transport.netty4.ESLoggingHandler to trace. Ideally, set both logger.org.elasticsearch.transport.TransportLogger and logger.org.elasticsearch.transport.netty4.ESLoggingHandler to trace. This gives output like:

[2019-02-20T09:36:28,034][TRACE][o.e.t.n.ESLoggingHandler ] [totoro.home.tedor.me] [id: 0xaf6ecc34, L:/127.0.0.1:9300 - R:/127.0.0.1:59862] READ: 142B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 45 53 00 00 00 88 00 00 00 00 00 00 00 16 00 00 |ES..............|
|00000010| 7a 12 63 00 00 01 06 78 2d 70 61 63 6b 29 69 6e |z.c....x-pack)in|
|00000020| 64 69 63 65 73 3a 64 61 74 61 2f 72 65 61 64 2f |dices:data/read/|
|00000030| 78 70 61 63 6b 2f 63 63 72 2f 73 68 61 72 64 5f |xpack/ccr/shard_|
|00000040| 63 68 61 6e 67 65 73 00 00 01 06 6c 65 61 64 65 |changes....leade|
|00000050| 72 10 80 28 06 6c 65 61 64 65 72 16 78 42 31 72 |r..(.leader.xB1r|
|00000060| 69 69 36 30 54 36 4f 67 46 79 42 61 4a 6f 64 74 |ii60T6OgFyBaJodt|
|00000070| 63 77 00 16 77 66 6d 32 73 36 68 49 52 45 32 53 |cw..wfm2s6hIRE2S|
|00000080| 6e 41 73 6a 43 6f 54 64 6b 41 02 04 40 02       |nAsjCoTdkA..@.  |
+--------+-------------------------------------------------+----------------+
[2019-02-20T09:36:28,035][TRACE][o.e.t.TransportLogger    ] [totoro.home.tedor.me] Netty4TcpChannel{localAddress=/127.0.0.1:9300, remoteAddress=/127.0.0.1:59862} [length: 142, request id: 22, type: request, version: 8.0.0, action: indices:data/read/xpack/ccr/shard_changes] READ: 142B
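
For reference, these logger levels can usually be changed at runtime via the cluster settings API rather than by restarting nodes; a minimal sketch, assuming a version in which logger levels are dynamic cluster settings (set them back to null afterwards to restore the defaults):

    PUT _cluster/settings
    {
      "transient": {
        "logger.org.elasticsearch.transport.TransportLogger": "TRACE",
        "logger.org.elasticsearch.transport.netty4.ESLoggingHandler": "TRACE"
      }
    }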

DaveCTurner commented 5 years ago

We discussed this as a team. It was pointed out that the ESLoggingHandler mentioned above is Netty-specific, but could perhaps be moved into the core of Elasticsearch. It also produces a very large volume of logs, which means it wouldn't always be usable when tracking down a problem that can only be reproduced in a production environment.

We decided that the following two actions would be enough to resolve this issue:

  1. record the action name in exception messages resulting from deserialisation failures, for both requests and responses (a rough sketch follows this list).

  2. allow filtering by action name in the ESLoggingHandler (again, both for requests and responses) so that it is possible to dump traffic for a single action.

EDIT: adding a 3rd action item:

  3. avoid truncating log messages from the ESLoggingHandler so that we can see the contents of a whole message no matter how large it is.
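
As a rough illustration of the first item, the idea is simply to catch a deserialisation failure and rethrow it with the action name and request id attached, so that the resulting log line identifies the broken message. This is a hypothetical sketch, not the actual Elasticsearch implementation; the Reader interface below stands in for the real Writeable.Reader/StreamInput plumbing:

    import java.io.IOException;

    final class ActionAwareDeserialization {

        // Stand-in for Writeable.Reader<T> reading from a StreamInput;
        // only the exception-wrapping logic matters for this sketch.
        interface Reader<T> {
            T read() throws IOException;
        }

        // Wrap the read so that any failure names the action and request id,
        // turning a bare "Message not fully read" into something actionable.
        static <T> T readResponse(String action, long requestId, Reader<T> reader) throws IOException {
            try {
                return reader.read();
            } catch (Exception e) {
                throw new IllegalStateException("failed to deserialize response for action ["
                        + action + "] and requestId [" + requestId + "]", e);
            }
        }
    }

The same wrapping would apply on the request path, covering both halves of the first item.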

panwarab commented 5 years ago

I would like to take this up, @DaveCTurner.

andrershov commented 5 years ago

@abhiroj, are you still interested? You can work on this issue!

wangkhc commented 4 years ago

Hi @DaveCTurner and @andrershov, I've encountered the same issue. Can I pick this issue up, since @krillln is not following up?

DaveCTurner commented 4 years ago

Sure, go ahead @wangkhc, the steps laid out in https://github.com/elastic/elasticsearch/issues/38939#issuecomment-474858468 still look good to me.

original-brownbear commented 2 years ago

@DaveCTurner we recently fixed these messages so that they now log the action that the broken message came from. Maybe that's good enough? Do we really need to dump the bytes for one specific kind of request? The issue is usually obvious right away once you know which action is broken, isn't it?

DaveCTurner commented 2 years ago

I think it's often going to be obvious but not always - I still think we should be able to log the broken bytes if needed.