DaveCTurner opened this issue 5 years ago
Pinging @elastic/es-distributed
We have a built-in way to dump packets here. Set logger.org.elasticsearch.transport.netty4.ESLoggingHandler to trace. Ideally, set both logger.org.elasticsearch.transport.TransportLogger and logger.org.elasticsearch.transport.netty4.ESLoggingHandler to trace. This gives output like:
[2019-02-20T09:36:28,034][TRACE][o.e.t.n.ESLoggingHandler ] [totoro.home.tedor.me] [id: 0xaf6ecc34, L:/127.0.0.1:9300 - R:/127.0.0.1:59862] READ: 142B
+-------------------------------------------------+
| 0 1 2 3 4 5 6 7 8 9 a b c d e f |
+--------+-------------------------------------------------+----------------+
|00000000| 45 53 00 00 00 88 00 00 00 00 00 00 00 16 00 00 |ES..............|
|00000010| 7a 12 63 00 00 01 06 78 2d 70 61 63 6b 29 69 6e |z.c....x-pack)in|
|00000020| 64 69 63 65 73 3a 64 61 74 61 2f 72 65 61 64 2f |dices:data/read/|
|00000030| 78 70 61 63 6b 2f 63 63 72 2f 73 68 61 72 64 5f |xpack/ccr/shard_|
|00000040| 63 68 61 6e 67 65 73 00 00 01 06 6c 65 61 64 65 |changes....leade|
|00000050| 72 10 80 28 06 6c 65 61 64 65 72 16 78 42 31 72 |r..(.leader.xB1r|
|00000060| 69 69 36 30 54 36 4f 67 46 79 42 61 4a 6f 64 74 |ii60T6OgFyBaJodt|
|00000070| 63 77 00 16 77 66 6d 32 73 36 68 49 52 45 32 53 |cw..wfm2s6hIRE2S|
|00000080| 6e 41 73 6a 43 6f 54 64 6b 41 02 04 40 02 |nAsjCoTdkA..@. |
+--------+-------------------------------------------------+----------------+
[2019-02-20T09:36:28,035][TRACE][o.e.t.TransportLogger ] [totoro.home.tedor.me] Netty4TcpChannel{localAddress=/127.0.0.1:9300, remoteAddress=/127.0.0.1:59862} [length: 142, request id: 22, type: request, version: 8.0.0, action: indices:data/read/xpack/ccr/shard_changes] READ: 142B
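For reference, both loggers can also be switched at runtime through the cluster settings API; a minimal sketch, assuming a node listening on localhost:9200 (put the settings back to null afterwards to turn the tracing off again):

# Enable wire-level trace logging on the fly via the cluster settings API.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.org.elasticsearch.transport.TransportLogger": "trace",
    "logger.org.elasticsearch.transport.netty4.ESLoggingHandler": "trace"
  }
}'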
We discussed this as a team. It was pointed out that the ESLoggingHandler mentioned above is Netty-specific, but could perhaps be moved into the core of Elasticsearch. It also produces a very large volume of logs, which means it wouldn't always be usable when tracking down a problem that can only be reproduced in a production environment.
We decided that the following two actions would be enough to resolve this issue:
1. record the action name in exception messages resulting from deserialisation failures, for both requests and responses.
2. allow filtering by action name in the ESLoggingHandler (again, both for requests and responses) so that it is possible to dump traffic for a single action (see the grep sketch after this list).
EDIT: adding a 3rd action item:

3. avoid truncating the output of the ESLoggingHandler so that we can see the contents of a whole message no matter how large it is.
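Until that filtering exists, the closest workaround is to grep the node log for the TransportLogger summary lines quoted above, since those carry the action name on a single line; a rough sketch, with the log path and action name purely as examples:

# Pick out the one-line message summaries for a single action from the node's log.
# The multi-line hex dumps from ESLoggingHandler cannot be isolated this way,
# which is exactly what the filtering action item would address.
grep 'action: indices:data/read/xpack/ccr/shard_changes' /var/log/elasticsearch/elasticsearch.log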
I would like to take this up. @DaveCTurner, may I?
@abhiroj are you still interested? You can work on this issue!
Hi @DaveCTurner and @andrershov, I've encountered the same issue. Can I pick this one up, since @krillln is not following up?
Sure, go ahead @wangkhc, the steps laid out in https://github.com/elastic/elasticsearch/issues/38939#issuecomment-474858468 still look good to me.
@DaveCTurner we recently fixed these messages so that they now log the action that the broken message came from. Maybe that's good enough? Do we really need to dump the bytes for one specific kind of request? The problem is usually obvious straight away once you know which action is broken, isn't it?
I think it's often going to be obvious but not always - I still think we should be able to log the broken bytes if needed.
Occasionally we come across a serialization bug, particularly when nodes of multiple versions are involved. Here is a report of an issue in a cross-cluster search scenario involving 6.5.1 nodes, 5.6.2 nodes, and indices dating all the way back to 2.x. The exception we get is not very helpful:
Today our two best options for diagnosing this are to reproduce it (often tricky without the user's exact setup) or to grab a packet capture and find a problematic message (which only works if they are not using TLS). It'd be awesome if we could capture and log the whole content of the problematic message so as to avoid messing around with tcpdump, and so we can deal with this even if TLS is enabled.

This kind of issue tends to be easy for the user to reproduce, so this capture-and-log thing would not need to happen all the time: we could instead consider something that can be enabled dynamically.
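For concreteness, the packet-capture route mentioned above usually amounts to something like the following, assuming the default transport port of 9300 (the interface and output file are just examples); it is exactly this step that becomes useless once TLS is enabled:

# Capture transport-layer traffic for later inspection (e.g. in Wireshark).
# Only useful while the transport layer is not encrypted with TLS.
tcpdump -i any -s 0 -w transport.pcap 'tcp port 9300'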