elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Data corruption on transport protocol on 1/16 of the nodes #78338

Open bra-fsn opened 3 years ago

bra-fsn commented 3 years ago

Elasticsearch version (bin/elasticsearch --version): Version: 6.8.18, Build: oss/tar/aca2329/2021-07-28T16:06:05.232873Z, JVM: 11.0.12

Plugins installed: analysis-icu

JVM version (java -version):

openjdk version "11.0.12" 2021-07-20
OpenJDK Runtime Environment (build 11.0.12+7-1)
OpenJDK 64-Bit Server VM (build 11.0.12+7-1, mixed mode)

OS version (uname -a if on a Unix-like system): FreeBSD fmfe00 13.0-STABLE FreeBSD 13.0-STABLE #4 stable/13-n245004-ea1f0ecc15bc-dirty: Sun Mar 21 16:20:19 CET 2021 root@builder:/usr/obj/usr/src/amd64.amd64/sys/BOOTCLNT amd64

Description of the problem including expected versus actual behavior: On a cluster with 16 data nodes, strange errors occur whenever one particular node is part of the cluster, though -until now- only with one of the indices. We get corrupt-data errors both from the Java (transport) client and from Elasticsearch itself. Typical errors on the Elasticsearch side:

TranslogCorruptedException[translog from source [recovery] is corrupted, checksum verification failed - expected: 0x57ede47c, got: 0x895d92fc
java.io.StreamCorruptedException: invalid internal transport message format

and the like, all pointing to some kind of corruption on that node, happening on the transport protocol. However, when I move the index in question off that node, everything works fine.

Steps to reproduce: The easiest way to reproduce the error:

  1. create a new index in the cluster with number_of_replicas: 2 and with 40 shards (same as the source index), excluding that node
  2. reroute the primary shards to the given node
  3. start a reindex of the index in question (done with a multi-threaded Python script that uses bulk operations)
  4. after a while, transport_serialization_exception errors start to appear

So far I have been able to reproduce this with a high success rate.
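For steps 1 and 2, this is roughly what I run with the Python client (a minimal sketch only; the index/node names and the hard-coded from_node are placeholders, not what the actual script does):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
SUSPECT_NODE = "fmfe08"   # the node that shows the corruption

# 1. create the target index (40 shards, 2 replicas), initially excluding the suspect node
es.indices.create(index="accounts_681_copy", body={
    "settings": {
        "number_of_shards": 40,
        "number_of_replicas": 2,
        "index.routing.allocation.exclude._name": SUSPECT_NODE,
    }
})

# 2. lift the exclusion, then move the primaries onto the suspect node
es.indices.put_settings(index="accounts_681_copy", body={
    "index.routing.allocation.exclude._name": None,
})
for shard in range(40):
    es.cluster.reroute(body={"commands": [{
        "move": {
            "index": "accounts_681_copy",
            "shard": shard,
            "from_node": "fmfe01",   # wherever that shard's primary currently sits
            "to_node": SUSPECT_NODE,
        }
    }]})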

Provide logs (if relevant):

During the reindex process, I get these errors for bulk ops:

{'create': {'_id': '15264381',
            '_index': 'accounts_681',
            '_type': '_doc',
            'error': {'caused_by': {'reason': 'unexpected byte [0x08]',
                                    'type': 'illegal_state_exception'},
                      'reason': 'Failed to deserialize response from handler '
                                '[org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]',
                      'type': 'transport_serialization_exception'},
            'status': 500}}
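For context, per-item failures like the one above surface in the reindex roughly like this (a simplified, single-threaded sketch using the elasticsearch-py scan/streaming_bulk helpers; this is not the actual script, which is multi-threaded, and the source index name is a placeholder):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def actions():
    # stream source docs and turn them into create ops on the target index
    for hit in helpers.scan(es, index="accounts_681_src", query={"query": {"match_all": {}}}):
        yield {
            "_op_type": "create",
            "_index": "accounts_681",
            "_id": hit["_id"],
            "_source": hit["_source"],
        }

# raise_on_error=False so individual failures (like the
# transport_serialization_exception above) come back as items we can log
for ok, item in helpers.streaming_bulk(es, actions(), raise_on_error=False):
    if not ok:
        print(item)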

So far I could reproduce this only when the failing node is the source node (i.e. it holds the primary shard; if it holds only the replicas, the reindex process finishes).

In the data node's log I can see errors which show that there may be some corruption on the transport channel between the primary and the replica: error1.txt

This has many variations, like this: error2.txt

We also use the Java transport client to interact with Elasticsearch, and it shows mangled JSON being read. We occasionally (I guess when reading from the problematic node, and maybe under some other conditions) get results like this (it's just part of the whole JSON):

 "eventDates": {
    "registrationDatnatures": {}, 
    "autoReply": {}, 
    "eated": "2016-08-02T11:43:11Z",  

Here, registrationDatnatures should be registrationDate and eated should be created. While the above happens to be syntactically valid JSON, its content is not. Sometimes there are illegal characters (UTF-8 decoding errors) or JSON syntax errors. Pretty scary. If I reroute the shards off that node, these errors disappear; however, during normal operations we get these: error3.txt, which shows that another node couldn't recover from the failing node.
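(By "reroute the shards off that node" I mean the standard cluster-level allocation exclusion; a minimal sketch, with the node name as a placeholder:)

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# drain all shards off the suspect node; the cluster relocates them to the other 15
es.cluster.put_settings(body={
    "transient": {"cluster.routing.allocation.exclude._name": "fmfe08"}
})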

The machine in question is the same as the other 15 physical nodes, and there is nothing suspicious in the OS logs that would point towards a bad CPU or (ECC) RAM. Previously the cluster was running on JDK 15, but it has since been rolled back to 11 to see whether that was causing the problem; the issue still persists.

All machines are identical and run from the same image, so I'm not sure what could cause this error, but I think that if it were a hardware fault, it would normally have other effects as well. Any ideas/debug settings/code are welcome.
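(One debug setting I'm considering, assuming the dynamic logger settings and the transport tracer described in the Elasticsearch docs apply here, is to trace the shard-level bulk transport actions on both ends, e.g.:)

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# both settings are dynamic; restrict the tracer to the shard-level bulk actions
# so the logs stay manageable
es.cluster.put_settings(body={
    "transient": {
        "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE",
        "transport.tracer.include": "indices:data/write/bulk*",
    }
})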

danhermann commented 3 years ago

@bra-fsn, because there's a high likelihood of a hardware, network, or environmental cause for this problem, it seems more like a support question that we typically direct to our discussion forum rather than a bug report. Based on a quick reading, I would note two things -- first, we do not test or support Elasticsearch on any BSD operating systems so while we would not necessarily expect these kinds of problems, we do no testing that would catch problems that could be related to differences in either the network stack or filesystem. Second, there have been a lot of bug fixes in the 7.x versions around both networking and reindexing. It would be easier to classify this as a bug if it could be reproduced on a supported operating system and more recent Elasticsearch version.

bra-fsn commented 3 years ago

@bra-fsn, because there's a high likelihood of a hardware, network,

These machines are server-grade hardware with ECC RAM etc. I'm not saying I can rule this out, but the machine runs rock stable and there is no sign of any failure there. Can a network failure cause such corruption on a local, checksummed Ethernet network over TCP? Anyway, the error counters look the same on all the machines; I can't see any abnormal behavior here.

Based on a quick reading, I would note two things -- first, we do not test or support Elasticsearch on any BSD operating systems

I know, and we have caught errors here that were harder to reproduce in a Linux environment. :)

Second, there have been a lot of bug fixes in the 7.x versions around both networking and reindexing. It would be easier to classify this as a bug if it could be reproduced on a supported operating system and more recent Elasticsearch version.

We're working on it, but sadly we can't do that yet.

bra-fsn commented 3 years ago

@danhermann ok, I can reproduce this with 7.15.0 reliably. What I did:

I can see various data-corruption-like issues, for example:

[2021-09-30T15:33:58,432][WARN ][o.e.c.r.a.AllocationService] [fmfe01] failing shard [failed shard, shard [accounts_681][4], node[YMKsToN-T6G445NCT73SzQ], [R], s[STARTED], a[id=cs4dkv7jSQW989Ouh5TbxA], message [failed to perform indices:data/write/bulk[s] on replica [accounts_681][4], node[YMKsToN-T6G445NCT73SzQ], [R], s[STARTED], a[id=cs4dkv7jSQW989Ouh5TbxA]], failure [RemoteTransportException[[fmfe08][10.6.145.201:9301][indices:data/write/bulk[s][r]]]; nested: IOException[Invalid vInt ((ffffffff & 0x7f) << 28) | ffffffff]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [fmfe08][10.6.145.201:9301][indices:data/write/bulk[s][r]]
Caused by: java.io.IOException: Invalid vInt ((ffffffff & 0x7f) << 28) | ffffffff
    at org.elasticsearch.common.io.stream.StreamInput.throwOnBrokenVInt(StreamInput.java:245) ~[elasticsearch-7.15.0.jar:7.15.0]

and

MapperParsingException[failed to parse field [displayName] of type [keyword] in document with id '20671981'. Preview of field's value: '']; nested: IOException[Invalid UTF-8 start byte 0x88
 at [Source: (ByteArrayInputStream); line: 1, column: 135]]; ], markAsStale [true]]

and

MapperParsingException[failed to parse field [password] of type [keyword] in document with id '49875026'. Preview of field's value: '']; nested: IOException[Invalid UTF-8 start byte 0xf8
 at [Source: (ByteArrayInputStream); line: 1, column: 649]]; ], markAsStale [true]]
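If I read the StreamInput code right, the "Invalid vInt" error above is about the variable-length integer framing on the transport stream (7 payload bits per byte, the high bit marks continuation, at most 5 bytes for a 32-bit value), so a run of 0xff bytes from a garbled buffer can never decode. A simplified sketch of that check, not the actual Elasticsearch code:

def read_vint(buf):
    # 7 data bits per byte, high bit = continuation, at most 5 bytes
    value = 0
    for i in range(5):
        b = buf[i]
        value |= (b & 0x7F) << (7 * i)
        if (b & 0x80) == 0:          # continuation bit clear -> done
            return value
    # a 5th byte that still has its continuation bit set (e.g. five 0xff bytes
    # from a corrupted stream) cannot be a valid 32-bit vInt
    raise ValueError("invalid vInt: too many continuation bytes")

print(read_vint(bytes([0x8c, 0x02])))        # 268
try:
    read_vint(bytes([0xff] * 5))             # all-0xff framing is rejected
except ValueError as e:
    print(e)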

I made a tcpdump of the reindexing job (it's a Python-based multi-thread/multi-process bulk reindexer) and verified that the submitted fields are correct. For example, for the error above with document ID 49875026, this create bulk op was submitted:

{"create":{"_id":"49875026","_index":"accounts_681","routing":"49875026","_type":"_doc"}}
[... data ...]

and according to the tcpdump, Elasticsearch returned this:

{"create":{"_index":"accounts_681","_type":"_doc","_id":"51494012","_version":1,"result":"created","_shards":{"total":3,"successful":2,"failed":1,"failures":[{"_index":"accounts_681","_shard":36,"_node":"vm1XbWvvRwGBRLdlbKWTew","reason":{"type":"mapper_parsing_exception","reason":"failed to parse field [password] of type [keyword] in document with id '49875026'. Preview of field's value: ''","caused_by":{"type":"i_o_exception","reason":"Invalid UTF-8 start byte 0xf8\n at [Source: (ByteArrayInputStream); line: 1, column: 649]"}},"status":"BAD_REQUEST","primary":false}]}

There seems to be something very strange going on; at the very least, I wouldn't expect document ID 49875026 to appear in a response made for doc 51494012.
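Since bulk response items are supposed to come back in the same order as the request, one way to check for this mismatch would be to replay a captured NDJSON payload and compare IDs position by position (a sketch; the file name is a placeholder, and I'm assuming the 7.x Python client, which accepts a raw NDJSON string as the bulk body):

import json
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# a captured bulk body: alternating action and document lines (create ops only)
with open("bulk_payload.ndjson") as f:
    payload = f.read()

sent_ids = [json.loads(line)["create"]["_id"]
            for line in payload.splitlines()[::2]]   # every other line is an action line

resp = es.bulk(body=payload)

# item N of the response should describe action N of the request
for sent_id, item in zip(sent_ids, resp["items"]):
    returned_id = item["create"]["_id"]
    if sent_id != returned_id:
        print(f"position mismatch: sent {sent_id}, got {returned_id}")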

If I copy either bulk op out of the stream and execute it manually, the documents are inserted without error. BTW, when running the same reindex twice, the errors are always different and hit different docs. The doc does get indexed despite being returned by Elasticsearch as a failure, and -I guess- it gets re-sent to the replica, so the cluster quickly switches from YELLOW back to GREEN. Again, this seems to affect only the outgoing transport stream (replication and the Java client are where we observe these corruptions), and it sometimes works, sometimes not. It seems to depend on the level of stress: I can reindex fine on just one thread - at least it's so slow that I have never seen the error that way :). Things get worse when I reindex fast, on multiple connections and with a lot of bulk ops in parallel.

So could you please help me figure out how to debug this? It may be OS-specific and could well be a bug there, but it may also be a bug in Elasticsearch/Netty/whatever that is just harder to reproduce on another OS, so it would be a net win to hunt it down here, where I can easily reproduce it. :)