Docs indexed using UTF16-LE encoding are not re-encoded properly in UTF8 JSON responses

Elasticsearch accepts JSON documents encoded with UTF16-LE, although its JSON responses are always encoded with UTF8. However, when returning the raw document source, it copies the original source bytes verbatim to the output, which produces a malformed response if those bytes are not UTF8-encoded.

For instance, here is a UTF16-LE encoded doc:

$ echo 'ewAiAGYAbwBvACIAOgAiAGIAYQByACIAfQA=' | base64 -D | xxd
00000000: 7b00 2200 6600 6f00 6f00 2200 3a00 2200  {.".f.o.o.".:.".
00000010: 6200 6100 7200 2200 7d00                 b.a.r.".}.
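
For anyone wanting to build a similar test doc, here is a minimal sketch of how one can be produced. It assumes `iconv` and `base64` are installed; the `encoded` variable name is purely illustrative. With the `{"foo":"bar"}` doc it should reproduce the base64 string used in this report.

```shell
# Transcode a small UTF-8 JSON doc to UTF-16LE (iconv's name for the
# encoding; no byte-order mark is emitted) and base64 the raw bytes so
# they can be pasted into a terminal safely.
encoded=$(printf '%s' '{"foo":"bar"}' | iconv -f UTF-8 -t UTF-16LE | base64)
echo "$encoded"
```

Note that `base64 -D` is the macOS spelling of the decode flag; on GNU coreutils the equivalent is `-d`, and `--decode` is accepted by both.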

Here I write this doc into a new index:

$ echo 'ewAiAGYAbwBvACIAOgAiAGIAYQByACIAfQA=' | base64 -D | curl --silent 'http://localhost:9200/testindex/_doc?refresh&pretty' -H'Content-type: application/json' --data-binary @-
{
  "_index" : "testindex",
  "_type" : "_doc",
  "_id" : "a_PFrYoB9nsREFcmq1__",
  "_version" : 1,
  "result" : "created",
  "forced_refresh" : true,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

And here are the exact bytes returned when searching this index:

$ curl --silent 'http://localhost:9200/testindex/_search' | xxd
00000000: 7b22 746f 6f6b 223a 302c 2274 696d 6564  {"took":0,"timed
00000010: 5f6f 7574 223a 6661 6c73 652c 225f 7368  _out":false,"_sh
00000020: 6172 6473 223a 7b22 746f 7461 6c22 3a31  ards":{"total":1
00000030: 2c22 7375 6363 6573 7366 756c 223a 312c  ,"successful":1,
00000040: 2273 6b69 7070 6564 223a 302c 2266 6169  "skipped":0,"fai
00000050: 6c65 6422 3a30 7d2c 2268 6974 7322 3a7b  led":0},"hits":{
00000060: 2274 6f74 616c 223a 7b22 7661 6c75 6522  "total":{"value"
00000070: 3a31 2c22 7265 6c61 7469 6f6e 223a 2265  :1,"relation":"e
00000080: 7122 7d2c 226d 6178 5f73 636f 7265 223a  q"},"max_score":
00000090: 312e 302c 2268 6974 7322 3a5b 7b22 5f69  1.0,"hits":[{"_i
000000a0: 6e64 6578 223a 2274 6573 7469 6e64 6578  ndex":"testindex
000000b0: 222c 225f 7479 7065 223a 225f 646f 6322  ","_type":"_doc"
000000c0: 2c22 5f69 6422 3a22 615f 5046 7259 6f42  ,"_id":"a_PFrYoB
000000d0: 396e 7352 4546 636d 7131 5f5f 222c 225f  9nsREFcmq1__","_
000000e0: 7363 6f72 6522 3a31 2e30 2c22 5f73 6f75  score":1.0,"_sou
000000f0: 7263 6522 3a7b 0022 0066 006f 006f 0022  rce":{.".f.o.o."
00000100: 003a 0022 0062 0061 0072 0022 007d 007d  .:.".b.a.r.".}.}
00000110: 5d7d 7d                                  ]}}

Note the NUL bytes in the `_source` towards the end of the response. They shouldn't be there: the rest of the response is UTF8, and with the UTF16-LE source bytes embedded in it the response is not valid JSON.
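
A quick way to spot this without reading a hex dump is to count NUL bytes in the response. The following is a rough sketch only: `count_nul_bytes` and `simulated_broken_response` are hypothetical helper names, and the simulated response is a shortened stand-in for the real one shown above, so that the snippet runs without a cluster.

```shell
# Print the number of NUL (0x00) bytes read from stdin.
count_nul_bytes() {
  tr -cd '\000' | wc -c | tr -d ' '
}

# Stand-in for the tail of the broken response: a UTF-8 wrapper around
# source bytes that are still UTF-16LE, as in the xxd dump above.
simulated_broken_response() {
  printf '%s' '{"_source":'
  printf '%s' '{"foo":"bar"}' | iconv -f UTF-8 -t UTF-16LE
  printf '%s' '}'
}

nul_count=$(simulated_broken_response | count_nul_bytes)
clean_count=$(printf '%s' '{"_source":{"foo":"bar"}}' | count_nul_bytes)
echo "broken: $nul_count NUL bytes, clean: $clean_count NUL bytes"
```

Against a live node the same check would be `curl --silent 'http://localhost:9200/testindex/_search' | count_nul_bytes`; any non-zero count suggests that non-UTF8 source bytes were copied through.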

One possible workaround is to force ES to parse and re-encode the doc, for instance by setting ?filter_path=*:

$ curl --silent 'http://localhost:9200/testindex/_search?filter_path=*' | xxd
00000000: 7b22 746f 6f6b 223a 312c 2274 696d 6564  {"took":1,"timed
00000010: 5f6f 7574 223a 6661 6c73 652c 225f 7368  _out":false,"_sh
00000020: 6172 6473 223a 7b22 746f 7461 6c22 3a31  ards":{"total":1
00000030: 2c22 7375 6363 6573 7366 756c 223a 312c  ,"successful":1,
00000040: 2273 6b69 7070 6564 223a 302c 2266 6169  "skipped":0,"fai
00000050: 6c65 6422 3a30 7d2c 2268 6974 7322 3a7b  led":0},"hits":{
00000060: 2274 6f74 616c 223a 7b22 7661 6c75 6522  "total":{"value"
00000070: 3a31 2c22 7265 6c61 7469 6f6e 223a 2265  :1,"relation":"e
00000080: 7122 7d2c 226d 6178 5f73 636f 7265 223a  q"},"max_score":
00000090: 312e 302c 2268 6974 7322 3a5b 7b22 5f69  1.0,"hits":[{"_i
000000a0: 6e64 6578 223a 2274 6573 7469 6e64 6578  ndex":"testindex
000000b0: 222c 225f 7479 7065 223a 225f 646f 6322  ","_type":"_doc"
000000c0: 2c22 5f69 6422 3a22 615f 5046 7259 6f42  ,"_id":"a_PFrYoB
000000d0: 396e 7352 4546 636d 7131 5f5f 222c 225f  9nsREFcmq1__","_
000000e0: 7363 6f72 6522 3a31 2e30 2c22 5f73 6f75  score":1.0,"_sou
000000f0: 7263 6522 3a7b 2266 6f6f 223a 2262 6172  rce":{"foo":"bar
00000100: 227d 7d5d 7d7d                           "}}]}}
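
Another option, where the client controls the indexing path, may be to transcode to UTF8 before the doc ever reaches ES. This is a sketch under that assumption, not something verified against every ES version; `utf8_doc` is an illustrative variable name and `iconv` is assumed to be available.

```shell
# Decode the UTF-16LE sample doc from this report and transcode it to UTF-8.
utf8_doc=$(echo 'ewAiAGYAbwBvACIAOgAiAGIAYQByACIAfQA=' | base64 --decode | iconv -f UTF-16LE -t UTF-8)
echo "$utf8_doc"

# The UTF-8 bytes could then be indexed the same way as before, e.g.:
#   printf '%s' "$utf8_doc" | curl --silent \
#     'http://localhost:9200/testindex/_doc?refresh&pretty' \
#     -H 'Content-type: application/json' --data-binary @-
```

This sidesteps the verbatim copy because the stored source bytes are already UTF8, but it does nothing for docs that were indexed earlier in UTF16-LE.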
