Elasticsearch accepts JSON documents encoded as UTF-16LE, although its JSON responses are always encoded as UTF-8. However, when returning raw document source it copies the original source bytes verbatim into the output, which breaks if those bytes are not UTF-8-encoded.
For instance, here is a UTF-16LE encoded doc:
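(The original doc's bytes aren't reproduced here; as a sketch, a UTF-16LE payload can be built from any small JSON body, e.g. a stand-in `{"foo": "bar"}`:)

```python
import json

# Any JSON body will do; {"foo": "bar"} is a stand-in.
doc = json.dumps({"foo": "bar"})
# UTF-16LE interleaves a NUL byte after every ASCII character.
payload = doc.encode("utf-16-le")
```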
Here is me writing this doc into a new index:
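(A minimal sketch of the indexing step using only the standard library; the index name `test-index` and the localhost endpoint are assumptions. The key point is that the UTF-16LE bytes go over the wire untouched:)

```python
import json
import urllib.request

payload = json.dumps({"foo": "bar"}).encode("utf-16-le")
# Hypothetical index name and local cluster address.
req = urllib.request.Request(
    "http://localhost:9200/test-index/_doc/1",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# urllib.request.urlopen(req)  # uncomment to run against a live cluster
```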
And here are the exact bytes returned by an attempt to retrieve the contents of this index:
Note the NUL bytes in the source towards the end of the response. They shouldn't be there; this isn't valid JSON.
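The failure mode is easy to reproduce offline: splice UTF-16LE source bytes into an otherwise UTF-8 response body, as Elasticsearch does, and the result no longer parses (a sketch with a stand-in doc, not the exact response above):

```python
import json

source = json.dumps({"foo": "bar"}).encode("utf-16-le")
# Simulate ES copying the stored source verbatim into its UTF-8 response.
response = b'{"_index":"test-index","_source":' + source + b"}"
try:
    json.loads(response)
    parses = True
except ValueError:  # JSONDecodeError: NUL bytes are not valid in JSON
    parses = False
```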
One possible workaround is to force ES to parse and re-encode the doc, for instance by setting `?filter_path=*` on the request.
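Filtering the response forces ES to build it by parsing the stored source rather than copying bytes, so the source is re-serialized as UTF-8. The same parse-and-re-encode round trip, sketched offline with a stand-in doc:

```python
import json

source = json.dumps({"foo": "bar"}).encode("utf-16-le")
# Parse the stored bytes, then re-serialize: effectively what the
# source-filtering workaround makes ES do, yielding clean UTF-8.
reencoded = json.dumps(json.loads(source.decode("utf-16-le"))).encode("utf-8")
```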