Closed lucabelluccini closed 1 month ago
Pinging @elastic/es-search (Team:Search)
Pinging @elastic/es-distributed (Team:Distributed)
I wonder if we have an easy way to fix this issue?
IIUC, we need to
Is this enough and is there a good way to find bad mapping indices? (I wonder if it's not only limited to Unicode, but may expand to more patterns, which might be not that easy to check until we got rejected by the failure.)
I also feel, it might be great if we block it at the index creation stage, because it doesn’t really make logical sense to make it "OK to snapshot" but "NG to restore/mount".
Had a chat with @Leaf-Lin, the reason behind this seems to be it's not great to prevent users from creating fields that are based on Unicode, because users in a different language would have fields that are completely normal to them, but ES is unable to process it correctly.
However, given it's causing discrepancy behavior in “index creation” and “snapshot/restore”, which probably ideally best to get things aligned.
Is there a way to across this? e.g. make an "encoding logic" internally to avoid using Unicode directly? (like URL encoding is widely used in lots of applications)
This seems to be a SMILE bug, or at least something that's not supported properly in SMILE. The following test fails for SMILE (and CBOR) but passes for JSON and YAML.
diff --git a/server/src/test/java/org/elasticsearch/common/xcontent/BaseXContentTestCase.java b/server/src/test/java/org/elasticsearch/common/xcontent/BaseXContentTestCase.java
index 96b93568c66..9ab3cce8aa8 100644
--- a/server/src/test/java/org/elasticsearch/common/xcontent/BaseXContentTestCase.java
+++ b/server/src/test/java/org/elasticsearch/common/xcontent/BaseXContentTestCase.java
@@ -140,6 +140,7 @@ public abstract class BaseXContentTestCase extends ESTestCase {
expectUnclosedException(() -> BytesReference.bytes(builder().startObject().field("foo")));
assertResult("{'foo':'bar'}", () -> builder().startObject().field("foo").value("bar").endObject());
+ assertResult("{'\\u0000':'','\\u0000\\u0000':''}", () -> builder().startObject().field("\0", "").field("\0\0", "").endObject());
}
public void testNullField() throws IOException {
The trouble is that the SMILE parser treats these field names as short ASCII strings which get cached to avoid unnecessary instantiation, but the cache is keyed by an integer representation of the string and both of these strings map to 0
.
I don't think this is a general Unicode problem, it's only going to affect field names that are made up of some short sequence of NUL bytes. I have reported this at https://github.com/FasterXML/jackson-dataformats-binary/issues/312.
Can we perhaps forbid field names containing NUL bytes entirely? Are they ever anything but a mistake?
Thank you David for the prompt analysis.
Can we perhaps forbid field names containing NUL bytes entirely? Are they ever anything but a mistake?
It could be a nice feature. Most times I've seen it was a client (Fluentd or other products) trying to index garbage data. I would always allow an escape hatch (an index setting or a cluster setting ?).
Any news on this? we have a not restorable index cause of this bug
The Jackson bug is fixed upstream, but a fixed version (≥2.14.0) is yet to be released.
Starting from Elasticsearch 8.6 we upgraded jackson to 2.14. This should be fixed now.
Elasticsearch Version
7.14.2, 7.17.0 (probably earlier too)
Installed Plugins
No response
Java Version
bundled
OS Version
not relevant
Problem Description
If a user accidentally ingests JSON documents which have weird/malformed bodies, the generated mappings due to dynamic mapping will make the snapshot of the index fail on restore.
This can happen also during ILM (when the index is moved to mounted phases when using searchable snapshots) or during a normal snapshot restore operation.
Steps to Reproduce
Both mounting and restore operations end up with:
This can happen also during ILM (when the index is moved to mounted phases when using searchable snapshots).
Logs (if relevant)
No response