lucabelluccini opened 3 years ago
As a side note:

1) ~It is no longer possible to stop the transform job once it enters such a state.~ You have to force stop it with `_stop?force`:

```json
{
  "error" : {
    "root_cause" : [
      {
        "type" : "status_exception",
        "reason" : "Unable to stop transform [...] as it is in a failed state with reason [Failed to index documents into destination index due to permanent error: ...]"
      }
    ]
  }
}
```

2) ~You cannot delete the transform as it is not stopped.~ You can force delete it with the `?force` parameter.
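For reference, the two force operations look like this against the transform REST API (7.5+ paths; older 7.x releases use the `_data_frame/transforms` endpoints), with `<transform_id>` standing in for the failed transform:

```
POST _transform/<transform_id>/_stop?force=true
DELETE _transform/<transform_id>?force=true
```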
Pinging @elastic/es-analytics-geo (Team:Analytics)
Pinging @elastic/ml-core (Team:ML)
Adding the transform label, too. If `flattened` by design disallows `\0`, transform should sanitize the output.
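As a rough illustration of what sanitizing could mean (this is not an existing transform option; the recursive walk and the replacement character are assumptions), a Python sketch that rewrites keys before a document reaches a `flattened` field:

```python
def sanitize_keys(obj, replacement="_"):
    """Recursively replace the reserved \\0 byte in mapping keys.

    Values are left alone: a \\0 inside a value is harmless, since only
    the key side of the flattened key/value pair causes the failure.
    """
    if isinstance(obj, dict):
        return {
            key.replace("\0", replacement): sanitize_keys(value, replacement)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return [sanitize_keys(item, replacement) for item in obj]
    return obj
```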
This shouldn't affect aggs. Perhaps @jtibshirani might be interested, as it relates to flattened fields?
Pinging @elastic/es-search (Team:Search)
Do we support a bell character in the field? Or a vertical tab? What about smart quotes?
For the issue at hand: only `\0` is reserved; everything else is just handled as a bytes ref / string. However, before that code is called, the string is parsed using an XContent parser, which imposes its own restrictions on field names. Those restrictions are meant for ordinary field names, and it is wrong that we apply them to `flattened`, too. We shouldn't use this parser, but one without those checks. That is technically a different issue, see #90011. Nevertheless, we should fix both issues.
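To make the ambiguity concrete, here is a rough Python model of the encoding (the real implementation is the Java `FlattenedFieldParser` linked above; this is only an illustration):

```python
SEPARATOR = "\0"  # the reserved byte, see FlattenedFieldParser

def encode(key, value):
    # flattened indexes each leaf as a single keyed term: "<key>\0<value>"
    return key + SEPARATOR + value

def decode(term):
    # only the FIRST \0 is treated as the separator, so a \0 inside the
    # value is harmless, but a \0 inside the key shifts the split point
    key, _, value = term.partition(SEPARATOR)
    return key, value

print(decode(encode("key", "value")))       # ('key', 'value')
print(decode(encode("bad\0key", "value")))  # ('bad', 'key\x00value'), corrupted
```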
Possible fix
What's missing is proper escaping. This is a problem in the initial design; if escaping is added, it must go in without breaking backwards compatibility.
Solution A
I think `\0` can be escaped as `\0\0`: a single `\0` is interpreted as a separator, but `\0\0` means a zero byte in the key. This should work for all existing indexed data, as obviously no data exists with a zero byte in the key. But there is one corner case that could exist, although it is highly unlikely: if a value starts with a zero byte, it gets read differently with escaping.

If the initial doc contains `"key":"\0zvalue"`, it got serialized as `key\0\0zvalue`. If you now read this back:

- without escaping: `"key":"\0zvalue"`
- with escaping: `"key\0zvalue":""`

With escaping, de-serialization wouldn't find a value, but it should be possible to create a workaround for this special case. Note that a zero byte as an infix of a value is not a problem, as the code only looks for the first `\0`.
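A minimal Python sketch of Solution A (the real change would live in the Java `FlattenedFieldParser`; the function names here are illustrative):

```python
SEPARATOR = "\0"
ESCAPED = SEPARATOR + SEPARATOR  # \0\0 encodes a literal zero byte in the key

def encode(key, value):
    # escape zero bytes in the key, then append the single-\0 separator
    return key.replace(SEPARATOR, ESCAPED) + SEPARATOR + value

def decode(term):
    # scan for the first \0 that is NOT followed by another \0:
    # \0\0 unescapes to a zero byte in the key, a lone \0 is the separator
    key_parts = []
    i = 0
    while i < len(term):
        if term[i] == SEPARATOR:
            if i + 1 < len(term) and term[i + 1] == SEPARATOR:
                key_parts.append(SEPARATOR)
                i += 2
                continue
            return "".join(key_parts), term[i + 1:]
        key_parts.append(term[i])
        i += 1
    return "".join(key_parts), ""  # no separator found

print(decode(encode("key", "value")))       # ('key', 'value')
print(decode(encode("bad\0key", "value")))  # ('bad\x00key', 'value')

# the corner case: a legacy term written WITHOUT escaping whose value
# starts with a zero byte is misread once escaping-aware decoding is on
legacy = "key" + SEPARATOR + "\0zvalue"
print(decode(legacy))                       # ('key\x00zvalue', '')
```

The last two lines reproduce the corner case above: legacy terms written without escaping whose value starts with a zero byte would be misread once escaping-aware decoding is enabled.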
Solution B
A completely safe solution is to introduce a new version of `flattened` that has proper escaping. The version switch could be pinned to the Lucene version of the index: we start escaping only for indices created after that version and keep failing for old indices. The benefit of this solution is the freedom to change anything in `flattened`. This is obviously a much larger change, and I don't know if there is anything else to be improved in `flattened`.
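A sketch of how the version switch could look, reusing `decode` from the Solution A sketch (the version constant and function name are placeholders, not the actual implementation):

```python
ESCAPING_INTRODUCED = (9, 0, 0)  # placeholder: whichever index version ships the change

def split_keyed_term(term, index_created_version):
    if index_created_version >= ESCAPING_INTRODUCED:
        return decode(term)  # new indices: \0\0 escaping as sketched above
    key, _, value = term.partition("\0")  # old indices: legacy behavior unchanged
    return key, value
```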
Going forward, I think we should create a separate issue for the problems around `flattened`.
I filed https://github.com/elastic/elasticsearch/issues/90311 to discuss removing the restriction in `flattened` fields. We probably won't get the chance to work on it in the near future. Feel free to comment there if you encounter other users struggling with this, so we can gauge the priority.
**Elasticsearch version** (`bin/elasticsearch --version`): 7.x (where Transform jobs & the `flattened` type exist)

**Plugins installed**: []

**JVM version** (`java -version`): Any supported

**Description of the problem including expected versus actual behavior**:

A Transform job fails due to an exception while creating a `flattened` field with keys containing `\0`. We use a `flattened` field to store the results of `terms` aggregations. This is a problem of data quality more than of Transform jobs.

Maybe `flattened` fields should allow any value as a key? It seems `\0` is reserved for flattened (https://github.com/elastic/elasticsearch/blob/73e0662f091809226fe3c3d9374f63b1b96bb2ce/server/src/main/java/org/elasticsearch/index/mapper/flattened/FlattenedFieldParser.java#L30).

**Steps to reproduce**:

1) Index the following documents in the index `demo` with mappings, programmatically with Python (a sketch is given after these steps):
3) Create the following Transform job
4) Wait for failure (visible in the Transform job stats & logs):
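A minimal sketch of step 1 with the 7.x Python client (the index mapping is assumed to exist as described above; the field names and values are illustrative assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# a source document whose keyword value contains the reserved \0 byte;
# a transform that stores a terms aggregation on "tag" in a flattened
# destination field will then try to use "bad\0value" as a flattened
# key and fail with the permanent error quoted earlier in this thread
es.index(index="demo", body={"user": "alice", "tag": "bad\0value"})
```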
How to identify faulty logs: