elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.71k stars 24.67k forks source link

Transform jobs can fail if there's a \0 in fields where we perform terms aggregation (stored in flattened fields) #75875

Open lucabelluccini opened 3 years ago

lucabelluccini commented 3 years ago

Elasticsearch version (bin/elasticsearch --version): 7.x (where Transform Jobs & Flattened type exist)

Plugins installed: []

JVM version (java -version): Any supported

Description of the problem including expected versus actual behavior:

Transform job fails due to an exception while creating a flattened field with keys containing \0. We use a flattened field to store the results of terms aggregations.

This is a problem of data quality more than Transform Jobs.

Maybe flattened fields should allow any value as key? It seems \0 is reserved for Flattened (https://github.com/elastic/elasticsearch/blob/73e0662f091809226fe3c3d9374f63b1b96bb2ce/server/src/main/java/org/elasticsearch/index/mapper/flattened/FlattenedFieldParser.java#L30)

Steps to reproduce:

1) Index the following documents in the index demo with mappings:

curl -XPUT /demo" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date"
      },
      "url": {
        "properties": {
          "domain": {
            "type": "keyword"
          }
        }
      },
      "fortinet": {
        "properties": {
          "firewall": {
            "properties": {
              "sessionid": "long"
            }
          }
        }
      }
    }
  }
}'
curl -XPOST /demo/_doc/ok -H 'Content-Type: application/json' -d'
{
    "@timestamp": "2021-07-27T09:18:46.000Z",
  "url": {
    "domain": "thisisok.com"
  },
  "fortinet": {
    "firewall": {
      "sessionid": 456
    }
  }
}'

curl -XPOST /demo/_doc/faulty -H 'Content-Type: application/json' -d'
{
    "@timestamp": "2021-07-27T08:57:06.000Z",
  "url": {
    "domain": "\u0000something.com"
  },
  "fortinet": {
    "firewall": {
      "sessionid": 123
    }
  }
}'

Programmatically with Python:

import elasticsearch from Elasticsearch
es = Elasticsearch()
doc = {"@timestamp":"2021-07-27T08:57:06.000Z","url":{"domain":b"\0something.com"},"fortinet":{"firewall":{"sessionid":123}}}
es.index('demo', doc)
doc = {"@timestamp":"2021-07-27T08:57:06.000Z","url":{"domain":"thisisok.com"},"fortinet":{"firewall":{"sessionid":456}}}
es.index('demo', doc)

3) Create the following Transform Job

PUT _transform/demo-transform
{
  "source": {
    "index": [
      "demo"
    ]
  },
  "pivot": {
    "group_by": {
      "session_id": {
        "terms": {
          "field": "fortinet.firewall.sessionid",
          "missing_bucket": false
        }
      }
    },
    "aggregations": {
      "timestamp": {
        "min": {
          "field": "@timestamp"
        }
      },
      "urls": {
        "terms": {
          "field": "url.domain",
          "size": 10
        }
      }
    }
  },
  "frequency": "1m",
  "dest": {
    "index": "target-demo-transform"
  },
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "settings": {
    "max_page_search_size": 500
  }
}

4) Wait for failure (visible in the Transform Job stats & logs):

Failed to index documents into destination index due to permanent error: [BulkIndexingException[Bulk index experienced [2] failures and at least 1 irrecoverable [TransformException[Destination index mappings are incompatible with the transform configuration.]; nested: MapperParsingException[failed to parse field [urls] of type [flattened] in document with id 'ACNeIW_bKqym1MNyLOiFKxoAAAAAAAAA'. Preview of field's value: '1']; nested: IllegalArgumentException[Keys in [flattened] fields cannot contain the reserved character \0. Offending key: [something.com].];; MapperParsingException[failed to parse field [urls] of type [flattened] in document with id 'ACNeIW_bKqym1MNyLOiFKxoAAAAAAAAA'. Preview of field's value: '1']; nested: IllegalArgumentException[Keys in [flattened] fields cannot contain the reserved character \0. Offending key: [something.com].];; java.lang.IllegalArgumentException: Keys in [flattened] fields cannot contain the reserved character \0. Offending key: [something.com].]. Other failures: ]; nested: TransformException[Destination index mappings are incompatible with the transform configuration.]; nested: MapperParsingException[failed to parse field [urls] of type [flattened] in document with id 'ACNeIW_bKqym1MNyLOiFKxoAAAAAAAAA'. Preview of field's value: '1']; nested: IllegalArgumentException[Keys in [flattened] fields cannot contain the reserved character \0. Offending key: [something.com].];; TransformException[Destination index mappings are incompatible with the transform configuration.]; nested: MapperParsingException[failed to parse field [urls] of type [flattened] in document with id 'ACNeIW_bKqym1MNyLOiFKxoAAAAAAAAA'. Preview of field's value: '1']; nested: IllegalArgumentException[Keys in [flattened] fields cannot contain the reserved character \0. Offending key: [something.com].];; MapperParsingException[failed to parse field [urls] of type [flattened] in document with id 'ACNeIW_bKqym1MNyLOiFKxoAAAAAAAAA'. Preview of field's value: '1']; nested: IllegalArgumentException[Keys in [flattened] fields cannot contain the reserved character \0. Offending key: [something.com].];; java.lang.IllegalArgumentException: Keys in [flattened] fields cannot contain the reserved character \0. Offending key: [something.com].]

How to identify faulty logs:

GET demo/_search
{"query":{"bool":{"filter":[{"exists":{"field":"url.domain"}},{"script":{"script":"if (doc.containsKey('url.domain') && doc['url.domain'].size() > 0) { return doc['url.domain'].value.indexOf('\u0000') != -1 } else return false;"}}]}}}
lucabelluccini commented 3 years ago

As a side node:

1) ~It is no more possible to stop the transform job if it enters in such state.~ You have to force stop it with _stop?force

{
  "error" : {
    "root_cause" : [
      {
        "type" : "status_exception",
        "reason" : "Unable to stop transform [...] as it is in a failed state with reason [Failed to index documents into destination index due to permanent error:

2) ~You cannot delete the transform as it is not stopped.~ You can force delete it with the ?force parameter

elasticmachine commented 3 years ago

Pinging @elastic/es-analytics-geo (Team:Analytics)

elasticmachine commented 3 years ago

Pinging @elastic/ml-core (Team:ML)

hendrikmuhs commented 3 years ago

Adding the transform label, too. If flattened by design disallows \0 transform should sanitize the output.

wchaparro commented 2 years ago

This shouldn't affect Aggs, perhaps @jtibshirani might be interested as it relates to flattened fields?

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

nik9000 commented 2 years ago

Do we support if you put a bell character in the field? Or a vertical tab? What about smart quotes?

hendrikmuhs commented 2 years ago

Do we support if you put a bell character in the field? Or a vertical tab? What about smart quotes?

For this issue see, only \0 is reserved, everything else is just handled as bytes ref / string.

However before that is called the string is parsed using a XContent parser, the parser has the following restrictions:

These are the restrictions for ordinary field names, however it is wrong that we apply those restrictions to flattened, too. We shouldn't use this parser, but one without those checks. This is technically a different issue, see #90011. Nevertheless we should fix both issues.

Possible fix

What's missing is proper escaping. This is a problem in the initial design, if escaping is added it must go in without breaking backwards compatibility.

Solution A

I think \0 can be escaped as \0\0, that means a single \0 is interpreted as a separator, but \0\0 means a zero byte in the key. This should work for all existing indexed data as obviously no data exists with a zero byte in the key. But there is 1 corner case that could exist although highly unlikely: if a value starts with a zero byte it gets read differently with escaping:

If the initial doc contains: "key":"\0zvalue" it got serialized as key\0\0zvalue

If you now read this back:

without escaping: "key":"\0zvalue" with escaping: "key\0zvalue":""

With escaping de-serialization wouldn't find a value, but it should be possible to create a workaround for this special case. Note that a zero byte infix in a value is not a problem as the code only looks for the first \0.

Solution B

A completely safe solution is to introduce a new version of flattened that has proper escaping. The version switch could be pinned to a lucene version of the index. We start escaping only for indices after that and keep failing for old indices. The benefit of this solution is the freedom to change anything on flattened. This obviously a much larger change and I don't know if there is anything else to be improved in flattened.

hendrikmuhs commented 2 years ago

As for going forward: I think we should create a separate issue for the problems around flattened.

jtibshirani commented 2 years ago

I filed https://github.com/elastic/elasticsearch/issues/90311 to discuss removing the restriction in flattened fields. We probably won't get the chance to work on it in the near future. Feel free to comment there if you encounter other users struggling with this, so we can gauge the priority.