elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] Make anomaly detection jobs compatible with "subobjects" : false #88379

Open droberts195 opened 2 years ago

droberts195 commented 2 years ago

#86166 added the option for object fields in mappings to have a subobjects : false setting. This in turn allows field names with dots to be nested inside the object, without the usual object/scalar clashes that arise when one scalar field's name is a dotted prefix of another's (for example responsetime and responsetime.min).

For example, subobjects : false makes the following document possible:

{
  "@timestamp" : "2022-07-08T13:23:39",
  "metrics" : {
    "responsetime" : 100, 
    "responsetime.min" : 10,
    "responsetime.max" : 900
  }
}

The mappings for such a document could look like this:

{
  "metrics1": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "metrics": {
          "subobjects": false,
          "properties": {
            "responsetime": {
              "type": "double"
            },
            "responsetime.max": {
              "type": "double"
            },
            "responsetime.min": {
              "type": "double"
            }
          }
        }
      }
    }
  }
}

Historically it would have been possible to store the document, but only by completely disabling mappings for the metrics object. With subobjects : false the dotted fields under metrics can all have mappings and participate in searches and aggregations.
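To illustrate the clash that subobjects : false avoids, here is a minimal Python sketch of what happens when dotted keys are expanded into nested objects. It is a simplified model for illustration only, not Elasticsearch's mapper; the function name expand_dotted is hypothetical.

```python
def expand_dotted(flat):
    """Expand dotted keys into nested dicts (simplified model of default behaviour)."""
    nested = {}
    for key, value in flat.items():
        parts = key.split(".")
        node = nested
        # Walk or create the intermediate objects for every component but the last.
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"[{part}] is already a scalar, cannot hold [{key}]")
            node = child
        leaf = parts[-1]
        if isinstance(node.get(leaf), dict):
            raise ValueError(f"[{key}] is already an object, cannot also be a scalar")
        node[leaf] = value
    return nested


# Fields without a shared scalar prefix expand cleanly.
print(expand_dotted({"responsetime.min": 10, "responsetime.max": 900}))

# Adding the scalar "responsetime" alongside them produces the clash.
try:
    expand_dotted({"responsetime": 100, "responsetime.min": 10})
except ValueError as e:
    print("clash:", e)
```

With subobjects : false the keys are kept as-is rather than expanded, so this conflict does not arise.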

It is currently possible to create a job that analyses all these fields as the field_name of detector functions.
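As a rough sketch of such a job, the Python snippet below builds a request body with one detector per metric field. The mean function and the 15m bucket span are assumptions chosen for illustration, and build_job_body is a hypothetical helper; the body would be sent to the create anomaly detection job endpoint.

```python
import json

# Full paths of the dotted metric fields from the metrics1 mappings.
METRIC_FIELDS = [
    "metrics.responsetime",
    "metrics.responsetime.min",
    "metrics.responsetime.max",
]


def build_job_body(metric_fields, bucket_span="15m"):
    """Build a job body with one detector per metric field (function is an assumption)."""
    return {
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [
                {"function": "mean", "field_name": field} for field in metric_fields
            ],
        },
        "data_description": {"time_field": "@timestamp"},
    }


job_body = build_job_body(METRIC_FIELDS)
print(json.dumps(job_body, indent=2))
```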

But suppose instead we also have dotted fields that we want to use as split fields for our job, for example:

{
  "metrics2": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "attributes": {
          "subobjects": false,
          "properties": {
            "service": {
              "type": "keyword"
            },
            "service.administrator": {
              "type": "keyword"
            },
            "service.category": {
              "type": "keyword"
            }
          }
        },
        "metrics": {
          "subobjects": false,
          "properties": {
            "responsetime": {
              "type": "double"
            },
            "responsetime.max": {
              "type": "double"
            },
            "responsetime.min": {
              "type": "double"
            }
          }
        }
      }
    }
  }
}

Now creation of the job fails if we try to reference fields under attributes where one name is a dotted prefix of another, for example:

{
  "statusCode": 400,
  "error": "Bad Request",
  "message": "[x_content_parse_exception: [status_exception] Reason: Fields [attributes.service] and [attributes.service.administrator] cannot both be used in the same analysis_config]: [1:359] [cluster:admin/xpack/ml/job/estimate_model_memory] failed to parse field [analysis_config]",
  "attributes": {
    "body": {
      "error": {
        "root_cause": [
          {
            "type": "status_exception",
            "reason": "Fields [attributes.service] and [attributes.service.administrator] cannot both be used in the same analysis_config"
          }
        ],
        "type": "x_content_parse_exception",
        "reason": "[1:359] [cluster:admin/xpack/ml/job/estimate_model_memory] failed to parse field [analysis_config]",
        "caused_by": {
          "type": "status_exception",
          "reason": "Fields [attributes.service] and [attributes.service.administrator] cannot both be used in the same analysis_config"
        }
      },
      "status": 400
    }
  }
}

The reason we prevent this is to make it possible to include the fields in our anomaly records.
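The kind of check that produces this error can be sketched in Python as follows. This is an illustration of the idea (one field name being a dotted prefix of another), not the actual Java validation in Elasticsearch; find_prefix_clashes is a hypothetical name.

```python
def find_prefix_clashes(fields):
    """Return (shorter, longer) pairs where shorter + '.' is a prefix of longer."""
    ordered = sorted(set(fields))
    clashes = []
    for i, shorter in enumerate(ordered):
        for longer in ordered[i + 1:]:
            if longer.startswith(shorter + "."):
                clashes.append((shorter, longer))
    return clashes


# The pair from the error message above clashes.
print(find_prefix_clashes(["attributes.service", "attributes.service.administrator"]))

# Sibling dotted fields do not clash with each other under this rule.
print(find_prefix_clashes(["attributes.service.administrator", "attributes.service.category"]))
```

Note that the comparison appends a dot before testing the prefix, so a name such as attributes.services would not be treated as nested under attributes.service.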

Alternatively, we could allow jobs to be created with fields like this and change the mappings on our results indices. However, there is a problem: because results indices can be shared, the results index may already exist with mappings that are incompatible with specifying subobjects : false in the results mappings.

It's tricky to incorporate this validation at the parsing stage, as the parser cannot be expected to check the mappings on an existing index.

We have two options:

  1. Change nothing - subobjects : false will work with anomaly detection jobs if the dotted fields are used as metrics, and this was the intended use case as seen in the PR title of #86166.
  2. Change our analysis_config parser to permit field names that would clash in the results if adding subobjects : false as a results mapping is not possible. Then fail when actually creating the job if creating our desired mappings is not possible. There is already a precedent for failing at this time - if the latest job would push the number of mapped fields in the shared results index over 1000 we fail the job creation at the point of modifying the results index.
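Option 2 could look roughly like the Python sketch below: the parser no longer rejects clashing names, and the decision is deferred to job creation, when the results index mappings can be inspected. Everything here is hypothetical, including the function names, the JobCreationError type, and the assumption that the setting would sit at the root of the results mapping.

```python
class JobCreationError(Exception):
    """Hypothetical error raised when the results index cannot accept the job."""


def dotted_prefix_clashes(fields):
    """Return pairs of field names where one is a dotted prefix of the other."""
    ordered = sorted(set(fields))
    return [
        (a, b)
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
        if b.startswith(a + ".")
    ]


def check_at_creation(term_fields, existing_results_mapping):
    """Decide at job creation time whether clashing fields can be stored.

    existing_results_mapping is None when the results index does not exist yet,
    otherwise a dict of its root mapping (assumed shape, for illustration).
    """
    clashes = dotted_prefix_clashes(term_fields)
    if not clashes:
        return "no special mapping needed"
    if existing_results_mapping is None:
        return "create results index with subobjects: false"
    if existing_results_mapping.get("subobjects") is False:
        return "reuse existing results index"
    shorter, longer = clashes[0]
    raise JobCreationError(
        f"Fields [{shorter}] and [{longer}] cannot both be stored in the existing results index"
    )


fields = ["attributes.service", "attributes.service.administrator"]
print(check_at_creation(fields, None))
print(check_at_creation(fields, {"subobjects": False, "properties": {}}))
```

The failure branch mirrors the existing precedent of rejecting a job when the results index is modified, rather than at parse time.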

elasticmachine commented 2 years ago

Pinging @elastic/ml-core (Team:ML)

droberts195 commented 1 year ago

https://github.com/elastic/elasticsearch/issues/88934 is likely to increase adoption of "subobjects" : false.