
[ML] Add structured tags to ML anomaly data points to make it possible to query for them #67180

Open sorenlouv opened 4 years ago

sorenlouv commented 4 years ago

Currently it is only possible to query for anomaly data points by job_id. The problem with the job_id is that it's not easy to query for specific attributes; mostly we have to parse the job_id on the client to determine which service or transaction type a data point represents.

Example: A job id might be opbeans-node-request-high_mean_response_time. We can write a helper function that extracts the service name (opbeans-node) and transaction type (request). But a job could also span all transaction types, in which case the transaction type is not part of the id: opbeans-node-high_mean_response_time. Additionally, we are soon going to add support for jobs per environment: opbeans-node-production-high_mean_response_time (where "production" is the environment). This makes parsing the job_id fragile.
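For illustration, a results query today has to reference the concrete job_id (the id below is just an example), and the service name and transaction type then have to be parsed back out of that id on the client:

GET .ml-anomalies-*/_search

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "result_type": "record" } },
        { "term": { "job_id": "opbeans-node-request-high_mean_response_time" } }
      ]
    }
  }
}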

Instead I propose that ML anomaly data points should contain user-defined tags. This is how I'd like to be able to query for anomaly data:

Get anomaly data:

GET .ml-anomalies-*/_search

{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "result_type": "record" } },
        { "term": { "service.name": "opbeans-node" } },
        { "term": { "service.environment": "production" } },
        { "term": { "transaction.type": "request" } }
      ]
    }
  }
}

Create ML job

This is how I propose the API for creating an ML job should look:

POST /api/ml/modules/setup/apm_transaction

{
  index: 'apm-*',
  tags: {
    "service.name": "opbeans-node",
    "service.environment": "production",
    "transaction.type": "request"
  },
  startDatafeed: true,
  query: {
    bool: {
      filter: {}
    }
  }
}
elasticmachine commented 4 years ago

Pinging @elastic/ml-ui (:ml)

jgowdyelastic commented 4 years ago

It looks like the proposed change would need to go in the job config and so should be an elasticsearch issue. @droberts195 would you agree? One possible way to implement this would be to add this tags section to custom_settings, which can contain job metadata.

droberts195 commented 4 years ago

It looks like the proposed change would need to go in the job config and so should be an elasticsearch issue.

Yes, certainly part of the request is on the Elasticsearch side. It's asking for extra fields in every result written by the anomaly detector.

There is another side to this though, which is that once we complete the "ML in spaces" project it won't be desirable for Kibana apps to search the ML results index directly; instead they should go through APIs in the ML UI. In the example of searching results by tag, no job ID is specified. So that implies the ML UI would need to provide a space-aware results endpoint that could search for results by tag while taking into account which jobs are visible in the current space.

So this functionality is non-trivial both on the Elasticsearch side and the Kibana side.

droberts195 commented 4 years ago

Maybe job groups could achieve what is required here. It's getting late in my day, but another day we should think through more carefully how job groups could be used before adding new functionality that does something quite similar. If the job groups feature doesn't work as it stands, it may be better to meet this requirement by enhancing job groups rather than adding new overlapping functionality and then having someone in the future ask why we have both tags and job groups.

droberts195 commented 4 years ago

We discussed this on a Zoom call.

It turns out there shouldn't be a need to aggregate different values of service.environment in the same query - it doesn't make sense to combine results from testing and production for example. So it's OK that there are separate jobs per environment whose results cannot easily be aggregated.

We already write a job's "by" and "partition" field values into its results. Therefore we agreed the requirement can be met by configuring "by_field_name" : "transaction.type" and "partition_field_name" : "service.name" for every detector in each job.

It will then be possible to do terms aggregations or terms filtering on documents with "result_type" : "record" using the fields service.name and transaction.type, which will be present in such documents.
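As a rough sketch of what that could look like on a detector (the job id, bucket span, analysed field and influencers below are illustrative, not part of what was agreed):

PUT _ml/anomaly_detectors/apm-request-high_mean_response_time
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "high_mean",
        "field_name": "transaction.duration.us",
        "by_field_name": "transaction.type",
        "partition_field_name": "service.name"
      }
    ],
    "influencers": [ "service.name", "transaction.type" ]
  },
  "data_description": { "time_field": "@timestamp" }
}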

sorenlouv commented 4 years ago

@droberts195 Is there any difference between:

"by_field_name" : "transaction.type" and "partition_field_name" : "service.name" 

vs

"by_field_name" : "service.name" and "partition_field_name" : "transaction.type" 

Asking because by_field_name and partition_field_name seem very similar to me. I'd expect to define it like:

"dimensions": ["service.name", "transaction.type" ]
sophiec20 commented 4 years ago

@sqren By and partition fields behave differently in how the results are aggregated up the results hierarchy.

With the by field, if multiple values are anomalous at the same time then the overall bucket is considered more anomalous. With the partition field we consider individual behaviours. So, use by_field_name if the values are (sometimes) related, use the partition_field_name if they are individual.

Based on what we know about your data, the following config makes more sense:

"by_field_name" : "transaction.type" and "partition_field_name" : "service.name" 
sorenlouv commented 4 years ago

Thanks for the background @sophiec20. I still have a few questions - please bear with me :)

So, use by_field_name if the values are (sometimes) related, use the partition_field_name if they are individual.

Based on what we know about your data, the following config makes more sense:

"by_field_name" : "transaction.type" and "partition_field_name" : "service.name"

transaction.type and service.name are dimensions in the composite aggregation but are not themselves indicative of anomalies (transaction.duration.us is the anomalous part). So I don't understand why transaction.type goes into by_field_name and service.name into partition_field_name. I think of them as sibling/equivalent dimensions.

So something like this would intuitively make more sense to me:

by_field_name: ["service.name", "transaction.type"]

Based on what we know about your data, the following config makes more sense:

Is this opbeans data, or APM data in general? Just wondering if we are optimizing for sample data instead of real customer data.

sophiec20 commented 4 years ago

The anomaly detection modelling is complex (see https://github.com/elastic/ml-cpp), so fundamental changes to the way jobs are configured and data is modelled are not trivial. It is not a visualisation of an aggregation, and there are significant backwards-compatibility (bwc) implications for both the modelling and the Elasticsearch APIs.

Some bulk APM data was made available to @blaklaybul last week and we are now working through the prototypes for job configurations as we've discussed above. It is always preferable to optimise against real customer data, provided this usage of the data is permitted. We are working with the data provided to us.

Once these prototype job configurations are ready, we can walk through and explain the results against data examples and show how this can support the stated requirement regarding labelled results.

sorenlouv commented 4 years ago

Okay, I just want to make sure we are on the same page.

What we are interested in is very much the same behaviour we get today by starting individual jobs. To simplify the experience for users, it would be beneficial if we could start a single job where anomalies are separated by a number of dimensions (service.name, transaction.type and service.environment).

Do you see by_field_name and partition_field_name as temporary workarounds or as the permanent solution towards this goal?

sorenlouv commented 4 years ago

The prototype @blaklaybul has made goes a long way by promoting service.name and transaction.type to first-class fields (via by_field_name and partition_field_name). These are added to the ML job and are propagated to the ML results, which is great! We were, however, not able to find a similar solution for service.environment, and we still need to be able to query for ML jobs that belong to a particular environment.

We've briefly talked about adding service.environment to ML jobs as a job group. We hoped this would allow us to retrieve jobs by environment, but there are two problems with job groups:

Limited character set: According to the ML docs, job groups may only contain "lowercase alphanumeric characters (a-z and 0-9), hyphens, and underscores. It must start and end with alphanumeric characters". Since we don't have a similar restriction for environment, we have to encode it before storing it as a job group. We can't use standard encodings like URL or base64 encoding, since they both require additional characters to be supported (%, =, uppercase letters etc). Instead we must create a custom conversion, like lowercasing all letters and removing special characters. This is a lossy, irreversible operation that makes it impossible to retrieve the original value from the job group. Additionally, it creates the risk of naming conflicts. Example: if two services have the same name but different casing, they'll be converted to the same value.

User editable: If a user removes or edits a job group, the integration with APM will break. This will surprise users, so we should avoid it. Job groups are user-facing and don't come with any warning that editing them might break integrations. This is understandable, since job groups were not made for the purpose we are using them for. In short: using job groups for metadata is both complex and unreliable.

Suggestion: Instead of storing metadata that we want to query for as job groups, I suggest something like the "system tags" that alerting is also looking into. System tags are similar to user-facing tags (job groups), except they do not restrict the character set any more than Elasticsearch does, and they will not be editable (and perhaps not even displayed) in the UI.
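To make the idea concrete, here is a purely hypothetical sketch; a system_tags field does not exist today, and the name and placement are made up for illustration. The tags would live on the job config and be copied into every result document so they can be queried directly:

PUT _ml/anomaly_detectors/opbeans-node-production-high_mean_response_time
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [ { "function": "high_mean", "field_name": "transaction.duration.us" } ]
  },
  "data_description": { "time_field": "@timestamp" },
  "system_tags": {
    "service.name": "opbeans-node",
    "service.environment": "production",
    "transaction.type": "request"
  }
}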

Timeline: The plan is still to ship the new ML job in 7.9, but we'll need to find a way to retrieve ML jobs by service.environment somehow. We could ignore the drawbacks listed above and use job groups, but this would introduce complexity and make the integration fragile, and we wouldn't be able to easily migrate to system tags should they become available at a later stage. Having system tags available in 7.9 is therefore a high priority for APM.

ogupte commented 4 years ago

~Another reason something like system tags would be beneficial: since they are indexed, any filtering can be done in ES. Right now in order to implement something similar with job groups, you have to fetch all ML jobs, and do any filtering/matching of group strings in app code.~

My mistake, I'm thinking of something else.

peteharverson commented 4 years ago

For 7.9 APM will use the existing custom_settings field in the ML job to tag the environment, by passing a jobOverrides parameter to the ML modules setup function, in the form:

jobOverrides: [
  {
    custom_settings: {
      job_tags: { environment },
    },
  },
],

This custom_settings field can then be used on the Kibana side to filter by environment as required.
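For reference, this is roughly how the tag ends up on the Elasticsearch job object; custom_settings can also be set or amended directly via the update job API (the job id below is illustrative):

POST _ml/anomaly_detectors/opbeans-node-production-high_mean_response_time/_update
{
  "custom_settings": {
    "job_tags": { "environment": "production" }
  }
}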

However, it is acknowledged that this solution is not ideal, so as part of the ongoing project to make ML jobs space-aware, work will start in 7.10 to store ML jobs as Kibana saved objects, which will allow us to store metadata, such as 'system tags', as part of the saved object. This has the advantages of:

sorenlouv commented 4 years ago

This sounds great @peteharverson! I'll add that in addition to storing metadata in custom_settings we also store it in groups. Have you thought more about the migration from custom_settings and groups to the saved objects? Would this happen automatically when the user upgrades?

peteharverson commented 4 years ago

Good question @sqren. Yes, we are planning to add a number of checks around Spaces / saved objects on start-up when the user upgrades, and this should definitely include checking for the job_tags that APM are using in 7.9 so that they can be added to the saved object metadata.

richcollier commented 4 years ago

^^ My ER from 3 years ago now has a chance! :)

peteharverson commented 3 years ago

@sqren with ML jobs being made space-aware from 7.11, we are now creating saved objects to act as wrappers around the Elasticsearch job object.

I wondered if the Tags functionality for Kibana saved objects might be a way to meet your requirements, but on first look I am thinking it won't be sufficient for your use case here, as it only allows a name to be attached to a saved object, e.g. production, whereas you want to be able to attach tags which comprise a field name and value, e.g. service.name: opbeans-node and service.environment: production. Is this view correct, in that using 'name only' tags such as service.name-opbeans-node would not be suitable, as in theory there might be tens or hundreds of possible service.name values, for example? Plus it looks like you don't want these tags to be user-facing or user-editable (which the saved object tags are).

If the saved object tags don't look like a solution here, we can investigate adding a job_tags property to the ML job saved objects, such as

job_tags: {
  "service.name": "opbeans-node",
  "service.environment": "production",
  "transaction.type": "request",
  "apm_ml_version": 2
}

with this replacing your current use of the custom_settings property (which is stored on the Elasticsearch job object). Saved object filtering could then be used to search for jobs by these tags.
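Purely as a sketch of the kind of filtering that could enable (the ml-job saved object type name and the job_tags attribute being filterable are assumptions here, and the query string would need URL-encoding):

GET /api/saved_objects/_find?type=ml-job&filter=ml-job.attributes.job_tags.service.environment:"production"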

Would appreciate your thoughts @sqren on whether you think the Kibana saved object tags would be suitable for your use case, or if you think adding a new job_tags property to the ML job saved objects would better suit your needs, or if you would prefer to keep with the existing custom_settings approach for now.

dgieselaar commented 3 years ago

@peteharverson I can't answer for @sqren, but IMHO it would be ideal if we had tags on anomalies, not just jobs. I'm looking into an issue where we are seeing the ML calls slow down a request to 7s (from about 1.5s without them). One reason for this is that we do a capabilities call, then one to get the job ids, then another one to get the anomaly data for those jobs. Ideally we can just do one request: get all the anomaly data with tag "service.name:opbeans-java".

droberts195 commented 3 years ago

Ideally we can just do one request: get all the anomaly data with tag "service.name:opbeans-java".

The problem is that the ML jobs are now space-aware, so every call needs to be checked against the space(s) that the job is in.

ML calls slowing down a request to 7s (from about 1.5s without it)

Is there a breakdown of how much of that 7s goes on each call? Maybe there is an inefficiency that can be addressed in a different way than adding tags. I think the first step in deciding what to do is to break that 7s down between the 3 ML API calls, and then again between the underlying APIs that those 3 ML APIs are calling, and look for opportunities for efficiencies.

Although it would be possible to add tags to the ML jobs that get copied into every single ML result, it would be a large piece of work because it would affect all of the different ML result classes in the Java code. And with the "ML in Spaces" project, it's important to realise that this wouldn't allow results to be retrieved simply by searching the ML results index, because all ML APIs now need to be checked against the job's space membership. So we should start by making sure we understand where exactly the time is going today.

dgieselaar commented 3 years ago

The problem is that the ML jobs are now space-aware, so every call needs to be checked against the space(s) that the job is in.

Why does it need to be checked? AFAICT, spaces are for organising things rather than securing things. Is that incorrect?

Is there a breakdown of how much of that 7s goes on each call?

I don't have that breakdown yet. But I'll send you a link on Slack to a screenshot (I haven't scrubbed out potentially sensitive information).