elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[ML] AIOPs - Discover filter can fail to match any documents #169523

Closed — jgowdyelastic closed this issue 10 months ago

jgowdyelastic commented 1 year ago

The category filter created and applied to Discover is made up of the category key. This key is a list of the words common to all docs which match a category.

For example, for the text:

connection accepted from 190.88.106.154:67073 #1 (1 connection now open)

The category key is:

connection accepted from connection now open

In order to create a filter which will match as strictly as possible, we have to use some additional parameters in the match query, e.g.

{
  "bool": {
    "should": [
      {
        "match": {
          "message": {
            "auto_generate_synonyms_phrase_query": false,
            "fuzziness": 0,
            "operator": "and",
            "query": "connection accepted from connection now open"
          }
        }
      }
    ]
  }
}

It appears this assumes a space character between words. However, because of the way the key is generated, the words may have been separated by a non-space character in the original text. A good example of this is the colon character :. The text foo:bar will produce the key foo bar, but using the search above, the query foo bar will not match foo:bar.

Below is a real example:

Message value:

command adminConsole.users command: update { update: { _id: ObjectId("b9ac648f42ae1e3a90ea") }, updateObj: { $set: { _id: ObjectId("fa9b766601e1d5a230b7"), country: "Madagascar" } }, writeConcern: { w: "majority", wtimeout: 5000 }, lsid: { id: UUID("75092c4c-87ba-4639-aeea-34189a764ab1") } } numYields:0 reslen:85 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } } } storage:{ data: { bytesWritten: 126 } } protocol:op_msg 0ms

Key:

command adminConsole.users command update update id ObjectId updateObj set id ObjectId country writeConcern w majority wtimeout lsid id UUID numYields reslen locks Global acquireCount r w Database acquireCount w Collection acquireCount w storage data bytesWritten protocol op_msg

Query:

{
  "bool": {
    "should": [
      {
        "match": {
          "message": {
            "auto_generate_synonyms_phrase_query": false,
            "fuzziness": 0,
            "operator": "and",
            "query": "command adminConsole.users command update update id ObjectId updateObj set id ObjectId country writeConcern w majority wtimeout lsid id UUID numYields reslen locks Global acquireCount r w Database acquireCount w Collection acquireCount w storage data bytesWritten protocol op_msg"
          }
        }
      }
    ]
  }
}

The problem here is protocol:op_msg, which is turned into protocol op_msg, so no docs are matched when this query is added as a filter in Discover.

In the video below, editing the filter to replace the space with a : fixes the filter and produces matched documents.

https://github.com/elastic/kibana/assets/22172091/d0a17c92-11a1-4e04-9158-c562df5fc964

elasticmachine commented 1 year ago

Pinging @elastic/ml-ui (:ml)

droberts195 commented 10 months ago

This is almost certainly because we default to the ml_standard tokenizer for categorization, but most text fields use the standard tokenizer for search.

Try using the analyze API to compare the tokenization of "foo:bar" with those two analyzers. I suspect ml_standard will split that but standard won't.
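A sketch of that comparison, as two _analyze requests in Dev Tools syntax (assuming the ml_standard tokenizer is available via _analyze, as it is on recent stacks; foo:bar is the example token from above):

```json
GET _analyze
{
  "tokenizer": "ml_standard",
  "text": "foo:bar"
}

GET _analyze
{
  "tokenizer": "standard",
  "text": "foo:bar"
}
```

If the suspicion above holds, the first request returns the two tokens foo and bar, while the second returns the single token foo:bar, which would explain why the round-tripped key no longer matches the source document.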

Possibly we should specify the same tokenizer that the text field uses when doing ad hoc categorization in the UI. We had to use ml_standard for categorization jobs, because that's what the ML C++ was hardcoded to use in the days when we did tokenization in C++. We then made ml_standard the default for the categorize_text aggregation too, so that both ways of doing categorization matched. But for perfect round-tripping the same tokenizer needs to be used everywhere.

So try specifying this when calling categorize_text:

      "categorization_analyzer" : {
        "char_filter" : [
          "first_line_with_letters"
        ],
        "tokenizer" : "standard",
        "filter" : [
          {
            "type" : "stop",
            "stopwords" : [
              "Monday",
              "Tuesday",
              "Wednesday",
              "Thursday",
              "Friday",
              "Saturday",
              "Sunday",
              "Mon",
              "Tue",
              "Wed",
              "Thu",
              "Fri",
              "Sat",
              "Sun",
              "January",
              "February",
              "March",
              "April",
              "May",
              "June",
              "July",
              "August",
              "September",
              "October",
              "November",
              "December",
              "Jan",
              "Feb",
              "Mar",
              "Apr",
              "May",
              "Jun",
              "Jul",
              "Aug",
              "Sep",
              "Oct",
              "Nov",
              "Dec",
              "GMT",
              "UTC"
            ]
          },
          {
            "type": "limit",
            "max_token_count": "100"
          }
        ]
      }

(Note: the tokenizer here is standard, rather than the ml_standard used in the default categorization analyzer.)
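For context, a fragment like the one above is supplied as the categorization_analyzer parameter of a categorize_text aggregation, along these lines (the index and field names here are placeholders, and the day/month stop filter from above is elided for brevity):

```json
POST my-logs/_search
{
  "size": 0,
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",
        "categorization_analyzer": {
          "char_filter": [ "first_line_with_letters" ],
          "tokenizer": "standard",
          "filter": [
            { "type": "limit", "max_token_count": "100" }
          ]
        }
      }
    }
  }
}
```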

If that solves it then ideally we'd map ml_standard to whatever the analyzer specified in the mappings is for the source field instead of unconditionally using standard.

jgowdyelastic commented 10 months ago

> If that solves it

That does indeed solve it. A token like foo:bar is not split.

> ideally we'd map ml_standard to whatever the analyzer specified in the mappings is for the source field instead of unconditionally using standard.

Do you think this is something that could be done for 8.12.0?

If not, do you think it's worth the UI supplying this categorization_analyzer for the pattern analysis feature, to ensure the filters it creates work with the majority of indices?

jgowdyelastic commented 10 months ago

I've created a PR to apply this change in the UI. It's currently set as a draft in case we don't want to use this approach.