Closed jgowdyelastic closed 10 months ago
Pinging @elastic/ml-ui (:ml)
This is almost certainly because we default to the `ml_standard` tokenizer for categorization, but most text fields use the `standard` tokenizer for search.
Try using the analyze API to compare the tokenization of `"foo:bar"` with those two analyzers. I suspect `ml_standard` will split that but `standard` won't.
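A minimal comparison in Kibana Dev Tools might look like this (`_analyze` accepts an inline tokenizer, so no index is needed):

```
GET _analyze
{
  "tokenizer": "ml_standard",
  "text": "foo:bar"
}

GET _analyze
{
  "tokenizer": "standard",
  "text": "foo:bar"
}
```

If the suspicion is right, the first request returns two tokens (`foo`, `bar`) while the second returns the single token `foo:bar`.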
Possibly we should specify the same tokenizer as the text field is using when doing ad hoc categorization in the UI. We had to use `ml_standard` for categorization jobs, because that's what the ML C++ was hardcoded to use in the days when we did tokenization in C++. Then we made `ml_standard` the default for the `categorize_text` aggregation too, so both ways of doing categorization matched. But for perfect round-tripping the same tokenizer needs to be used everywhere.

So try specifying this when calling `categorize_text`:
```json
"categorization_analyzer": {
  "char_filter": [
    "first_line_with_letters"
  ],
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": [
        "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
        "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun",
        "January", "February", "March", "April", "May", "June",
        "July", "August", "September", "October", "November", "December",
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
        "GMT", "UTC"
      ]
    },
    {
      "type": "limit",
      "max_token_count": "100"
    }
  ]
}
```
(Note `ml_standard` -> `standard` compared to the default one.)
If that solves it then ideally we'd map `ml_standard` to whatever analyzer is specified in the mappings for the source field, instead of unconditionally using `standard`.
> If that solves it

That does indeed solve it. A token like `foo:bar` is not split.

> ideally we'd map ml_standard to whatever the analyzer specified in the mappings is for the source field instead of unconditionally using standard.
Do you think this is something that could be done for 8.12.0?

If not, do you think it's worth the UI supplying this `categorization_analyzer` for the pattern analysis feature, to ensure the filters it creates work with the majority of indices?
I've created a PR to apply this change in the UI. Currently set as draft in case we don't want to use this approach.
The category filter created and applied to Discover is made up of the category key. This key is a list of the common words in all docs which match a category.
For example, for the text:
The category key is:
In order to create a filter which will match as strictly as possible, we have to use some additional parameters in the `match` query, e.g.

It appears this assumes a space character between words; however, the way the key is generated, a non-space character could be used. A good example of this is a colon character `:`, e.g. `foo:bar` will produce the key `foo bar`, but using the search above, `foo bar` will not match `foo:bar`.
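Assuming the source field uses the `standard` analyzer (which keeps `foo:bar` as a single token), the mismatch can be sketched like this; the index name `test-categorization` and the bare `operator: AND` are illustrative, not the exact query Kibana builds:

```
PUT test-categorization/_doc/1
{
  "message": "foo:bar"
}

GET test-categorization/_search
{
  "query": {
    "match": {
      "message": {
        "query": "foo bar",
        "operator": "AND"
      }
    }
  }
}
```

The search analyzes `foo bar` into the tokens `foo` and `bar`, neither of which matches the single indexed token `foo:bar`, so the query returns no hits.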
Below is a real example:
Message value:
Key:
Query:
The problem here is `protocol:op_msg`, which is turned into `protocol op_msg`, and so no docs are matched when added to a filter in Discover. In the example below, editing the filter to replace the space with a `:` fixes the filter and produces matched documents.

https://github.com/elastic/kibana/assets/22172091/d0a17c92-11a1-4e04-9158-c562df5fc964