elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.8k stars 8.19k forks source link

[Observability] [AAD] Streamline the method of saving group information in alert document #183248

Open benakansara opened 5 months ago

benakansara commented 5 months ago

Currently we have kibana.alert.instance.id in all alerts that saves comma separated group values in the alert document. We would like to have a field that provides information in the form of {field, value} pair, and allows for individual {field, value} to be searchable/queryable in the alert document. The requirement of this field is discussed in the RFC here.

Based on the discussion in above RFC, the Custom threshold rule saves group information in AAD with kibana.alert.group field which is an array of { field: field-name, value: field-value }.

We need to streamline the method of saving group information in AAD across all Observability rules.

Use cases

Rules where group info should be saved in its dedicated field in alert document

Acceptance criteria

elasticmachine commented 5 months ago

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

benakansara commented 5 months ago

We have a different set of context variables for "Summary of alerts" action frequency. We don't have separate context.group or context.groupByKeys context variables. Instead we can access all AAD fields when creating action message. This would be one of the use cases for using group info from AAD. I used alerts.all.data variable to build alert action message. In this case, we need to rely on the index if AAD field has an array like structure.

{{#alerts.all.data}}
Host name: {{kibana.alert.group.0.value}}
Container ID: {{kibana.alert.group.1.value}}
{{/alerts.all.data}}
benakansara commented 5 months ago

We can introduce two fields - one for search use case, one for iterating over.

jasonrhodes commented 4 months ago

~The only problem I see there is I think kibana.alert.group is already used in some rule types, as a string -- other than that I am +1 on this idea, generally.~

Update: @benakansara corrected me that this is only the case in context.*, not at the alerts as data level.

maryam-saeidi commented 2 weeks ago

The current state of observability rules for the following two fields:

Rule type kibana.alert.group
APM rule No group by fields
Inventory No group by fields, has a predefined list of fields that can be selected from
Metric threshold
Custom threshold
Log threshold
SLO burn rate
ES Query
Anomaly detection Didn't see a group field there, maybe we should check what fields exist in an anomaly job
benakansara commented 2 weeks ago

As described by @maryam-saeidi in above comment, atm we are storing group info in kibana.alert.group as an array in some of the rules.

Using this field in search could result in false positives, as seen in one of the examples below.

If user filters alerts with kibana.alert.group.field: "service.name" and kibana.alert.group.value: *product* KQL filter on alerts search bar, they would expect to see services with only "product" in their name, but this would return anything that matches "product" in any of the kibana.alert.group.value values in the document. In below example using otel-demo data, I created a rule with group by on service.name and transaction.name, and using above filter returns services without "product" in their name because some transactions have "product" in transaction names.

Image

benakansara commented 2 weeks ago

Based on discussion offline with @jasonrhodes and @maryam-saeidi, and considering search returning false positives with current approach, I think we have two options to streamline saving group in alert document:

1) kibana.alert.groupByKeys or kibana.alert.groupings field as object type with dynamic mapping enabled

With this approach -

The downside which was also captured in RFC is mapping explosion. However, as @dgieselaar mentioned on slack - The chance of a mapping explosion seems practically non-existing because a grouping key is explicitly set by the user and not generated from data (only the values are)

Even if there are 100s of rules configured by user, the set of group by fields would be quite limited (with overlapping group by fields between rules)

2) kibana.alert.groupByKeys or kibana.alert.groupings field as flattened type

With this approach -

The downside with keeping only one flattened field is that we won't have KQL auto-suggestion. We need another existing kibana.alert.group array field which works for KQL auto-suggestion but can lead to misleading results in some cases.

My proposal would be option 1) with dynamic mapping enabled.

maryam-saeidi commented 2 weeks ago

@elastic/response-ops (@ymao1 @pmuellr )

Our team is revisiting the alert grouping field topic that we discussed a year ago (document). We are considering the possibility of using flatten fields with enabled dynamic mapping as mentioned above (second option in the document) considering the fact that now we have a setting to ignore dynamic fields above the field limit and with assuming the number of fields that a user would select for group by fields is probably manageable.

Any feedback/concerns about selecting this approach?

benakansara commented 2 weeks ago

@maryam-saeidi small clarification: for search use case - If we use flattened type, we don't need dynamic mapping. If we keep field type object, we would need dynamic mapping (updated my earlier comment to reflect this)

benakansara commented 2 weeks ago

Example of index mapping

1) object type with dynamic mapping

"mappings": {
    "properties": {
      "kibana.alert.groupByKeys": {
        "dynamic": true, 
         "properties": {}
    }
  }
}

2) flattened type

"mappings": {
    "properties": {
      "kibana.alert.groupByKeys": {
        "type": "flattened"
    }
  }
}
pmuellr commented 2 weeks ago

Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.

Not clear to me yet, but if there's still a plan to use kibana.alert.group (maybe not?, still reading comments above, thought I should point this out tho), this seems problematic in that we'd have mapping issues searching across o11y and security alerts indices.

pmuellr commented 2 weeks ago

I believe security had us change some things to flattened to make some searches easier/possible, but I think there are also some down-sides; maybe the values are always treated as strings? Feels like that wouldn't be an issue if we just want to track a group name and value that is always a string. Are there grouping values that could be dates, numbers, etc?

Also, at the time, I believe KQL didn't support nested fields, so that wasn't an option. Maybe it does today? Would we even need KQL support - basically for UX? I believe nested fields also aren't supported in ES|QL at the moment, which may be another good reason to not use nested.

dgieselaar commented 2 weeks ago

@pmuellr do you have specific objections around a dynamically mapped object which only accepts strings under kibana.alert.grouping, similar how SLOs use slo.grouping? I think flattened comes with other downsides that I'd like us to avoid (e.g. it not showing up in field caps IIUC).

Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.

Do you have a link to the code where this happens?

dgieselaar commented 2 weeks ago

@benakansara I think groupByKeys gives the impression that it's a set of the grouping keys (ie an array of strings), instead of the field-value pairs that are a result of the grouping key (ie a plain key-value object).

pmuellr commented 2 weeks ago

do you have specific objections around a dynamically mapped object which only accepts strings

I don't have much experience with dynamically mapped objects, which is my main objection :-). Convince me!

The chance of a mapping explosion seems practically non-existing because a grouping key is explicitly set by the user and not generated from data (only the values are)

I have more experience with flattened than dynamic at this point,and seems practically non-existing doesn't give me great feels :-)

What are the other downsides of flattened? Maybe can't be used in aggs or other contexts where dynamic can? I know there are (or used to be) limitations in the value types (always treated as keyword?), which I wouldn't be a problem in this case, unless we allow grouping by non-keyword types like numeric / date ...

pmuellr commented 2 weeks ago

Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.

Do you have a link to the code where this happens?

I just did a search of the Kibana codebase in vscode for kibana.alert.group - it shows both kibana.alert.group with field/value and id/index variants. This doesn't seem right to me, but I haven't investigated further.

In any case, using a new field(s) for this seems wise :-)

dgieselaar commented 2 weeks ago

@pmuellr:

I don't have much experience with dynamically mapped objects, which is my main objection :-). Convince me!

Happy to!

I have more experience with flattened than dynamic at this point,and seems practically non-existing doesn't give me great feels :-)

I really think this is a non-issue. Someone would programmatically need to generate random-ish grouping keys and create rules with them. Now maybe someone will do that... but I don't think the chance of that happening is bigger than let's say someone indexing into the .alerts index directly.

What are the other downsides of flattened? Maybe can't be used in aggs or other contexts where dynamic can? I know there are (or used to be) limitations in the value types (always treated as keyword?), which I wouldn't be a problem in this case, unless we allow grouping by non-keyword types like numeric / date ...

It doesn't show up in field caps so you cannot do things like autocomplete on them or verify the existence of a field ahead of time (which we need for ES|QL for instance).

benakansara commented 2 weeks ago

I think groupByKeys gives the impression that it's a set of the grouping keys (ie an array of strings), instead of the field-value pairs that are a result of the grouping key (ie a plain key-value object).

@dgieselaar I agree. groupings or group sounds better to me. The idea of naming it groupByKeys comes from the fact that we have a context variable called context.groupByKeys and it would be good to have consistent naming (we couldn't use context.group as it was already present with string type).

jasonrhodes commented 1 week ago

Thanks @maryam-saeidi and @benakansara — I talked to Mary yesterday and said I was leaning toward using the dynamic mapping for this field because the benefits seem good and the risk of having a customer group alerts by hundreds of different fields seems small. To help bring some clarity to this conversation, I had a look at what our current situation is. It looks like for an index like .internal.alerts-observability.metrics.alerts-default-000001, the field limit has been upped to 2500 as-is:

{
  "settings": {
    "index": {
      "mapping": {
        "total_fields": {
          "limit": "2500"
        },
        "ignore_malformed": "true"
      }
    }
  }
}

When I do a search on that same index to see current fields:

curl -s -XGET "/.alerts-observability.metrics*/_field_caps?fields=*" | jq '.fields|length'

I get 2123. So I think we need to ask whether we feel comfortable with that amount of room when introducing a dynamic field, especially if there could be other reasons these mappings could grow beyond just this one field. Can we bump this number up without cuasing issues? Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? How confident are we there? I think it's reasonable to assume that number likely would never approach 1000, but we should also know the story of what would happen if a customer did exceed this limit by grouping in an unexpected way.

dgieselaar commented 1 week ago

How do we end up with over 2000 fields? Are we using statically mapped ECS fields instead of ecs@mappings which uses dynamic templates?

maryam-saeidi commented 1 week ago

@jasonrhodes Thanks for sharing the information about the current limitations. I was wondering if it would be an option to enable dynamic mapping on a cluster similar to one of ours (like QA or any cluster in which we have a lot of alerting rules) and see how many fields will be added. This would give a sample for the question: "Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? ". We can also use that instance to test manually and see this approach in action.

By adding dynamic mapping, we will save the mappings of ECS fields two times in this case, one at the root level and one for this group by field which does not have an issue by default, just to keep it in mind in case there are possible improvements to reuse mappings at different levels. (I am not sure if there is such a feature.)

Another topic to discuss is whether using dynamic mapping can cause an issue regarding the type of field that will be added dynamically. Would it be possible that the type that is added dynamically does not match the actual type of the field? Can it be an issue?

If a user adds a group by field mistakenly, would that be a problem besides having an extra unused mapping? Is there a possibility that users add many group by fields mistakenly? If yes, What is the process of correction? (Similar to the question, "what would happen if a customer did exceed this limit by grouping in an unexpected way.") What would happen if the user changed the shape of their data and renamed the list of groups by fields by migrating to a new set of field names? Can we have a clean-up process for such a case?

And, since we have fieldsForAAD that limits the fields we show in the alert table and auto-suggestions in the KQL bar, how would it work with dynamic mapping? (Would it work as expected if we add kibana.alert.groupings.* or something similar to that list?)

maryam-saeidi commented 1 week ago

How do we end up with over 2000 fields? Are we using statically mapped ECS fields instead of ecs@mappings which uses dynamic templates?

@dgieselaar I think it is related to using the .alerts-ecs-mappings component template for alerting indices, which has the mapping of all ECS fields statically.

dgieselaar commented 1 week ago

@maryam-saeidi are we required to use it? @pmuellr Is this legacy, or can we switch to ecs@mappings? We had this issue years ago when RAC started, and we had very long discussions where IIRC we agreed not to map all ECS fields by default, precisely because of this reason.

ymao1 commented 1 week ago

@dgieselaar That is currently still the way we introduce ECS mappings into the alerts documents. I will bring up addressing this issue with the team.

dgieselaar commented 1 week ago

@ymao1 yes please, it would be great if we can address it on the short term - it seems counterproductive to have discussions about adding dozens of fields when we're needlessly creating explicit mappings for around 2000 of them of which we only use a handful by default.

jasonrhodes commented 1 week ago

Feels like this conversation is beginning to spin its wheels a bit, putting us at risk of still not having a consistent and usable way to do what we're hoping to do. @ymao1 (cc @kobelb) it would probably be helpful to have a realistic assessment of whether that linked issue sounds viable in the short term.

Observability folks, if we continue to include ~2000 ECS fields in the alerts index mappings, would we still be comfortable introducing another dynamically mapped field for kibana.alert.grouping? Can someone confirm the specific downsides of using the flattened type for our various use-cases with this grouping field?

dgieselaar commented 1 week ago

Can someone confirm the specific downsides of using the flattened type for our various use-cases with this grouping field?

AFAIK they don't show up in _field_caps, meaning we also cannot validate the presence of the field before running the query - which is (unfortunately) a necessity for ES|QL. They also don't support anything other than keywords and a subset of queries, which I expect to be mostly fine (that is, until someone uses a boolean or a long for grouping :) ).

FWIW, I don't think we should block this based on the mappings issue. Ideally we just go with dynamically mapped objects, and if the field limit becomes an issue, we have something that will have a much bigger impact (switching to ecs@mapping) rather than us having to use workarounds like flattened fields.

jasonrhodes commented 6 days ago

Thanks, @dgieselaar -- @andrewvc I'd love your take on this re: going with a dynamically mapped object.

maryam-saeidi commented 5 days ago

@jasonrhodes @andrewvc I believe being able to manually test this approach on a cluster similar to one of ours (like QA or any cluster in which we have a lot of alerting rules and related data) and see this approach in action would be beneficial. I am mostly thinking about answering questions like:

And, in general, figuring out any other issues that might show themselves while testing this approach in a real-world scenario. Is this something that we can consider?

Also, we have the following open questions to answer:

  1. In case the limit size is exceeded, what can be done? Bumping the limit or ...?
  2. What is a clean-up process in case the shape of data has changed and some of the mappings are not relevant anymore, or someone mistakenly added group-by fields just for testing and ended up having unnecessary mappings for them?
  3. Or maybe a more generic question: Is there any downside to having extra mappings that are not used anymore?