[Observability] [AAD] Streamline the method of saving group information in alert document

benakansara commented 6 months ago

Currently we have kibana.alert.instance.id in all alerts that saves comma separated group values in the alert document. We would like to have a field that provides information in the form of {field, value} pair, and allows for individual {field, value} to be searchable/queryable in the alert document. The requirement of this field is discussed in the RFC here.

Based on the discussion in above RFC, the Custom threshold rule saves group information in AAD with kibana.alert.group field which is an array of { field: field-name, value: field-value }.

We need to streamline the method of saving group information in AAD across all Observability rules.

Use cases

The field should be searchable/queryable reliably without false positives
Auto-suggestion on KQL bar should suggest this field
Use in action template of "Summary of alerts" action frequency (described in comment below) without relying on index

Rules where group info should be saved in its dedicated field in alert document

Custom threshold rule - currently has kibana.alert.group array
Metric threshold rule - currently has kibana.alert.group array
Log threshold rule - currently has kibana.alert.group array
SLO burn rate rule - currently has kibana.alert.group array
Inventory threshold rule
APM Latency threshold rule
APM Failed transaction rate threshold rule
APM Error count rule

Acceptance criteria

Have same field with same structure to save group information in alert document across all Observability rules

elasticmachine commented 6 months ago

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

benakansara commented 6 months ago

We have a different set of context variables for "Summary of alerts" action frequency. We don't have separate context.group or context.groupByKeys context variables. Instead we can access all AAD fields when creating action message. This would be one of the use cases for using group info from AAD. I used alerts.all.data variable to build alert action message. In this case, we need to rely on the index if AAD field has an array like structure.

{{#alerts.all.data}}
Host name: {{kibana.alert.group.0.value}}
Container ID: {{kibana.alert.group.1.value}}
{{/alerts.all.data}}

benakansara commented 6 months ago

We can introduce two fields - one for search use case, one for iterating over.

kibana.alert.group as an array [ {field: field-name, value: field-value} ]
kibana.alert.groupByKeys as an object

jasonrhodes commented 5 months ago

~The only problem I see there is I think kibana.alert.group is already used in some rule types, as a string -- other than that I am +1 on this idea, generally.~

Update: @benakansara corrected me that this is only the case in context.*, not at the alerts as data level.

maryam-saeidi commented 1 month ago

The current state of observability rules for the following two fields:

kibana.alert.group as an array [ {field: field-name, value: field-value} ]
kibana.alert.groupByKeys as an object
- This field does not exist atm

Rule type	kibana.alert.group
APM rule	No group by fields
Inventory	No group by fields, has a predefined list of fields that can be selected from
Metric threshold	✔
Custom threshold	✔
Log threshold	✔
SLO burn rate	✔
ES Query	❌
Anomaly detection	Didn't see a group field there, maybe we should check what fields exist in an anomaly job

benakansara commented 1 month ago

As described by @maryam-saeidi in above comment, atm we are storing group info in kibana.alert.group as an array in some of the rules.

Using this field in search could result in false positives, as seen in one of the examples below.

If user filters alerts with kibana.alert.group.field: "service.name" and kibana.alert.group.value: *product* KQL filter on alerts search bar, they would expect to see services with only "product" in their name, but this would return anything that matches "product" in any of the kibana.alert.group.value values in the document. In below example using otel-demo data, I created a rule with group by on service.name and transaction.name, and using above filter returns services without "product" in their name because some transactions have "product" in transaction names.

benakansara commented 1 month ago

Based on discussion offline with @jasonrhodes and @maryam-saeidi, and considering search returning false positives with current approach, I think we have two options to streamline saving group in alert document:

1) kibana.alert.groupByKeys or kibana.alert.groupings field as object type with dynamic mapping enabled

With this approach -

we won't need two separate fields - one for KQL bar and the other for searching
KQL auto suggestion will include field name kibana.alert.group.host.name as opposed to current kibana.alert.group.value
querying is possible

The downside which was also captured in RFC is mapping explosion. However, as @dgieselaar mentioned on slack - The chance of a mapping explosion seems practically non-existing because a grouping key is explicitly set by the user and not generated from data (only the values are)

Even if there are 100s of rules configured by user, the set of group by fields would be quite limited (with overlapping group by fields between rules)

2) kibana.alert.groupByKeys or kibana.alert.groupings field as flattened type

With this approach -

querying is possible

The downside with keeping only one flattened field is that we won't have KQL auto-suggestion. We need another existing kibana.alert.group array field which works for KQL auto-suggestion but can lead to misleading results in some cases.

My proposal would be option 1) with dynamic mapping enabled.

maryam-saeidi commented 1 month ago

@elastic/response-ops (@ymao1 @pmuellr )

Our team is revisiting the alert grouping field topic that we discussed a year ago (document). We are considering the possibility of using flatten fields with enabled dynamic mapping as mentioned above (second option in the document) considering the fact that now we have a setting to ignore dynamic fields above the field limit and with assuming the number of fields that a user would select for group by fields is probably manageable.

Any feedback/concerns about selecting this approach?

benakansara commented 1 month ago

@maryam-saeidi small clarification: for search use case - If we use flattened type, we don't need dynamic mapping. If we keep field type object, we would need dynamic mapping (updated my earlier comment to reflect this)

benakansara commented 1 month ago

Example of index mapping

1) object type with dynamic mapping

"mappings": {
    "properties": {
      "kibana.alert.groupByKeys": {
        "dynamic": true, 
         "properties": {}
    }
  }
}

2) flattened type

"mappings": {
    "properties": {
      "kibana.alert.groupByKeys": {
        "type": "flattened"
    }
  }
}

pmuellr commented 1 month ago

Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.

Not clear to me yet, but if there's still a plan to use kibana.alert.group (maybe not?, still reading comments above, thought I should point this out tho), this seems problematic in that we'd have mapping issues searching across o11y and security alerts indices.

pmuellr commented 1 month ago

I believe security had us change some things to flattened to make some searches easier/possible, but I think there are also some down-sides; maybe the values are always treated as strings? Feels like that wouldn't be an issue if we just want to track a group name and value that is always a string. Are there grouping values that could be dates, numbers, etc?

Also, at the time, I believe KQL didn't support nested fields, so that wasn't an option. Maybe it does today? Would we even need KQL support - basically for UX? I believe nested fields also aren't supported in ES|QL at the moment, which may be another good reason to not use nested.

dgieselaar commented 1 month ago

@pmuellr do you have specific objections around a dynamically mapped object which only accepts strings under kibana.alert.grouping, similar how SLOs use slo.grouping? I think flattened comes with other downsides that I'd like us to avoid (e.g. it not showing up in field caps IIUC).

Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.

Do you have a link to the code where this happens?

dgieselaar commented 1 month ago

@benakansara I think groupByKeys gives the impression that it's a set of the grouping keys (ie an array of strings), instead of the field-value pairs that are a result of the grouping key (ie a plain key-value object).

pmuellr commented 1 month ago

do you have specific objections around a dynamically mapped object which only accepts strings

I don't have much experience with dynamically mapped objects, which is my main objection :-). Convince me!

The chance of a mapping explosion seems practically non-existing because a grouping key is explicitly set by the user and not generated from data (only the values are)

I have more experience with flattened than dynamic at this point,and seems practically non-existing doesn't give me great feels :-)

What are the other downsides of flattened? Maybe can't be used in aggs or other contexts where dynamic can? I know there are (or used to be) limitations in the value types (always treated as keyword?), which I wouldn't be a problem in this case, unless we allow grouping by non-keyword types like numeric / date ...

pmuellr commented 1 month ago

Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.

Do you have a link to the code where this happens?

I just did a search of the Kibana codebase in vscode for kibana.alert.group - it shows both kibana.alert.group with field/value and id/index variants. This doesn't seem right to me, but I haven't investigated further.

In any case, using a new field(s) for this seems wise :-)

dgieselaar commented 1 month ago

@pmuellr:

I don't have much experience with dynamically mapped objects, which is my main objection :-). Convince me!

Happy to!

I have more experience with flattened than dynamic at this point,and seems practically non-existing doesn't give me great feels :-)

I really think this is a non-issue. Someone would programmatically need to generate random-ish grouping keys and create rules with them. Now maybe someone will do that... but I don't think the chance of that happening is bigger than let's say someone indexing into the .alerts index directly.

What are the other downsides of flattened? Maybe can't be used in aggs or other contexts where dynamic can? I know there are (or used to be) limitations in the value types (always treated as keyword?), which I wouldn't be a problem in this case, unless we allow grouping by non-keyword types like numeric / date ...

It doesn't show up in field caps so you cannot do things like autocomplete on them or verify the existence of a field ahead of time (which we need for ES|QL for instance).

benakansara commented 1 month ago

I think groupByKeys gives the impression that it's a set of the grouping keys (ie an array of strings), instead of the field-value pairs that are a result of the grouping key (ie a plain key-value object).

@dgieselaar I agree. groupings or group sounds better to me. The idea of naming it groupByKeys comes from the fact that we have a context variable called context.groupByKeys and it would be good to have consistent naming (we couldn't use context.group as it was already present with string type).

jasonrhodes commented 1 month ago

Thanks @maryam-saeidi and @benakansara — I talked to Mary yesterday and said I was leaning toward using the dynamic mapping for this field because the benefits seem good and the risk of having a customer group alerts by hundreds of different fields seems small. To help bring some clarity to this conversation, I had a look at what our current situation is. It looks like for an index like .internal.alerts-observability.metrics.alerts-default-000001, the field limit has been upped to 2500 as-is:

{
  "settings": {
    "index": {
      "mapping": {
        "total_fields": {
          "limit": "2500"
        },
        "ignore_malformed": "true"
      }
    }
  }
}

When I do a search on that same index to see current fields:

curl -s -XGET "/.alerts-observability.metrics*/_field_caps?fields=*" | jq '.fields|length'

I get 2123. So I think we need to ask whether we feel comfortable with that amount of room when introducing a dynamic field, especially if there could be other reasons these mappings could grow beyond just this one field. Can we bump this number up without cuasing issues? Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? How confident are we there? I think it's reasonable to assume that number likely would never approach 1000, but we should also know the story of what would happen if a customer did exceed this limit by grouping in an unexpected way.

dgieselaar commented 1 month ago

How do we end up with over 2000 fields? Are we using statically mapped ECS fields instead of ecs@mappings which uses dynamic templates?

maryam-saeidi commented 1 month ago

@jasonrhodes Thanks for sharing the information about the current limitations. I was wondering if it would be an option to enable dynamic mapping on a cluster similar to one of ours (like QA or any cluster in which we have a lot of alerting rules) and see how many fields will be added. This would give a sample for the question: "Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? ". We can also use that instance to test manually and see this approach in action.

By adding dynamic mapping, we will save the mappings of ECS fields two times in this case, one at the root level and one for this group by field which does not have an issue by default, just to keep it in mind in case there are possible improvements to reuse mappings at different levels. (I am not sure if there is such a feature.)

Another topic to discuss is whether using dynamic mapping can cause an issue regarding the type of field that will be added dynamically. Would it be possible that the type that is added dynamically does not match the actual type of the field? Can it be an issue?

If a user adds a group by field mistakenly, would that be a problem besides having an extra unused mapping? Is there a possibility that users add many group by fields mistakenly? If yes, What is the process of correction? (Similar to the question, "what would happen if a customer did exceed this limit by grouping in an unexpected way.") What would happen if the user changed the shape of their data and renamed the list of groups by fields by migrating to a new set of field names? Can we have a clean-up process for such a case?

And, since we have fieldsForAAD that limits the fields we show in the alert table and auto-suggestions in the KQL bar, how would it work with dynamic mapping? (Would it work as expected if we add kibana.alert.groupings.* or something similar to that list?)

maryam-saeidi commented 1 month ago

How do we end up with over 2000 fields? Are we using statically mapped ECS fields instead of ecs@mappings which uses dynamic templates?

@dgieselaar I think it is related to using the .alerts-ecs-mappings component template for alerting indices, which has the mapping of all ECS fields statically.

dgieselaar commented 1 month ago

@maryam-saeidi are we required to use it? @pmuellr Is this legacy, or can we switch to ecs@mappings? We had this issue years ago when RAC started, and we had very long discussions where IIRC we agreed not to map all ECS fields by default, precisely because of this reason.

ymao1 commented 1 month ago

@dgieselaar That is currently still the way we introduce ECS mappings into the alerts documents. I will bring up addressing this issue with the team.

dgieselaar commented 1 month ago

@ymao1 yes please, it would be great if we can address it on the short term - it seems counterproductive to have discussions about adding dozens of fields when we're needlessly creating explicit mappings for around 2000 of them of which we only use a handful by default.

jasonrhodes commented 1 month ago

Feels like this conversation is beginning to spin its wheels a bit, putting us at risk of still not having a consistent and usable way to do what we're hoping to do. @ymao1 (cc @kobelb) it would probably be helpful to have a realistic assessment of whether that linked issue sounds viable in the short term.

Observability folks, if we continue to include ~2000 ECS fields in the alerts index mappings, would we still be comfortable introducing another dynamically mapped field for kibana.alert.grouping? Can someone confirm the specific downsides of using the flattened type for our various use-cases with this grouping field?

dgieselaar commented 1 month ago

Can someone confirm the specific downsides of using the flattened type for our various use-cases with this grouping field?

AFAIK they don't show up in _field_caps, meaning we also cannot validate the presence of the field before running the query - which is (unfortunately) a necessity for ES|QL. They also don't support anything other than keywords and a subset of queries, which I expect to be mostly fine (that is, until someone uses a boolean or a long for grouping :) ).

FWIW, I don't think we should block this based on the mappings issue. Ideally we just go with dynamically mapped objects, and if the field limit becomes an issue, we have something that will have a much bigger impact (switching to ecs@mapping) rather than us having to use workarounds like flattened fields.

jasonrhodes commented 1 month ago

Thanks, @dgieselaar -- @andrewvc I'd love your take on this re: going with a dynamically mapped object.

maryam-saeidi commented 1 month ago

@jasonrhodes @andrewvc I believe being able to manually test this approach on a cluster similar to one of ours (like QA or any cluster in which we have a lot of alerting rules and related data) and see this approach in action would be beneficial. I am mostly thinking about answering questions like:

Would it be possible for the type that is added dynamically not to match the actual field type? Can it be an issue?
Seeing a sample number group-by fields in action.

And, in general, figuring out any other issues that might show themselves while testing this approach in a real-world scenario. Is this something that we can consider?

Also, we have the following open questions to answer:

In case the limit size is exceeded, what can be done? Bumping the limit or ...?
What is a clean-up process in case the shape of data has changed and some of the mappings are not relevant anymore, or someone mistakenly added group-by fields just for testing and ended up having unnecessary mappings for them?
Or maybe a more generic question: Is there any downside to having extra mappings that are not used anymore?

jasonrhodes commented 1 month ago

It should be fairly simple to run some queries against our overview cluster for a rough idea of how many group by fields are in use there.

I defer to @andrewvc on the rest of these questions. I don't have a great sense of how big the risk is if we move forward with a dynamically mapped field in this kind of shared index space, but I'm also a bit nervous we could overthink things and spend too much time being defensive about it, when most of these questions might already be unanswered with regard to what happens in these same scenarios if the (mostly irrelevant) ECS mappings grow, etc.

maryam-saeidi commented 4 weeks ago

A possible alternative is using a painless script to filter out false positives.

Example painless query

``` GET .alerts*/_search { "query": { "bool": { "filter": [ { "bool": { "filter": [ { "bool": { "should": [ { "query_string": { "fields": [ "kibana.alert.group.value" ], "query": "container-name" } } ], "minimum_should_match": 1 } }, { "bool": { "should": [ { "match_phrase": { "kibana.alert.group.field": "host.name" } } ], "minimum_should_match": 1 } } ] } }, { "script": { "script": { "source": "for (int i=0; i < doc['kibana.alert.group.field'].length; i++) { if (doc['kibana.alert.group.field'][i] == params.groupName && doc['kibana.alert.group.value'][i] == params.groupValue) return true;} return false;", "params": { "groupName": "host.name", "groupValue": "container-name" } } } } ] } } } ```

pmuellr commented 3 weeks ago

Coming back around to this discussion, thanks for the ping @maryam-saeidi !

My current thought is dynamic mapping for some objects in the mappings is probably ok, though we should obviously think it through:

what is the expected cardinality of the fields?
how are we currently dealing with the max # of fields index.mapping.total_fields.limit, and how does that change?
how are we currently dealing with the max # of dynamic fields index.mapping.total_fields.ignore_dynamic_beyond_limit, and how does that change?
how do we find out when we hit a limit?
what do we do when we hit a limit?
what happens when a limit is hit, and causes an SDH, and how do we figure out that's the problem?

Guessing that for these grouping fields, we'll be fine. The values are field names, right? So they'd basically become the same key under this new object. I'd guess within a project/deployment, the cardinality is fine. The problems you might expect would be with something like date-based field names, seems unlikely folks would be creating "random" key values.

If we do this, I'm sure we'll do more of this :-). So it would be good to have some experience of what happens when we hit the limits. I suspect it's kinda silent, assuming we can set a limit and have ES continue, presumably ignoring new "fields". And thus an SDH-generator.

@mikecote @ymao1 thoughts?

pmuellr commented 3 weeks ago

Beyond the mappings, there was some thought about how the context variables would be accessed, and I think that's a good thing to think about as well. Seems like we would actually want an "ordered dictionary" kind of collection, since I think the current shape doesn't tell you the "order" of the grouping. But since neither JS nor ES support that, do we need a separate array of the keys in grouping order, so someone could iterate over them that way? JS actually does sort of support ordering of properties in objects, but I'd like to not depend on that, as you can lose the orderings in different ways. I'd like to see some mustache template examples accessing these fields.

I think the relevance of the mustache fields is < the mappings; the mappings are hugely important, the mustache fields - we can improve over time, or probably find potentially verbose solutions to whatever shape they are given. But still something to think about.

ymao1 commented 3 weeks ago

what is the expected cardinality of the fields? how are we currently dealing with the max # of fields index.mapping.total_fields.limit, and how does that change? how are we currently dealing with the max # of dynamic fields > index.mapping.total_fields.ignore_dynamic_beyond_limit, and how does that change? how do we find out when we hit a limit? what do we do when we hit a limit? what happens when a limit is hit, and causes an SDH, and how do we figure out that's the problem?

I think the concern is that the number of dynamic fields underneath the group object could grow unbounded and we could reach the total fields limit at which point the only resolution would be to increase the total fields limit which would require a new release. This is something we used to deal with in the rule registry when the index resources were installed as needed. Since we started installing alert resources on Kibana startup, we are able to catch these issues during development time (reaching the total fields limit).

However, if the number of groups directly correlates with the number of alerts that could be generated during a single execution, we do already cap that value (default 1000) so maybe that's ok? Or does the possible groups correlate with all possible groups that have been cumulatively generated by the rule, in which case it could grow unbounded? I don't have a good sense of this here.

pmuellr commented 3 weeks ago

I think the concern is that the number of dynamic fields underneath the group object could grow unbounded and we could reach the total fields limit at which point the only resolution would be to increase the total fields limit which would require a new release. This is something we used to deal with in the rule registry when the index resources were installed as needed. Since we started installing alert resources on Kibana startup, we are able to catch these issues during development time (reaching the total fields limit).

Ya, I guess we will need some estimate on the cardinality, and then increase the current max we have by that amount.

However, if the number of groups directly correlates with the number of alerts that could be generated during a single execution, we do already cap that value (default 1000) so maybe that's ok? Or does the possible groups correlate with all possible groups that have been cumulatively generated by the rule, in which case it could grow unbounded? I don't have a good sense of this here.

I believe the coorelation is how many fields they group by, over all rules in a single "index". So, if they only ever grouped by the same 3 fields over all their rules, there would be 3 new fields.

dgieselaar commented 3 weeks ago

@ymao1 @pmuellr if we run into the field limit we have https://github.com/elastic/kibana/issues/168497. That issue has been open for a year, and I'd like to use any objections around this as a forcing function to actually get that ticket done - the value of getting that over the finish line seems immense and immediately makes this discussion much simpler. I remember I spent a lot of time during RAC arguing about why statically mapping ECS fields in all AAD indices is a bad idea. The conclusion back then was that only Security AAD indices would do this so I'm not sure why this has now been applied to all AAD indices, including the Observability ones. I am pushing on this because it does not make sense to me to have a discussion about adding a few dozen of fields (tops) when we have the opportunity to cut back the amount of mapped fields by two orders of magnitude.

@pmuellr your questions make sense, however, they are problems that exist today. I don't expect this feature to materially change the amount of mapped fields. For reference, the cardinality of kibana.alert.group.field in the overview cluster is 19, for 1 million alerts.

ymao1 commented 3 weeks ago

@dgieselaar I'll move that issue back into triage and we'll see if we can prioritize it.

andrewvc commented 3 weeks ago

Apologies for missing the previous pings. I'm +1 on reducing the current field usage along the lines @dgieselaar proposes and also on the dynamic mappings. I think it's the best balance of flexibility and performance.

@pmuellr @ymao1 do we have any rough estimates in terms of effort / time to deliver here?

maryam-saeidi commented 2 weeks ago

Hi everyone,

I created a PoC to test dynamic mapping and did some tests and here are my findings:

In case of hitting the limit, if we don't have index.mapping.total_fields.ignore_dynamic_beyond_limit enabled (PR), we will get the following error, and that alert will not be reported (the rule execution is successful):
```
Error writing alerts for observability.rules.custom_threshold:53301b6a-5bb6-42ee-a8ba-ac3d352dbaf7 'Custom threshold'
```
If we enable index.mapping.total_fields.ignore_dynamic_beyond_limit, then the extra fields will be saved, but they will not be mapped, and we will see these fields in the _ignored field as shown below:
We can enable dynamic mapping for the string type, but apparently, other types, such as long/double, are cast to a string, if possible. From the UI, we don't allow selecting those fields, but if someone provides them via API, we will generate alerts based on those, and a dynamic mapping will be saved for them.
When we hit the limit, we can increase the mapping limit, but we need to help the user investigate the issue and possibly get rid of the unused mappings if that is the case.
- If we increase the mapping limit, then the existing alert documents will be updated in the next execution and _ignore fields will be removed.
- If we roll over the alert index, then all the dynamic mappings will be removed, and the new index will start with a fresh mapping. However, when using the alert search bar, we will see mappings from all the indices behind the alert index alias, so to remove those mappings, we need to somehow get rid of the old index, not sure how. (Maybe something to consider in the alert Archiving / Deletion strategy) (Thanks, @P1llus, for pointing this out.) I haven't fully tested this approach yet.
During a discussion with @P1llus, we also discussed how we can surface _ignored issues to the user and possibly help them with some instructions on how to debug this. But if we solve the ECS static mapping, maybe we wouldn't need to focus on this as the chance of users hitting the limit seems small.
It is worth mentioning that in the case of ECS group by fields, these fields will be mapped twice (once at the root level and once in the grouping field), but given our expectation of having a few numbers of the group by fields, mapping them twice should be fine.
I don't know the possibility of this issue, but if we have a mapping available in an old index (alert-index-1) and not have this mapping available in a new index (alert-index-2), then in the alert search box, it is possible to search based on that field but the new alerts will not be filtered as expected.

In general, I see the main issue with hitting the limit is not being able to search the new fields, but the alerts will still be generated, and we can see the data in the alert flyout (if we enable index.mapping.total_fields.ignore_dynamic_beyond_limit), so maybe we can consider moving forward with dynamic groups even before having dynamic ECS fields mapping, especially considering the sample data provided above:

In a sample mentioned here, we still have 377 (2500 - 2123) field mapping available
Cardinality of kibana.alert.group.field is 19 in the overview cluster (comment)

I also checked the overview cluster, and here are the number of fields for different alert indices (I used `/_field_caps?fields=*` for this purpose)	Index	Number of field mappings
.alerts-observability.metrics*	2121	379
.alerts-observability.threshold*	2091	409
.alerts-observability.logs*	2121	379

pmuellr commented 2 weeks ago

maybe we can consider moving forward with dynamic groups even before having dynamic ECS fields mapping

We discussed this in a ResponseOps call last week, and concur. They are similar but different, and I think we will get a bit more experience with the dyanmic fields by starting with just the dynamic groups.

Seems like we will want to add something to the framework alert writer, to catch the _ignored fields (don't think we do today), and surface them somehow.

mikecote commented 2 weeks ago

I'm catching up with the issue and one question I don't see asked is why we are not using the alert ECS mappings at the root level to accomplish this story? I'm sure there's a reason for it but it's not clear to me after reading the use cases on the GitHub issue..

The field should be searchable/queryable reliably without false positives

Use in action template of "Summary of alerts" action frequency (described in https://github.com/elastic/kibana/issues/183248#issuecomment-2107348632 below) without relying on index

The root fields seem to work for this use case, and I'm assuming the values are already populated at the root level today?

Auto-suggestion on KQL bar should suggest this field

Maybe is this requirement that needs something special. Is there something where we only want auto complete on the group by fields?

We structured the alert documents in a way that this data can be surfaced at the root and then leverage this structure for maintenance windows and conditional actions. It would feel inconsistent if we provided multiple sources to accomplish the same thing.

maryam-saeidi commented 2 weeks ago

why we are not using the alert ECS mappings at the root level to accomplish this story? The root fields seem to work for this use case, and I'm assuming the values are already populated at the root level today?

Yes, that is correct; we already saved the group by keyword ECS fields at the root level. The issue is related to handling groups that are not ECS fields. For example, in Otel data, we have k8s.cluster.name instead of orchestrator.cluster.name.

mikecote commented 1 week ago

Thanks @maryam-saeidi. I discussed with the team, and we feel comfortable if we find a way to implement this change for kibana.alert.group within the following constraints:

Have a guardrail to prevent the number of fields and mappings stored within kibana.alert.group to go beyond, say, 25.
Have all fields mapped as keyword to prevent type mismatches
Doesn't fail rule execution when guardrail kicks in, perhaps puts the rule in a warning state instead

If you're good within those constraints, we'd be happy to have you or someone else prototype this for the team to review.

andrewvc commented 1 week ago

I'm +1 on what @mikecote proposes, I'm curious about how we'd enforce the guardrail, if we have a place where we can easily do that validation that's great, I'm just curious where it'd go.

dgieselaar commented 1 week ago

@mikecote FWIW, Mary already put up a POC here: https://github.com/elastic/kibana/pull/199298. Is your ask to include some kind of guardrails in that POC?

FWIW, in SLOs the fields are called slo.groupings.*. I like singular better than plural (which is how we commonly name fields) but maybe it's more important to be consistent with the SLO field names in this case?

maryam-saeidi commented 1 week ago

Echoing what Dario mentioned above, I did a PoC, and I tried to find an option to limit the number of dynamic mappings for a field but didn't see such an option. Any proposal on how we can achieve that?

FWIW, in SLOs the fields are called slo.groupings.*. I like singular better than plural (which is how we commonly name fields) but maybe it's more important to be consistent with the SLO field names in this case?

Good point! If we have a similar definition of saving group information in slo.groupings, using the same name might not be a bad idea.

mikecote commented 1 week ago

I tried to find an option to limit the number of dynamic mappings for a field but didn't see such an option. Any proposal on how we can achieve that?

I don't have an idea at this time. I think that would be the last piece left for the PoC, to find a way to guardrail so we guarantee only a limited number of fields get mapped. Is that something you could take time to research? Would be curious to see what options exist on how this could be done in Kibana side or Elasticsearch side?

elastic / kibana