benakansara opened 6 months ago
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)
We have a different set of context variables for "Summary of alerts" action frequency. We don't have separate `context.group` or `context.groupByKeys` context variables. Instead, we can access all AAD fields when creating the action message. This would be one of the use cases for using group info from AAD. I used the `alerts.all.data` variable to build the alert action message. In this case, we need to rely on the index if the AAD field has an array-like structure.
```
{{#alerts.all.data}}
Host name: {{kibana.alert.group.0.value}}
Container ID: {{kibana.alert.group.1.value}}
{{/alerts.all.data}}
```
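If the group info were instead exposed as an array of `{field, value}` objects, the template could iterate over the pairs without hard-coding indices. A hypothetical sketch (assumes the `kibana.alert.group` array shape discussed in this thread):

```
{{#alerts.all.data}}
{{#kibana.alert.group}}
{{field}}: {{value}}
{{/kibana.alert.group}}
{{/alerts.all.data}}
```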
We can introduce two fields - one for the search use case, one for iterating over:

- `kibana.alert.group` as an array: `[ {field: field-name, value: field-value} ]`
- `kibana.alert.groupByKeys` as an object

~~The only problem I see there is I think `kibana.alert.group` is already used in some rule types, as a string -- other than that I am +1 on this idea, generally.~~

Update: @benakansara corrected me that this is only the case in `context.*`, not at the alerts-as-data level.
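As a sketch, the two proposed shapes might look like this side by side in an alert document (field names as proposed above; values are purely illustrative):

```json
{
  "kibana.alert.group": [
    { "field": "host.name", "value": "host-1" },
    { "field": "container.id", "value": "abc123" }
  ],
  "kibana.alert.groupByKeys": {
    "host.name": "host-1",
    "container.id": "abc123"
  }
}
```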
The current state of observability rules for the following two fields:

Rule type | `kibana.alert.group` |
---|---|
APM rule | No group by fields |
Inventory | No group by fields, has a predefined list of fields that can be selected from |
Metric threshold | ✔ |
Custom threshold | ✔ |
Log threshold | ✔ |
ES Query | ❌ |
Anomaly detection | Didn't see a group field there, maybe we should check what fields exist in an anomaly job |
As described by @maryam-saeidi in the above comment, atm we are storing group info in `kibana.alert.group` as an array in some of the rules. Using this field in search could result in false positives, as seen in one of the examples below.

If a user filters alerts with the `kibana.alert.group.field: "service.name" and kibana.alert.group.value: *product*` KQL filter on the alerts search bar, they would expect to see services with only "product" in their name, but this would return anything that matches "product" in any of the `kibana.alert.group.value` values in the document. In the example below, using otel-demo data, I created a rule with group by on `service.name` and `transaction.name`, and using the above filter returns services without "product" in their name because some transactions have "product" in transaction names.
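To make the false positive concrete: because the subfields of an `object` array are indexed as independent flat arrays, a hypothetical alert document like this one matches the filter above even though its service name contains no "product":

```json
{
  "kibana.alert.group": [
    { "field": "service.name", "value": "checkout" },
    { "field": "transaction.name", "value": "GET /api/products" }
  ]
}
```

The filter matches because "service.name" occurs somewhere in `kibana.alert.group.field` and `*product*` matches somewhere in `kibana.alert.group.value`; nothing ties the two conditions to the same array element (that would require a `nested` mapping and `nested` queries).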
Based on a discussion offline with @jasonrhodes and @maryam-saeidi, and considering that search returns false positives with the current approach, I think we have two options to streamline saving the group in the alert document:

1) `kibana.alert.groupByKeys` or `kibana.alert.groupings` field as `object` type with dynamic mapping enabled

With this approach:
- We could search on e.g. `kibana.alert.group.host.name` as opposed to the current `kibana.alert.group.value`
- The downside, which was also captured in the RFC, is mapping explosion. However, as @dgieselaar mentioned on Slack: "The chance of a mapping explosion seems practically non-existing because a grouping key is explicitly set by the user and not generated from data (only the values are)"
- Even if there are 100s of rules configured by a user, the set of group by fields would be quite limited (with overlapping group by fields between rules)

2) `kibana.alert.groupByKeys` or `kibana.alert.groupings` field as `flattened` type

With this approach:
- The downside of keeping only one `flattened` field is that we won't have KQL auto-suggestion. We would need the other existing `kibana.alert.group` array field, which works for KQL auto-suggestion but can lead to misleading results in some cases.

My proposal would be option 1) with dynamic mapping enabled.
@elastic/response-ops (@ymao1 @pmuellr )
Our team is revisiting the alert grouping field topic that we discussed a year ago (document). We are considering the possibility of using flattened fields with dynamic mapping enabled as mentioned above (second option in the document), considering the fact that we now have a setting to ignore dynamic fields above the field limit, and assuming the number of fields that a user would select as group by fields is probably manageable.
Any feedback/concerns about selecting this approach?
@maryam-saeidi small clarification: for the search use case - if we use the `flattened` type, we don't need dynamic mapping. If we keep the field type `object`, we would need dynamic mapping.

(updated my earlier comment to reflect this)
Example of index mapping:

1) `object` type with dynamic mapping

```json
"mappings": {
  "properties": {
    "kibana.alert.groupByKeys": {
      "dynamic": true,
      "properties": {}
    }
  }
}
```
2) `flattened` type

```json
"mappings": {
  "properties": {
    "kibana.alert.groupByKeys": {
      "type": "flattened"
    }
  }
}
```
Dang, poking around the Kibana codebase, I can see security rules are using fields `kibana.alert.group.id` and `kibana.alert.group.index`, both as aliases of other security-specific fields. Not clear to me yet, but if there's still a plan to use `kibana.alert.group` (maybe not? still reading comments above, thought I should point this out tho), this seems problematic in that we'd have mapping issues searching across o11y and security alerts indices.

I believe security had us change some things to `flattened` to make some searches easier/possible, but I think there are also some downsides; maybe the values are always treated as strings? Feels like that wouldn't be an issue if we just want to track a group name and value that is always a string. Are there grouping values that could be dates, numbers, etc.?

Also, at the time, I believe KQL didn't support `nested` fields, so that wasn't an option. Maybe it does today? Would we even need KQL support - basically for UX? I believe nested fields also aren't supported in ES|QL at the moment, which may be another good reason to not use `nested`.
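On the "values always treated as strings" question: yes, with a `flattened` mapping all leaf values are indexed as keywords. A sketch (the `kibana.alert.groupings` name is one of the candidates discussed in this thread):

```json
{
  "mappings": {
    "properties": {
      "kibana.alert.groupings": { "type": "flattened" }
    }
  }
}
```

A document with `"kibana.alert.groupings": { "http.status": 500 }` would then match the term query `kibana.alert.groupings.http.status: "500"`, but numeric range semantics are lost because comparisons on flattened values are lexicographic.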
@pmuellr do you have specific objections around a dynamically mapped object which only accepts strings under `kibana.alert.grouping`, similar to how SLOs use `slo.grouping`? I think `flattened` comes with other downsides that I'd like us to avoid (e.g. it not showing up in field caps IIUC).
Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.
Do you have a link to the code where this happens?
@benakansara I think `groupByKeys` gives the impression that it's a set of the grouping keys (ie an array of strings), instead of the field-value pairs that are a result of the grouping key (ie a plain key-value object).
do you have specific objections around a dynamically mapped object which only accepts strings
I don't have much experience with dynamically mapped objects, which is my main objection :-). Convince me!
The chance of a mapping explosion seems practically non-existing because a grouping key is explicitly set by the user and not generated from data (only the values are)
I have more experience with `flattened` than `dynamic` at this point, and "seems practically non-existing" doesn't give me great feels :-)

What are the other downsides of `flattened`? Maybe it can't be used in aggs or other contexts where `dynamic` can? I know there are (or used to be) limitations in the value types (always treated as keyword?), which wouldn't be a problem in this case, unless we allow grouping by non-keyword types like numeric / date ...
Dang, poking around the Kibana codebase, I can see security rules are using fields kibana.alert.group.id and kibana.alert.group.index, both as aliases of other security-specific fields.
Do you have a link to the code where this happens?
I just did a search of the Kibana codebase in vscode for `kibana.alert.group` - it shows both `kibana.alert.group` with `field`/`value` and `id`/`index` variants. This doesn't seem right to me, but I haven't investigated further.
In any case, using a new field(s) for this seems wise :-)
@pmuellr:
I don't have much experience with dynamically mapped objects, which is my main objection :-). Convince me!
Happy to!
I have more experience with flattened than dynamic at this point, and "seems practically non-existing" doesn't give me great feels :-)
I really think this is a non-issue. Someone would programmatically need to generate random-ish grouping keys and create rules with them. Now maybe someone will do that... but I don't think the chance of that happening is bigger than, let's say, someone indexing into the `.alerts` index directly.
What are the other downsides of flattened? Maybe can't be used in aggs or other contexts where dynamic can? I know there are (or used to be) limitations in the value types (always treated as keyword?), which I wouldn't be a problem in this case, unless we allow grouping by non-keyword types like numeric / date ...
It doesn't show up in field caps so you cannot do things like autocomplete on them or verify the existence of a field ahead of time (which we need for ES|QL for instance).
I think `groupByKeys` gives the impression that it's a set of the grouping keys (ie an array of strings), instead of the field-value pairs that are a result of the grouping key (ie a plain key-value object).

@dgieselaar I agree. `groupings` or `group` sounds better to me. The idea of naming it `groupByKeys` comes from the fact that we have a context variable called `context.groupByKeys` and it would be good to have consistent naming (we couldn't use `context.group` as it was already present with `string` type).
Thanks @maryam-saeidi and @benakansara — I talked to Mary yesterday and said I was leaning toward using the dynamic mapping for this field because the benefits seem good and the risk of having a customer group alerts by hundreds of different fields seems small. To help bring some clarity to this conversation, I had a look at what our current situation is. It looks like for an index like `.internal.alerts-observability.metrics.alerts-default-000001`, the field limit has been upped to 2500 as-is:

```json
{
  "settings": {
    "index": {
      "mapping": {
        "total_fields": {
          "limit": "2500"
        },
        "ignore_malformed": "true"
      }
    }
  }
}
```
When I do a search on that same index to see current fields:

```shell
curl -s -XGET "/.alerts-observability.metrics*/_field_caps?fields=*" | jq '.fields|length'
```

I get 2123. So I think we need to ask whether we feel comfortable with that amount of room when introducing a dynamic field, especially if there could be other reasons these mappings could grow beyond just this one field. Can we bump this number up without causing issues? Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? How confident are we there? I think it's reasonable to assume that number likely would never approach 1000, but we should also know the story of what would happen if a customer did exceed this limit by grouping in an unexpected way.
How do we end up with over 2000 fields? Are we using statically mapped ECS fields instead of ecs@mappings which uses dynamic templates?
@jasonrhodes Thanks for sharing the information about the current limitations. I was wondering if it would be an option to enable dynamic mapping on a cluster similar to one of ours (like QA or any cluster in which we have a lot of alerting rules) and see how many fields will be added. This would give a sample for the question: "Would it be conceivable for a customer to group alerts by, say, 100 different fields? 200? ". We can also use that instance to test manually and see this approach in action.
By adding dynamic mapping, we will save the mappings of ECS fields two times in this case, one at the root level and one for this group by field which does not have an issue by default, just to keep it in mind in case there are possible improvements to reuse mappings at different levels. (I am not sure if there is such a feature.)
Another topic to discuss is whether using dynamic mapping can cause an issue regarding the type of field that will be added dynamically. Would it be possible that the type that is added dynamically does not match the actual type of the field? Can it be an issue?
If a user adds a group by field mistakenly, would that be a problem besides having an extra unused mapping? Is there a possibility that users add many group by fields mistakenly? If yes, What is the process of correction? (Similar to the question, "what would happen if a customer did exceed this limit by grouping in an unexpected way.") What would happen if the user changed the shape of their data and renamed the list of groups by fields by migrating to a new set of field names? Can we have a clean-up process for such a case?
And, since we have `fieldsForAAD` that limits the fields we show in the alert table and the auto-suggestions in the KQL bar, how would it work with dynamic mapping? (Would it work as expected if we add `kibana.alert.groupings.*` or something similar to that list?)
How do we end up with over 2000 fields? Are we using statically mapped ECS fields instead of ecs@mappings which uses dynamic templates?
@dgieselaar I think it is related to using the `.alerts-ecs-mappings` component template for alerting indices, which has the mapping of all ECS fields statically.
@maryam-saeidi are we required to use it? @pmuellr Is this legacy, or can we switch to ecs@mappings? We had this issue years ago when RAC started, and we had very long discussions where IIRC we agreed not to map all ECS fields by default, precisely because of this reason.
@dgieselaar That is currently still the way we introduce ECS mappings into the alerts documents. I will bring up addressing this issue with the team.
@ymao1 yes please, it would be great if we can address it in the short term - it seems counterproductive to have discussions about adding dozens of fields when we're needlessly creating explicit mappings for around 2000 of them, of which we only use a handful by default.
Feels like this conversation is beginning to spin its wheels a bit, putting us at risk of still not having a consistent and usable way to do what we're hoping to do. @ymao1 (cc @kobelb) it would probably be helpful to have a realistic assessment of whether that linked issue sounds viable in the short term.
Observability folks, if we continue to include ~2000 ECS fields in the alerts index mappings, would we still be comfortable introducing another dynamically mapped field for `kibana.alert.grouping`? Can someone confirm the specific downsides of using the flattened type for our various use-cases with this grouping field?
Can someone confirm the specific downsides of using the flattened type for our various use-cases with this grouping field?
AFAIK they don't show up in _field_caps, meaning we also cannot validate the presence of the field before running the query - which is (unfortunately) a necessity for ES|QL. They also don't support anything other than keywords and a subset of queries, which I expect to be mostly fine (that is, until someone uses a boolean or a long for grouping :) ).
FWIW, I don't think we should block this based on the mappings issue. Ideally we just go with dynamically mapped objects, and if the field limit becomes an issue, we have something that will have a much bigger impact (switching to ecs@mapping) rather than us having to use workarounds like flattened fields.
Thanks, @dgieselaar -- @andrewvc I'd love your take on this re: going with a dynamically mapped object.
@jasonrhodes @andrewvc I believe being able to manually test this approach on a cluster similar to one of ours (like QA or any cluster in which we have a lot of alerting rules and related data) and see this approach in action would be beneficial. I am mostly thinking about answering questions like:
And, in general, figuring out any other issues that might show themselves while testing this approach in a real-world scenario. Is this something that we can consider?
Also, we have the following open questions to answer:
It should be fairly simple to run some queries against our overview cluster for a rough idea of how many group by fields are in use there.
I defer to @andrewvc on the rest of these questions. I don't have a great sense of how big the risk is if we move forward with a dynamically mapped field in this kind of shared index space, but I'm also a bit nervous we could overthink things and spend too much time being defensive about it, when most of these questions might already be unanswered with regard to what happens in these same scenarios if the (mostly irrelevant) ECS mappings grow, etc.
A possible alternative is using a painless script to filter out false positives.
Coming back around to this discussion, thanks for the ping @maryam-saeidi !
My current thought is dynamic mapping for some objects in the mappings is probably ok, though we should obviously think it through:

- what is the expected cardinality of the fields?
- how are we currently dealing with the max # of fields `index.mapping.total_fields.limit`, and how does that change?
- how are we currently dealing with the max # of dynamic fields `index.mapping.total_fields.ignore_dynamic_beyond_limit`, and how does that change?
- how do we find out when we hit a limit? what do we do when we hit a limit?
- what happens when a limit is hit, and causes an SDH, and how do we figure out that's the problem?

Guessing that for these grouping fields, we'll be fine. The values are field names, right? So they'd basically become the same key under this new object. I'd guess within a project/deployment, the cardinality is fine. The problems you might expect would be with something like date-based field names; seems unlikely folks would be creating "random" key values.
If we do this, I'm sure we'll do more of this :-). So it would be good to have some experience of what happens when we hit the limits. I suspect it's kinda silent, assuming we can set a limit and have ES continue, presumably ignoring new "fields". And thus an SDH-generator.
@mikecote @ymao1 thoughts?
Beyond the mappings, there was some thought about how the context variables would be accessed, and I think that's a good thing to think about as well. Seems like we would actually want an "ordered dictionary" kind of collection, since I think the current shape doesn't tell you the "order" of the grouping. But since neither JS nor ES support that, do we need a separate array of the keys in grouping order, so someone could iterate over them that way? JS actually does sort of support ordering of properties in objects, but I'd like to not depend on that, as you can lose the orderings in different ways. I'd like to see some mustache template examples accessing these fields.
I think the relevance of the mustache fields is < the mappings; the mappings are hugely important, the mustache fields - we can improve over time, or probably find potentially verbose solutions to whatever shape they are given. But still something to think about.
what is the expected cardinality of the fields? how are we currently dealing with the max # of fields `index.mapping.total_fields.limit`, and how does that change? how are we currently dealing with the max # of dynamic fields `index.mapping.total_fields.ignore_dynamic_beyond_limit`, and how does that change? how do we find out when we hit a limit? what do we do when we hit a limit? what happens when a limit is hit, and causes an SDH, and how do we figure out that's the problem?
I think the concern is that the number of dynamic fields underneath the group object could grow unbounded and we could reach the total fields limit at which point the only resolution would be to increase the total fields limit which would require a new release. This is something we used to deal with in the rule registry when the index resources were installed as needed. Since we started installing alert resources on Kibana startup, we are able to catch these issues during development time (reaching the total fields limit).
However, if the number of groups directly correlates with the number of alerts that could be generated during a single execution, we do already cap that value (default 1000) so maybe that's ok? Or does the possible groups correlate with all possible groups that have been cumulatively generated by the rule, in which case it could grow unbounded? I don't have a good sense of this here.
I think the concern is that the number of dynamic fields underneath the group object could grow unbounded and we could reach the total fields limit at which point the only resolution would be to increase the total fields limit which would require a new release. This is something we used to deal with in the rule registry when the index resources were installed as needed. Since we started installing alert resources on Kibana startup, we are able to catch these issues during development time (reaching the total fields limit).
Ya, I guess we will need some estimate on the cardinality, and then increase the current max we have by that amount.
However, if the number of groups directly correlates with the number of alerts that could be generated during a single execution, we do already cap that value (default 1000) so maybe that's ok? Or does the possible groups correlate with all possible groups that have been cumulatively generated by the rule, in which case it could grow unbounded? I don't have a good sense of this here.
I believe the correlation is how many fields they group by, over all rules in a single "index". So, if they only ever grouped by the same 3 fields over all their rules, there would be 3 new fields.
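For example, a sketch of what the dynamic mapping could converge to if all rules only ever group by `host.name` and `service.name` (the `kibana.alert.grouping` name is assumed; the `keyword` subfields assume a dynamic template, since plain dynamic mapping would map strings as `text` with a `keyword` subfield):

```json
{
  "kibana.alert.grouping": {
    "dynamic": true,
    "properties": {
      "host.name": { "type": "keyword" },
      "service.name": { "type": "keyword" }
    }
  }
}
```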
@ymao1 @pmuellr if we run into the field limit we have https://github.com/elastic/kibana/issues/168497. That issue has been open for a year, and I'd like to use any objections around this as a forcing function to actually get that ticket done - the value of getting that over the finish line seems immense and immediately makes this discussion much simpler. I remember I spent a lot of time during RAC arguing about why statically mapping ECS fields in all AAD indices is a bad idea. The conclusion back then was that only Security AAD indices would do this so I'm not sure why this has now been applied to all AAD indices, including the Observability ones. I am pushing on this because it does not make sense to me to have a discussion about adding a few dozen of fields (tops) when we have the opportunity to cut back the amount of mapped fields by two orders of magnitude.
@pmuellr your questions make sense, however, they are problems that exist today. I don't expect this feature to materially change the amount of mapped fields. For reference, the cardinality of `kibana.alert.group.field` in the overview cluster is 19, for 1 million alerts.
@dgieselaar I'll move that issue back into triage and we'll see if we can prioritize it.
Apologies for missing the previous pings. I'm +1 on reducing the current field usage along the lines @dgieselaar proposes and also on the dynamic mappings. I think it's the best balance of flexibility and performance.
@pmuellr @ymao1 do we have any rough estimates in terms of effort / time to deliver here?
Hi everyone,
I created a PoC to test dynamic mapping and did some tests and here are my findings:
Without `index.mapping.total_fields.ignore_dynamic_beyond_limit` enabled (PR), we will get the following error, and that alert will not be reported (the rule execution is successful):

```
Error writing alerts for observability.rules.custom_threshold:53301b6a-5bb6-42ee-a8ba-ac3d352dbaf7 'Custom threshold'
```

If we enable `index.mapping.total_fields.ignore_dynamic_beyond_limit`, then the extra fields will be saved, but they will not be mapped, and we will see these fields in the `_ignored` field as shown below:
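For reference, the setting is an index-level toggle, something like:

```json
{
  "settings": {
    "index.mapping.total_fields.ignore_dynamic_beyond_limit": true
  }
}
```

With it enabled, a search hit whose dynamic grouping subfield exceeded the limit would list that field in its `_ignored` metadata (for example `"_ignored": ["kibana.alert.grouping.some.field"]`, field name illustrative), which is how the unmapped fields can be detected.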
In general, I see the main issue with hitting the limit as not being able to search the new fields, but the alerts will still be generated, and we can see the data in the alert flyout (if we enable `index.mapping.total_fields.ignore_dynamic_beyond_limit`), so maybe we can consider moving forward with dynamic groups even before having dynamic ECS fields mapping, especially considering the sample data provided above:

- The cardinality of `kibana.alert.group.field` is 19 in the overview cluster (comment)

I also checked the overview cluster, and here are the numbers of fields for different alert indices (I used `/_field_caps?fields=*` for this purpose):
Index | Number of field mappings | Number of available mappings
---|---|---
.alerts-observability.metrics* | 2121 | 379
.alerts-observability.threshold* | 2091 | 409
.alerts-observability.logs* | 2121 | 379
maybe we can consider moving forward with dynamic groups even before having dynamic ECS fields mapping
We discussed this in a ResponseOps call last week, and concur. They are similar but different, and I think we will get a bit more experience with the dynamic fields by starting with just the dynamic groups.

Seems like we will want to add something to the framework alert writer, to catch the `_ignored` fields (don't think we do today), and surface them somehow.
I'm catching up with the issue, and one question I don't see asked is why we are not using the alert ECS mappings at the root level to accomplish this story? I'm sure there's a reason for it, but it's not clear to me after reading the use cases on the GitHub issue.
- The field should be searchable/queryable reliably without false positives
- Use in action template of "Summary of alerts" action frequency (described in https://github.com/elastic/kibana/issues/183248#issuecomment-2107348632 below) without relying on index
The root fields seem to work for this use case, and I'm assuming the values are already populated at the root level today?
- Auto-suggestion on KQL bar should suggest this field
Maybe it is this requirement that needs something special. Is there something where we only want auto-complete on the group by fields?
We structured the alert documents in a way that this data can be surfaced at the root and then leverage this structure for maintenance windows and conditional actions. It would feel inconsistent if we provided multiple sources to accomplish the same thing.
why we are not using the alert ECS mappings at the root level to accomplish this story? The root fields seem to work for this use case, and I'm assuming the values are already populated at the root level today?
Yes, that is correct; we already saved the group by keyword ECS fields at the root level. The issue is related to handling groups that are not ECS fields. For example, in Otel data, we have k8s.cluster.name instead of orchestrator.cluster.name.
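A sketch of what that could look like with the dynamically mapped object (field name and value illustrative): the non-ECS Otel attribute simply becomes a subfield of the grouping object instead of requiring a root-level ECS mapping:

```json
{
  "kibana.alert.grouping": {
    "k8s.cluster.name": "prod-cluster-1"
  }
}
```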
Thanks @maryam-saeidi. I discussed with the team, and we feel comfortable if we find a way to implement this change for `kibana.alert.group` within the following constraints:

- Guardrails that wouldn't allow the number of fields under `kibana.alert.group` to go beyond, say, 25.
- Coerce values to `keyword` to prevent type mismatches.

If you're good within those constraints, we'd be happy to have you or someone else prototype this for the team to review.
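For the `keyword` coercion constraint, a dynamic template scoped to the grouping object might work. A sketch (the `kibana.alert.grouping` name is assumed; note this handles type coercion but not the cap on the number of fields, which would still need enforcement in Kibana code):

```json
{
  "mappings": {
    "dynamic_templates": [
      {
        "grouping_values_as_keyword": {
          "path_match": "kibana.alert.grouping.*",
          "mapping": { "type": "keyword", "ignore_above": 1024 }
        }
      }
    ],
    "properties": {
      "kibana.alert.grouping": { "type": "object", "dynamic": true }
    }
  }
}
```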
I'm +1 on what @mikecote proposes, I'm curious about how we'd enforce the guardrail, if we have a place where we can easily do that validation that's great, I'm just curious where it'd go.
@mikecote FWIW, Mary already put up a POC here: https://github.com/elastic/kibana/pull/199298. Is your ask to include some kind of guardrails in that POC?
FWIW, in SLOs the fields are called `slo.groupings.*`. I like singular better than plural (which is how we commonly name fields) but maybe it's more important to be consistent with the SLO field names in this case?
Echoing what Dario mentioned above, I did a PoC, and I tried to find an option to limit the number of dynamic mappings for a field but didn't see such an option. Any proposal on how we can achieve that?
FWIW, in SLOs the fields are called slo.groupings.*. I like singular better than plural (which is how we commonly name fields) but maybe it's more important to be consistent with the SLO field names in this case?
Good point! If we have a similar definition of saving group information in `slo.groupings`, using the same name might not be a bad idea.
I tried to find an option to limit the number of dynamic mappings for a field but didn't see such an option. Any proposal on how we can achieve that?
I don't have an idea at this time. I think that would be the last piece left for the PoC: to find a way to guardrail so we guarantee only a limited number of fields get mapped. Is that something you could take time to research? I would be curious to see what options exist for how this could be done on the Kibana side or the Elasticsearch side.
Currently we have `kibana.alert.instance.id` in all alerts, which saves comma-separated group values in the alert document. We would like to have a field that provides information in the form of {field, value} pairs, and allows for individual {field, value} pairs to be searchable/queryable in the alert document. The requirement of this field is discussed in the RFC here.

Based on the discussion in the above RFC, the Custom threshold rule saves group information in AAD with the `kibana.alert.group` field, which is an array of `{ field: field-name, value: field-value }`.

We need to streamline the method of saving group information in AAD across all Observability rules.
Use cases
Rules where group info should be saved in its dedicated field in alert document:

- `kibana.alert.group` array
- `kibana.alert.group` array
- `kibana.alert.group` array
- `kibana.alert.group` array

Acceptance criteria