[RAC] RFC: Index naming and hierarchy

banderror commented 3 years ago

Summary

There's more and more questions and concerns being raised regarding rule monitoring implementation for RAC. I'm working on https://github.com/elastic/kibana/pull/98353 which implements "event log" abstraction within rule registry that will be used for writing and reading both alerts and rule execution logs.

This RFC proposes some naming and structure for RAC indices, hierarchy for rule registries, and lists a few open questions and concerns.

Proposal

Index aliases naming convention would be similar to Elastic data stream naming scheme:

{prefix}-{consumer}.{additional.log.name}-{kibana space}

where:

{prefix} will be .alerts by default. Users will be able to override it in Kibana config and set it to any other value, e.g. .alerts-xyz or .whatever (for compatibility with legacy Kibana multitenancy).
{consumer} will indicate the rule/alert consumer in terms of Alerting framework (more info). We will probably have security, observability and stack as consumers.
{additional.log.name} will be used to specify concrete indices for alerts, execution events, metrics and whatnot. We will probably have only alerts and events as concrete logs. It might be good to have short and sweet names without - or . in them (perhaps _ is ok).
{kibana space} will indicate space id, for example default. In case of space agnostic alerts, the no.space placeholder can be used.

Examples of concrete index names

For clarity, this section contains the concrete index names that are created the Security and Observability solutions.

Security:

.alerts-security.alerts-{kibana space}-000001        // (1) 
.alerts-security.events-{kibana space}-000001        // (2)

(1) The Alert documents that support human workflow and are updatable. (2) Rule-specific execution events and metrics created by Security rules to enhance our observability of alerting.

Technically it will be possible to derive child logs from these alerts and events, e.g. .alerts-security.alerts.ml-{kibana space}. Although we don't think that we need this in Security at this point.

Observability:

.alerts-observability.{apm,uptime,metrics,logs}.alerts-{kibana space}-000001   // (1)
.alerts-observability.{apm,uptime,metrics,logs}.events-{kibana space}-000001   // (2)

.alerts-observability.metrics.alerts-no.space-000001   // (3)

(1) The Alert documents that are updatable. Same exact semantics as for Security (1) (2) Supporting documents (evaluation) for the Alerts + execution logs to be used for the Observability of Alerting (3) Example of space agnostic alert index. This can be used by the space-agnostic Rule types like the Stack Monitoring might need. The no.space "space" is not a valid Kibana space name, so this pattern can be used as a placeholder.

Diagram

Here's a structural diagram showing some rule execution dependencies in the context of RAC and how the proposed indices fit the whole picture:

RAC Rule execution dependencies, indices@2x

diagram source

banderror commented 3 years ago

@spong @dgieselaar @tsg @jasonrhodes @kobelb @XavierM @yctercero @dhurley14

please review 🙂

dgieselaar commented 3 years ago

Thanks @banderror for putting this up! Couple of questions:

Do we need the version in the index name? I added this, but I copied it from the event log, and I'm not sure whether we actually need it. Can you think of any scenarios?
I think we're broadly in agreement that it will be .alerts-* by default, with a configuration option to point it to a different index, e.g. to match kibana.index. Do you mind updating the RFC to reflect that?
What do the index names look like for stack rules?
Do we need separate indices for alerts/events? I think it makes sense. But it makes some things a little harder, like figuring out the index target. I assume that whoever has access to e.g. security alerts, would have access to other events (rule monitoring, evaluations, state changes) as well. If that is the case, the user would have to be granted permission to .alerts-*-security, and that asterisk is greedy I think, so administrators might accidentally grant broader privileges than intended. Something like .alerts-security*, or .alerts-security-myspace* is perhaps more reasonable. We could also use .alerts-security.alerts-myspace and .alerts-security.events-myspace. Ideally we can stay close to the data stream naming scheme, if we decide to switch to that at some point.

spong commented 3 years ago

With regards to 1. The problem with .kibana- prefix. In async discussions it was determined with a fair degree of confidence that using .alerts-* indices are going to work for our use cases. We will provide a user-configurable kibana.yml override value to override this, similar to kibana.index, which may only be available in 7.x, but will allow legacy multi-tenancy users who have segmented their Kibana entities via kibana.index to continue to do that, so long as they also specify this new configuration option as well. Setting it to, for example, "xyz", will store alerts in .alerts-xyz-*, etc. RBAC implications here would require T-Grid to have a toggle for using the Kibana Security Model (feature privileges and kibana_system user), or ES Index Privileges. Will need to scope this accordingly with the RBAC efforts.

tsg commented 3 years ago

If that is the case, the user would have to be granted permission to .alerts--security, and that asterisk is greedy I think, so administrators might accidentally grant broader privileges than intended. Something like .alerts-security, or .alerts-security-myspace* is perhaps more reasonable. We could also use .alerts-security.alerts-myspace and .alerts-security.events-myspace. Ideally we can stay close to the data stream naming scheme, if we decide to switch to that at some point.

++, I like something like .alerts-security.events-myspace and .alerts-security.alerts-myspace so security shows up before event/alerts. The . is to stay close to the datastream naming model, like @dgieselaar suggested.

It might be good for all registries to have a short and sweet name without - or . in it (perhaps _ is ok).

What do the index names look like for stack rules?

Somehting like .alerts-stack.alerts-default? or .alerts-core.alerts-default?

Do we need the version in the index name? I added this, but I copied it from the event log, and I'm not sure whether we actually need it. Can you think of any scenarios?

I'm not sure on this one, the data streams don't include the version so I'm thinking to start without first. We can add it later if we really need it.

banderror commented 3 years ago

Thank you for comments, this is very helpful 👍

Kibana version in the name

Do we need the version in the index name? I added this, but I copied it from the event log, and I'm not sure whether we actually need it. Can you think of any scenarios?

I'm not sure on this one, the data streams don't include the version so I'm thinking to start without first. We can add it later if we really need it.

I just followed the existing implementations as well, so not sure. Off the top of my head I'd imagine this version number could be helpful for document migrations. On the other hand, migrations could be built on top of a different number specifically used for tracking changes in the schema. For example, migration system for .siem-signals index uses a version of the corresponding index template. When we bump this version in the code, documents get reindexed into a new index with the new template, and the alias is being updated to point to this new index. I'm not very familiar with this implementation, but I think this is the rough idea behind it.

Other than that, I don't have any ideas regarding use cases for Kibana version in the index name. I would support your suggestions and remove it for now 👍

Multiple indices for alerts, rule execution events etc

Do we need separate indices for alerts/events?

I think yes, but I'm also open for any objections. Why I'd say separate indices are a better option:

Alerts and execution events will likely have very different schemas: subset of standard ECS, custom fields added, and so the resulting ES mappings will be different. At least this is true for Security.
Execution events for Observability and Security will probably have slightly different schemas as well. They will have something in common and also something specific to the solution. So I think it would be good to define a common alerts schema, a common execution events schema, build a root registry using them, and then extend those schemas in the child solution registries. We need a common schema for the unified app. And specifics to be used within solutions. Having separate schemas seems to be cleaner and easier to maintain.
Separate types of data in separate indices = more lightweight indices, more efficient queries.
At this point we don't have any UI where we would combine alerts and execution events in a single list or table.
Ability to define different ILM policies for alerts and execution events. In Security, I think we need to store alerts for as long as possible. While execution events like status changes, metrics like how much time was spent for indexing etc - this is probably not something that needs to be kept for a long time.

Naming in general

I like the suggested data stream naming scheme! So if I got it right, this is what we're gonna have:

{alerts prefix}-{name of the log}-{kibana space}

{alerts prefix} will be .alerts by default. Users will be able to override it in Kibana config and set it to any other value, e.g. .alerts-xyz or .whatever.

{name of the log} will represent the hierarchy of logs used in rule registries. Name will be a combination of parts, parts will be concatenated with .. The first part will be solution name, the second one - type of the log.

Examples: security.alerts, security.events (or security.execlog?), observability.alerts etc.

My questions regarding this naming:

Do you think .events is a good name? Looks a bit ambiguous to me, in the sense that "events" in terms of detection engine are the source objects that detection rules are executed against.
Are you expecting to have child logs in observability, e.g. observability.events.{rule type} or observability.alerts.{service name} or something like that?
Do you think {kibana space} should be always at the and of the index alias? Could there be arguments for {alerts prefix}-{name of the log}-{kibana space}-{name of the child log}? Just want to make sure we're not missing anything here. This naming convention may impact the implementation of https://github.com/elastic/kibana/pull/98353

It might be good for all registries to have a short and sweet name without - or . in it (perhaps _ is ok).

Agree 👍 I will add this check to the implementation.

Stack rules

What do the index names look like for stack rules?

Somehting like .alerts-stack.alerts-default? or .alerts-core.alerts-default?

I'm not really aware of any requirements for stack rules, maybe I've missed that part. To clarify, stack rules are the rules that can be created from the Stack Management UI (/app/management/insightsAndAlerting/triggersActions/rules)? I can see there are many different types of them, maybe these types would require dedicated indices, or maybe not.

.alerts-stack.alerts-{space} or .alerts-core.alerts-{space} sounds good to me. Or .alerts-stack.alerts.{rule type}-{space} or something like that.

Do we have any requirements/plans for stack rules? In terms of RAC, stack rules == rules which will be created directly from the unified alerting app?

banderror commented 3 years ago

Oh yes, and I will of course update the RFC, just want us to agree on most of the details. Thanks again for asking all these questions and giving suggestions, this is a lot of helpful info for me.

@dgieselaar @spong @tsg

pmuellr commented 3 years ago

Trying to understand the difference between the "alerts" and "execlog" and current event log indices.

I assume "alerts" is intended to hold data regarding the alert being run - for index threshold, that would include the threshold being tested again, the value calculated from the es aggs call to compare to the threshold, etc. And so the "execlog" indices would be like the current event log, which just capture the execution times/duration, status, etc.

Which I think means the event log itself becomes unneeded, eventually.

But I'd like to understand the field differences between the event log and execlog. Because I'm wondering if we can live with the current event log for now, especially given the following:

At this point we don't have any UI where we would combine alerts and execution events in a single list or table.

pmuellr commented 3 years ago

re: kibana version in index name

We did this for the event log, because it solved a problem for us, and we noticed other Kibana apps doing this - something in o11y, but not sure exactly what.

The problem is: "migrating" indices when Kibana is upgraded. Obviously (I hope), we weren't planning on doing ".kibana -style" migrations of the event log, but we were worried about situations where we might want to change index mappings. Would we hit scenarios where we'd want mapping changes that wouldn't work well with the existing data, potentially requiring a re-index (and even that might not be able to "fix" something)?

Adding a version to the name makes this problem go away! We always create a new index template, alias, and initial index when Kibana is updated. And then we end up using .kibana-event-log-* in queries over ALL the different versions.

Obviously, this doesn't handle every case. If the structure between Kibana versions changes "too much", we could be in a position where we wouldn't be able to validly query old data, and similar sorts of problems. But this felt like the best thing to do, back when we wrote this.

If we want to explore not using the version in the index name, then I think we need to have a really good story for what happens when the mappings change when Kibana is updated. Haven't thought too much about this (since it's not a problem for the event log).

ymao1 commented 3 years ago

What do the index names look like for stack rules?

@dgieselaar Is this question driven by the desire to view/manage alerts from stack rules and alerts from within each solution? If so, is it just limited to stack rules? Is there the desire to view/manage alerts from o11y rules inside security and security alerts from o11y? At one point, I saw the suggestion to determine the index to write to based on the consumer of the alert, not the producer. Is that something that is up for consideration?

tsg commented 3 years ago

At one point, I saw the suggestion to determine the index to write to based on the consumer of the alert, not the producer. Is that something that is up for consideration?

@ymao1 ++, I think using the consumer makes more sense. This way create alerts directly in the indices where we need them, and we don't have to query across solutions.

tsg commented 3 years ago

@pmuellr For eventlog, do you store the full version including the patch level (e.g. 7.11.2)? This is what Beats used to do before the new indexing strategy. It does help on upgrades, but it can mean creating a lot of indices in case of frequent upgrades.

The question is if we really need it, or is it enough to trigger an ILM rotation when we do an upgrade that changes the mapping. This might get complicated if we have multiple Kibana instances and they aren't upgraded all at once (is that permitted?).

FWIW, when the index naming strategy was discussed, I pressed on the need of including the version: https://github.com/elastic/observability-dev/issues/283#issuecomment-527372212

My only reason for going first without is that this is what the new indexing strategy also does the same, and it can be added later.

gmmorris commented 3 years ago

At one point, I saw the suggestion to determine the index to write to based on the consumer of the alert, not the producer. Is that something that is up for consideration?

@ymao1 ++, I think using the consumer makes more sense. This way create alerts directly in the indices where we need them, and we don't have to query across solutions.

That's good to hear, as this questions is one of the main blockers for migrating Stack Rules to Alerts-as-Data. Can we proceed with the assumption that consumer will be in the index name? Who should we work with to make sure this is reflected in the Rules Registry/ Alert indices?

banderror commented 3 years ago

Regarding consumer vs producer, could you explain what that actually means? What would be examples of producers and consumers in terms of RAC?

From the current code of detection engine, I can see that:

when we register our rule types (siem.signals and siem.notifications), we specify producer: 'siem'
when we create rule instances, we specify consumer: 'siem'

So seems like for our rules producer and consumer are always the same thing - our app.

Is that assumed to change in some way? Would stack rules be able to generate alerts for solutions? Does the naming already discussed here (.alerts-{solution}*) fit, I mean can we treat solution name as a consumer name?

@tsg @ymao1 @gmmorris

banderror commented 3 years ago

Who should we work with to make sure this is reflected in the Rules Registry/ Alert indices?

@spong @dgieselaar @banderror :) I will incorporate all the feedback from this RFC to https://github.com/elastic/kibana/pull/98353

banderror commented 3 years ago

@spong regarding version in the index name and Tudor's comment https://github.com/elastic/kibana/issues/98912#issuecomment-831550526

I'd say maybe we should stick to the same approach as we already have in the .siem-signals implementation - unless there are any known issues with it and our signals migration?

ymao1 commented 3 years ago

@banderror The alerting framework maintains the idea of producer and consumer, where the producer is the solution creating the rule type (security, uptime, apm, stackRules, etc) and consumer is essentially the location within Kibana where the user is creating the rule of that type. You are correct in that right now, there are not many examples of rules with different consumers and producers, as security rules defined by security are only created and viewed within security. I believe we want to allow for this capability though, where within security, a user could create a rule of either a security rule type or a stack rule type. If we allow for this, we would want the user to see alerts from stack rules that were created from within the security solution.

If the index schema is based on the producer where security produced rules are written to .alerts-security-* and stack produced rules are written to alerts-stack-*, then the RAC client would need to broaden the indices that it queries over to get all alert data for rules created within a consumer (solution). In addition, if we're giving users privileges to specific index prefixes in order for them to create ad-hoc visualizations, we might be limiting the alerts they see in that manner as well.

banderror commented 3 years ago

Oh I see that now, thank you @ymao1 for the clear explanation.

I believe we want to allow for this capability though, where within security, a user could create a rule of either a security rule type or a stack rule type. If we allow for this, we would want the user to see alerts from stack rules that were created from within the security solution.

Gotcha. Maybe it means that a rule type (a stack rule type in this case), instead of indexing alerts directly, will need to use some kind of an indexing "strategy" injected into it, which would know how to properly index the alert into the destination alerts-as-data index (and would respect the document schema and mappings). Otherwise we would need to have the same mappings in all alerts-as-data indices which I'm not sure would it be feasible or not.

Do you think this naming might work .alerts-{consumer}.alerts-{kibana space}? We will probably still have security, observability and stack as consumers?

banderror commented 3 years ago

Regarding Kibana version in the name and migrations/rollovers. This is how it's implemented in Security for .siem-signals index.

Basically, we don't have Kibana version in the index name, but instead we maintain index template version in the code, and we have two mechanisms for two cases which both use this version:

There's an index rollover logic for non-breaking changes. For example, if we add a new field to the index template, we bump its version and the app will trigger index rollover automatically - without any explicit action required from the user.
There's a migration logic (route, reindexing) for breaking changes. If we introduce breaking changes to the schema/template, then the users will have to call the migration API. The API does actual reindexing of documents into a new index, not just simple rollover.

We could use the same or similar approach for RAC indices. Or maybe there can be cons to that like

This might get complicated if we have multiple Kibana instances and they aren't upgraded all at once (is that permitted?).

dgieselaar commented 3 years ago

re:

This might get complicated if we have multiple Kibana instances and they aren't upgraded all at once (is that permitted?).

We could alleviate this concern by scheduling a task, as that is guaranteed to be picked up by a single Kibana instance (I think Gidi or Patrick suggested that).

dgieselaar commented 3 years ago

@banderror The alerting framework maintains the idea of producer and consumer, where the producer is the solution creating the rule type (security, uptime, apm, stackRules, etc) and consumer is essentially the location within Kibana where the user is creating the rule of that type. You are correct in that right now, there are not many examples of rules with different consumers and producers, as security rules defined by security are only created and viewed within security. I believe we want to allow for this capability though, where within security, a user could create a rule of either a security rule type or a stack rule type. If we allow for this, we would want the user to see alerts from stack rules that were created from within the security solution.

If the index schema is based on the producer where security produced rules are written to .alerts-security- and stack produced rules are written to alerts-stack-, then the RAC client would need to broaden the indices that it queries over to get all alert data for rules created within a consumer (solution). In addition, if we're giving users privileges to specific index prefixes in order for them to create ad-hoc visualizations, we might be limiting the alerts they see in that manner as well.

++ on this - which is why I think we should always include all technical fields in the shared component template. That should hopefully be a few dozen only. But that could allow users to point any rule to any index, and all the other stuff would be metadata, which may or may not need a runtime field to be queryable.

gmmorris commented 3 years ago

re:

This might get complicated if we have multiple Kibana instances and they aren't upgraded all at once (is that permitted?).

We could alleviate this concern by scheduling a task, as that is guaranteed to be picked up by a single Kibana instance (I think Gidi or Patrick suggested that).

That was me, but even that's not super obvious, we'd need to understand the exact flow you want to support.

BTW rolling upgrades are not supported, so it's less about instances being upgraded separately, and more about more than one instance being booted at the same time.

banderror commented 3 years ago

I updated the proposal in the description based on all your feedback. Thank you! Let me know if I forgot to mention anything.

dgieselaar commented 3 years ago

Fwiw, consumer is currently not Observability for rule types, but APM/Uptime etc. My suggestion would be to not tightly couple this to the the alerting framework's interpretation of consumer. Generally, I feel we should avoid technically depending on the index name, and treat it as a scoping mechanism for allowing administrators to more easily grant access to subsets of data. Preferably we use a query when we query alerts instead of reconstructing the entire index alias. Not sure if that is being suggested here, but wanted to call that out.

spong commented 3 years ago

++ on this - which is why I think we should always include all technical fields in the shared component template. That should hopefully be a few dozen only. But that could allow users to point any rule to any index, and all the other stuff would be metadata, which may or may not need a runtime field to be queryable.

@dgieselaar, is the thought then that solutions would need to explicitly allow-list/enable which stack rules they support and then we'd combine those component templates with the solution-specific component templates so the solution indices have all the necessary fields to support stack rules? Or would solutions just include _all stack rule component templates by default so there's no ambiguity between which solutions support which stack rules?

@banderror -- updated RFC LGTM! 👍 May want to have a section with regards to storing version in _meta as opposed to the index name as you detailed here, but other than that I think we might be good to go! 🙂

gmmorris commented 3 years ago

Fwiw, consumer is currently not Observability for rule types, but APM/Uptime etc. My suggestion would be to not tightly couple this to the the alerting framework's interpretation of consumer.

@dgieselaar - Wouldn't diverging here make it far harder though? RBAC is already a complicated mechanism, if we start diverging on this (using something other than FeatureID in consumers/producers) we're adding another moving part to this mechanism.

I'm not necessarily objecting here, but flagging that this would come at a cost to maintainability/reliability, and we should step with caution.

cc @ymao1

dgieselaar commented 3 years ago

@gmmorris My concern is only about the index/alias name, not necessarily about what ends up in the data, or how it is functionally interpreted. If we do end up hard-coupling the index name to consumer, it would be .kibana-alerts-apm*, not .kibana-observability-apm-*. Maybe that's okay though.

dgieselaar commented 3 years ago

@spong:

is the thought then that solutions would need to explicitly allow-list/enable which stack rules they support and then we'd combine those component templates with the solution-specific component templates so the solution indices have all the necessary fields to support stack rules? Or would solutions just include _all stack rule component templates by default so there's no ambiguity between which solutions support which stack rules?

The latter. But in my head these would be technical fields only. Like alert.id, alert.threshold.value, alert.severity.level, alert.building_block, etc. So maybe a few dozen, but that is very much a number that I'm making up on the spot.

tsg commented 3 years ago

I was thinking that there will be an allow-list of "foreign" rule types that can be instantiated in the Solutions. For example, I expect the Security solution to make use of the ES rule type, the ML one, maybe maps rules, maybe the threshold rule type from Observability. I think the Observability solution will find the EQL/correlation rule from Security useful, the custom query one, the ES rule type, etc.

@dgieselaar

The latter. But in my head these would be technical fields only. Like alert.id, alert.threshold.value, alert.severity.level, alert.building_block, etc. So maybe a few dozen, but that is very much a number that I'm making up on the spot.

This is where the ECS by default discussion is coming up again, some of these rules will be a lot more useful if we have ECS as indexed fields. We can rely on run-time fields for them, but that might add complexity to the code.

However, in practice:

The Security solution has all ECS anyway, which means that it can "receive" any foreign rule type
The Observability solutions can decide what makes sense for them. For example, APM might not need to use external rule types, so a "slim" index mapping fits better.

I was thinking that ECS everywhere will simplify the mental model, but if the implementation is reasonable, I'm good with letting each solution choose.

dgieselaar commented 3 years ago

@tsg these technical fields are mostly not in ECS (yet), It doesn't really feel like it's the same problem to me. Maybe that could be the case in the future though.

gmmorris commented 3 years ago

I was thinking that there will be an allow-list of "foreign" rule types that can be instantiated in the Solutions. For example, I expect the Security solution to make use of the ES rule type, the ML one, maybe maps rules, maybe the threshold rule type from Observability. I think the Observability solution will find the EQL/correlation rule from Security useful, the custom query one, the ES rule type, etc.

I know we're talking about the ECS fields here, not RBAC, but I feel it's worth explicitly stating that RBAC does require each type to be explicitly granted to a role. So, for example, if Security want their users to have access to the ES Query Stack Rule when they're inside of Security Solution, then ES Query will have to be explicitly stated in the SecuritySolution all/read privileges.

tsg commented 3 years ago

My concern is only about the index/alias name, not necessarily about what ends up in the data, or how it is functionally interpreted. If we do end up hard-coupling the index name to consumer, it would be .kibana-alerts-apm, not .kibana-observability-apm-. Maybe that's okay though.

I think it's ok to be loosely based on the consumer, rather than the exact consumer name. So it's ok to have .alerts-observability_apm.alerts-* instead of just .alerts-apm.alerts-*. Note the _ which we need to fit into the naming scheme proposed by this RFC.

The reason is just to make it easier to follow for our users.

kobelb commented 3 years ago

Shard-Sizes

Won't having all of these components in the index-name result in us having a bunch of rather small indices and as a result shards? The official Elasticsearch guidance is that we should aim for each shard to be between 10GB and 65GB and to aim for 20 shards or fewer per GB of heap memory.

Also, per the Elasticsearch guidance (source):

Unfortunately, there is no one-size-fits-all sharding strategy. A strategy that works in one environment may not scale in another. A good sharding strategy must account for your infrastructure, use case, and performance expectations.

We aren't going to be able to predict the perfect sharding strategy for all users because their use-cases will differ. We might have a user with a ton of alerts for the same consumer in the same space, or we might have a user with a few alerts with different consumers spread out among hundreds of Spaces. This is why I think it's important to give our users some control over what indices are created and to start with sensible defaults.

`.siem-signals-${spaceId}` during 7.x

Are we going to continue to use the .siem-signals-${spaceId} for the remainder of 7.x? This issue makes it sound like we won't. However, if we stop using the .siem-signals-${spaceId} indices in the same way that we previously have, this will be a breaking change. Users already have Dashboards, Visualizations, and Roles that are tied to these indices in the current format.

spong commented 3 years ago

Are we going to continue to use the .siem-signals-${spaceId} for the remainder of 7.x? This issue makes it sound like we won't. However, if we stop using the .siem-signals-${spaceId} indices in the same way that we previously have, this will be a breaking change. Users already have Dashboards, Visualizations, and Roles that are tied to these indices in the current format.

The goal is to move to .alerts-security-solution* in 7.15 and to provide backwards compatibility to .siem-signals-* through field aliases, so the Detections components/API's will then query against both indices for the remainder of 7.x, and we'd deprecate .siem-signals-* in the transition to 8.x. This should ensure no breaking changes within the Security Solution App/API's, however existing Dashboards/Visualizations may need to be updated to include the new alerts index (though we should be able to use index aliases here, no?). With the introduction of Kibana Feature privilege based-RBAC on .alerts, there shouldn't be any additional role changes necessary, but we will have to alert the user if they're using fine-grained index privileges that they'll need to follow the necessary instructions to bypass the Kibana Feature Privileges in favor of ES Privileges (if delivering on the ability to switch between the two).

I've opened https://github.com/elastic/kibana/issues/100103 that outlines this effort, and we can start testing these different methods as soon as https://github.com/elastic/kibana/pull/96015 is merged. My largest concern right now (so long as the aliasing works without issue) is for users with complex index level permissions, as they'll need to modify their role and enable the bypass from Kibana Feature Privilege based RBAC to ES Index Privileges. If this is determined to be an unsuitable breaking change I suppose we'll need to explore alternate options for maintaining existing functionality through 7.x

jasonrhodes commented 3 years ago

We aren't going to be able to predict the perfect sharding strategy for all users because their use-cases will differ. We might have a user with a ton of alerts for the same consumer in the same space, or we might have a user with a few alerts with different consumers spread out among hundreds of Spaces. This is why I think it's important to give our users some control over what indices are created and to start with sensible defaults.

I agree here but don't know if I understand what the sensible default is? I am a little alarmed at how much we have to manage with index naming, from sharding resource usage to RBAC control to direct index permission flexibility to user comprehension to query optimization … it's a very heavy lift.

banderror commented 3 years ago

@kobelb Regarding shard sizes

We aren't going to be able to predict the perfect sharding strategy for all users because their use-cases will differ. We might have a user with a ton of alerts for the same consumer in the same space, or we might have a user with a few alerts with different consumers spread out among hundreds of Spaces. This is why I think it's important to give our users some control over what indices are created and to start with sensible defaults.

Could you elaborate on what defaults and control this might be for example?

Regarding performance considerations in general, and related space id in the name vs no space id:

@tsg was going to chat with Elasticsearch developers regarding performance implications of having 100+ spaces in Kibana and thus 100+ .siem-signals and other indices.
Nobody from Security solution devs knows of any SDHs/issues caused by the number of indices.
I think trying to argue about performance in theory might (mis)lead to wrong conclusions :) Performance-related discussions require benchmarking. I think we could do some benchmarking for RAC implementation when it's partially ready - including testing performance on instances with 100+ spaces. We could event do it now for existing Detection Engine (.siem-signals-<space> indices), but that would not be 100% relevant test.

kobelb commented 3 years ago

Could you elaborate on what defaults and control this might be for example?

For sure. What follows ignores backward compatibility with the .siem-signals-${spaceId} indices because I don't think that we should be putting the spaceId in the index-name. I don't think this approach is sustainable long-term because of performance, user-experience and it doesn't work with alerts that are shared in multiple spaces. If we can agree on the long-term solution, I'm hopeful that we can find a stop-gap solution to keep the .siem-signals-${spaceiId} indices working for the short-term.

By default, I think that all "mutable alerts" should be written to a .alerting-alerts-default datastream that has an ILM policy with a default hot-phase that rolls over after 30 days or 50 GB; and all "immutable events" should be written to a .alerting-events-default datastream that also has the same ILM policy. Users will be able to update the default ILM policies if they need to change the rollover settings or change their retention policy for all alerting data.

At some point, treating all alerts and events the same will likely cause issues, and at that point, we should allow the user to implement a different "sharding strategy". To do so, we should take advantage of the "namespace" of the datastream and allow the user to create new "Alerting namespaces" that are reflected in the datastream names: .alerting-alerts-${namespace} and .alerting-events-${namespace}. We have the option of automatically creating new ILM policies or we can have the default ILM policies apply to these indices as well. Then, we should allow the users to specify the alerting "namespace" per Alerting rule, or per space.

This provides the user with the flexibility to adapt their index usage to their specific Alerting usage after providing sane and safe defaults.

I think trying to argue about performance in theory might (mis)lead to wrong conclusions :) Performance-related discussions require benchmarking. I think we could do some benchmarking for RAC implementation when it's partially ready - including testing performance on instances with 100+ spaces. We could event do it now for existing Detection Engine (.siem-signals- indices), but that would not be 100% relevant test.

I'm not following this logic. The Elasticsearch guidance states that there isn't a one-size-fits-all approach to segmenting indices. Architecting a system that ignores that fundamental guidance is a bad idea.

pmuellr commented 3 years ago

I don't think that we should be putting the spaceId in the index-name. I don't think this approach is sustainable long-term because of performance, user-experience and it doesn't work with alerts that are shared in multiple spaces. If we can agree on the long-term solution, I'm hopeful that we can find a stop-gap solution to keep the .siem-signals-${spaceiId} indices working for the short-term.

I think we've been assuming "saved objects as data" is part of the long-term solution, but there doesn't seem to be much action in that issue, and it's not clear it's even appropriate. Neither the alerts nor events (nor event log) indices are "saved objects". What it feels like we need is "data with saved object constraints" - where you can somehow combine elasticsearch index patterns and rules for how to generate a filter for them, that correspond to the same constraints we get with saved objects - namespace and feature constraints. And it needs to work as a Kibana index pattern for Lens/Discover usage. It seems like it's more of an index pattern-y thing, than a saved object-y thing. Or perhaps using existing facilities like elasticsearch filtered aliases, which we would somehow create as-needed, that would do the same thing.

kobelb commented 3 years ago

What it feels like we need is "data with saved object constraints" - where you can somehow combine elasticsearch index patterns and rules for how to generate a filter for them, that correspond to the same constraints we get with saved objects - namespace and feature constraints.

Agreed. I think it's largely just phrasing though. I think the end-goal is the same. We have documents in Elasticsearch indices that we'd like end-users to query in a free-form manner that we'd like to abide by the "Kibana entity model" that is generally enforced by "saved-objects".

While "saved objects as data" or "data with saved object constraints" are what we've been talking about as the long-term solution, we also have the ability to use DLS in the short-term and immediately to allow users to grant access to a subset of the documents. It's also possible short-term for our users to use the proposed "Alerting namespace" to do per-index segmentation as well.

jasonrhodes commented 3 years ago

To do so, we should take advantage of the "namespace" of the datastream and allow the user to create new "Alerting namespaces" that are reflected in the datastream names: .alerting-alerts-${namespace} and .alerting-events-${namespace}. We have the option of automatically creating new ILM policies or we can have the default ILM policies apply to these indices as well. Then, we should allow the users to specify the alerting "namespace" per Alerting rule, or per space.

I'm not sure if this is relevant or possibly just more semantics, but I know data streams come with namespaces like this, but I had heard we weren't going to be using ES data streams for alerts for some reason (I don't know why). Is that the case, still? Does it matter for this discussion?

jasonrhodes commented 3 years ago

FWIW, putting the spaceId in the index name feels like a bad decision, long term. I think we leaned that way partially because of how siem-signals is set up but also partially to better support users who want to bypass Kibana RBAC and use the config option that makes alerts queried by current user, and then use space IDs to lock down per-space security manually outside of Kibana's RBAC model. Is that a correct summary, there?

If that's accurate, it sounds like a lot of exceptions and shenanigans, tbh. This may be an ok middle ground for these users:

we also have the ability to use DLS in the short-term and immediately to allow users to grant access to a subset of the documents. It's also possible short-term for our users to use the proposed "Alerting namespace" to do per-index segmentation as well.

pmuellr commented 3 years ago

It seems like the last mention on "migration" is this comment, but not quite settled? Another thing to factor in to the "long-term solution", but presumably we'll have point releases where mappings will change, so will need some "tactical" plan as well.

I'd love to have a story that doesn't require re-indexing (migrations).

But if we follow the outline suggested in the referenced comment, it's not clear to me - for re-indexing - how the app then knows what the names of the indices are per-version, since presumably searches would be shaped differently across differently versioned indices. Maybe a constant keyword field for the stack version would work?

pmuellr commented 3 years ago

By default, I think that all "mutable alerts" should be written to a .alerting-alerts-default datastream that has an ILM policy with a default hot-phase that rolls over after 30 days or 50 GB; and all "immutable events" should be written to a .alerting-events-default datastream that also has the same ILM policy.

One use case that already came up with these is that Security would want to keep their alerts rollover to be longer than the events rollover. Events are for shorter-term diagnostic / status kind of things, alerts are things you might want to look back for in years. IIRC.

kobelb commented 3 years ago

It seems like the last mention on "migration" is this comment, but not quite settled? Another thing to factor in to the "long-term solution", but presumably we'll have point releases where mappings will change, so will need some "tactical" plan as well.

I'd love to have a story that doesn't require re-indexing (migrations).

But if we follow the outline suggested in the referenced comment, it's not clear to me - for re-indexing - how the app then knows what the names of the indices are per-version, since presumably searches would be shaped differently across differently versioned indices. Maybe a constant keyword field for the stack version would work?

That's a good point, @pmuellr. Ideally, we'll only be making additive mapping changes to these indices so we could alter the mappings for existing indices. However, to prepare for a breaking-change to the mappings which will inevitably occur (hopefully only with major version upgrades), we could put a version in the index-name itself. Having to create a new index for every minor/patch seems a bit excessive as this is something we would want to avoid and only do as a worst-case solution. However, using an incrementing integer seems reasonable. For example .alerting-alerts.v1-default and .alerting-events.v1-default

It's worth noting that the new indexing strategy that is used by Elastic Agent/Fleet doesn't include the version in these indices, and instead will use an ingest pipeline to transform documents that are being ingested to the new format.

One use case that already came up with these is that Security would want to keep their alerts rollover to be longer than the events rollover. Events are for shorter-term diagnostic / status kind of things, alerts are things you might want to look back for in years. IIRC.

Gotcha. In that case, then I think creating two ILM policies makes the most sense: one for "mutable alerts" and one for "immutable events".

banderror commented 3 years ago

@kobelb thank you for your suggestions 👍 This does make sense to me, however there's a number of things in the setup you suggested which are not super clear.

If we'd end up with .alerting-alerts-default and .alerting-events-default by default, we'd remove from the index name the solution/consumer part (e.g. security, observability.uptime) and the Kibana space id part. In the apps, we'd still need to be able to:

Filter alerts and events by solution/consumer, so that in the Security app we'd show only Security alerts etc.
Filter alerts and events by space id, so that alerts and events generated by rules (which always live within a space) would be visible only within the current space.
Give users a way to slice and dice alerts and events in visualizations, where I think a user should be able to slice and dice things by solution/consumer and space id.
I guess in both apps and visualizations spaces and RBAC should work, meaning that the user shouldn't be able to query alerts/events they don't have access to.

Is this achievable if we don't include space id and solution/consumer in the index name? Heard "document-level security" a few times, wondering could this solve the problem if we write solution/consumer and space id to every document and then somehow set up RBAC to enforce access privileges. Sorry if this was already discussed previously but I've missed that. @spong could you please also comment on that?

This provides the user with the flexibility to adapt their index usage to their specific Alerting usage after providing sane and safe defaults.

I get it, however it's not super clear how we would give users ways to override defaults and introduce custom "namespaces". What would be the UX? Should it be configurable via kibana config, via Security/Observability app settings in the UI, via rule parameters in the UI? What if the user changes these settings multiple times (from "default" -> to custom based on rule type or rule id -> to custom based on space id -> etc). Seems complicated, lots of moving parts = ways to make a mistake.

Regarding ILM policies (retention, rollover etc) - here I mentioned that Security and Observability might have different requirements for alerts and events:

Security alerts should be stored for years (at least 1 year) or even "forever".
Security events are short-term objects, days/weeks retention would probably be enough.
Observability alerts and events probably are less demanding in terms of retention. Also, they seem like to require same/similar ILM policies. @dgieselaar please correct me if I'm wrong.

So based on that assumptions we think that keeping solution/consumer in the index name makes sense. It should be possible to set reasonable defaults for security/observability alerts/events in the code, and these defaults would probably be different.

banderror commented 3 years ago

I'm not sure if this is relevant or possibly just more semantics, but I know data streams come with namespaces like this, but I had heard we weren't going to be using ES data streams for alerts for some reason (I don't know why). Is that the case, still? Does it matter for this discussion?

@jasonrhodes I think it was "let's consider using data streams later and for now keep the naming compatible with data streams". I found the following line in the alerts-as-data agenda doc: "We don’t use data streams for now. We keep doing ILM managed indices but ideally use a naming scheme that would allow us to adopt data streams later." @tsg @spong might provide more thoughts on this.

~Not sure whether it's important for the data streams topic or not, but we in Security have a feature called "timestamp override", where the user can override the default @timestamp field name to a different one - per rule instance. This feature would need to work on top of data streams (which seem to rely on @timestamp), I just don't know if it could but thought it's worth mentioning this.~ UPD: sorry, please ignore this, this is not relevant to indices for alerts and execution events, timestamp override works with source events.

I think we could investigate it separately. I could open a ticket for it.

banderror commented 3 years ago

@pmuellr regarding migrations (thank you for raising this topic again) I think we have 2 parts to address:

New alerts <-> old alerts (.siem-signals) compatibility. Garret outlined this in the previous comment here (https://github.com/elastic/kibana/issues/98912#issuecomment-840900941) and opened a ticket for that (https://github.com/elastic/kibana/issues/100103).
Migration system for the new indices. I think this is a whole separate topic, I don't have anything new to say on top of https://github.com/elastic/kibana/issues/98912#issuecomment-832883089. What I think we're missing is a ticket for that and I could create one.

kobelb commented 3 years ago

Is this achievable if we don't include space id and solution/consumer in the index name?

Yup! When the query is being executed by Kibana for an end-user, we can restrict the documents that are returned by adding a filter. If end-users will be accessing the documents directly in Elasticsearch, they can create a role that has DLS configured that apply a similar filter.

I get it, however it's not super clear how we would give users ways to override defaults and introduce custom "namespaces". What would be the UX? Should it be configurable via kibana config, via Security/Observability app settings in the UI, via rule parameters in the UI? What if the user changes these settings multiple times (from "default" -> to custom based on rule type or rule id -> to custom based on space id -> etc). Seems complicated, lots of moving parts = ways to make a mistake.

There are definitely details to the user experience that we'll need to iron out; however, I think we should expose the full set of functionality via the UI. There's always the potential that we ship initially without the "Alerting namespace" feature and add it in a subsequent release.

Regarding ILM policies (retention, rollover etc) - here I mentioned that Security and Observability might have different requirements for alerts and events: Security alerts should be stored for years (at least 1 year) or even "forever". Security events are short-term objects, days/weeks retention would probably be enough. Observability alerts and events probably are less demanding in terms of retention. Also, they seem like to require same/similar ILM policies. @dgieselaar please correct me if I'm wrong. So based on that assumptions we think that keeping solution/consumer in the index name makes sense. It should be possible to set reasonable defaults for security/observability alerts/events in the code, and these defaults would probably be different.

I don't think we should assume that all users need different retention policies for their alerts purely segmented by the "consumer". It's entirely possible that users would prefer further segmentation within a consumer, or to treat all alerts the same.

If we do decide that we should be default segment the retention policies based on the consumer, we can always ship with "default namespaces" for the different consumers that would be automatically specified, but overrideable.

tsg commented 3 years ago

Is this achievable if we don't include space id and solution/consumer in the index name?

Yup! When the query is being executed by Kibana for an end-user, we can restrict the documents that are returned by adding a filter. If end-users will be accessing the documents directly in Elasticsearch, they can create a role that has DLS configured that apply a similar filter.

The struggle that I have with DLS is that it is harder for the user to configure compared to using index names. My understanding is that custom dashboards, Discover access, embedded Lens, etc. on the alerts data is one of the most important features of "Alerts as Data" and is required by both Security and Observability. @MikePaquette, @cyrille-leclerc please keep me honest here.

If we're telling users that in order to use Discover to view Observability alerts but not Security alerts, they need to be paying for Platinum and setup DLS via the API (I think we don't have a UI yet?), I think that's a bad experience, so there needs to be a good reason. But maybe I'm overestimating how big of a problem this would be for customers.

@MikePaquette @cyrille-leclerc If we were to store all Observability and Security alerts in a single index (.alerting-events-default), and ask them to use DLS to separate between roles when creating dashboards, would that be a deal-breaker for some of the customers?

jasonrhodes commented 3 years ago

I'm not sure about other RFC formats, but I think this one would benefit from a clear outline of the requirements of the data retrieval. For instance, "Basic users are able to query for alerts and evaluation events segmented by space ID, user-given namespace, and consumer string", etc. I'm having trouble keeping track of which requirements come from where, which are non-negotiable, etc., which makes designing the system feel like trying to make a slightly-too-small blanket fit perfectly over a bed, or some better metaphor for an impossible task.

Do we have those requirements in other documents? I imagine they're spread out everywhere, and I'm honestly not sure where they officially belong or who should own that task.

MikePaquette commented 3 years ago

@MikePaquette @cyrille-leclerc If we were to store all Observability and Security alerts in a single index (.alerting-events-default), and ask them to use DLS to separate between roles when creating dashboards, would that be a deal-breaker for some of the customers?

@tsg It depends on the resultant user experience. Here are a few possible deal-breakers:

For the security analyst persona, they should not have to know any details about how/where data associated with rules/alerts/cases are stored. Their scope for rules/alert/cases data should be implicit based on what workspace (Kibana space) they're working in. Whether in-solution, or elsewhere in Kibana, the presentation of rules/alerts/cases data must be consistently scoped with no additional cognitive burden such as knowing which field:value pair indicates their data, and requiring them to manually add a filter to scope the data to their workspace. So any RAC indexing scheme that requires analysts to learn/use these filters for scoping rules/alerts/cases would be a deal-breaker.
Basic-licensed users of the security solution currently enjoy the capability to have separate workspaces for "dev/prod" and other organizational groupings. In each workspace (Kibana space) they have independent sets of rules/alerts/cases data. To remove this popular use case from the Basic tier user would be a breaking change for many users, and would likely be a significant impediment to future adoption of the solution. So any RAC indexing scheme that removes the ability of basic-tier users to have separate scoped workspaces for rules/alerts/cases would be a dealbreaker.
The capability to execute rules against alert data (previous alerts) is a key differentiator in our security solution. Today, the rule_author/detection engineer is allowed to scope a rule to alert data in specific set of workspaces (Kibana spaces) by specifying the index patterns associated with the rule, but keeping the rule logic focused on the desired detection. Any RAC indexing scheme that forces the rule_author/detection engineer to modify rule logic to include a filter condition to indicate which workspace's alerts to include, would be counter to the principles of rule scope class that we aspire to, and would be considered a significant step backwards, if not a dealbreaker.

elastic / kibana