
[RAC][Meta] Consolidate the two indexing implementations in rule_registry plugin #101016

Open banderror opened 3 years ago

banderror commented 3 years ago

Summary

Consolidate https://github.com/elastic/kibana/pull/98353 and https://github.com/elastic/kibana/pull/98935 into a single implementation.

We ended up having 4 implementations of index management/writing/reading related to the problem we're trying to solve in RAC: two in rule_registry (RuleDataService and EventLogService), one in security_solution (for the .siem-signals index), and one in the event_log plugin. We should compare them, weigh their strengths and weaknesses, and build a single consolidated implementation in rule_registry.

High-level plan:

Regarding robust index bootstrapping, consider this:

When the consolidated implementation is ready, make sure to update all references to it from plugins which already use it: security_solution, observability, apm, uptime, etc.

Tasks for 7.15

Tasks for 7.16

Backlog

Indexing and index bootstrapping logic:

API enhancements for RuleDataService and RuleDataClient:

User-defined resources:

Misc:

Consider these as well:

elasticmachine commented 3 years ago

Pinging @elastic/security-solution (Team: SecuritySolution)

elasticmachine commented 3 years ago

Pinging @elastic/security-detections-response (Team:Detections and Resp)

banderror commented 3 years ago

We had a chat with @marshallmain last week and here's a summary of what we discussed and how we decided to proceed with the consolidated implementation:

banderror commented 3 years ago

I opened a draft PR (https://github.com/elastic/kibana/pull/101453) for the consolidated implementation. Here's a comparison of the 3 implementations in rule_registry:

| Feature | RuleDataService | EventLogService | Consolidated implementation |
| --- | --- | --- | --- |
| Schema | Implicit, schema-less (indexing function accepts documents as any) | Explicit schema for documents | Explicit schema for documents |
| Dependency on FieldMap | No hard dependency | Hard dependency | No hard dependency |
| Component templates | Yes | No | Yes |
| Versioning of schema, mappings, settings | No | No | Yes |
| Soft document migrations (index rollover) | No | No | Yes |
| Index management API | Imperative | Declarative | Declarative |
| Index management encapsulation | Low-level methods are exposed to clients; clients are responsible for correct index management | Low-level methods are encapsulated; the mechanism is responsible for index management | Low-level methods are encapsulated; the mechanism is responsible for index management |

banderror commented 3 years ago

cc @tsg @spong @xcrzx @marshallmain @jasonrhodes @dgieselaar @smith @gmmorris

banderror commented 3 years ago

One more thing I forgot to mention is about component templates. In the consolidated implementation, I'm proposing the following "architecture" for templates (as you can see here and here in the code):

During index bootstrapping a number of templates will be created by the Event Log mechanism. They will have a certain order of precedence and the "next" template will override properties from all the "previous" ones. Here's the list of templates from "start" (most generic, least precedence) to "finish" (most specific, most precedence):

  1. Mechanism-level .alerts-mappings component template. Specified internally by the Event Log mechanism. Contains index mappings common to all logs (observability alerts, security execution events, etc).
  2. Mechanism-level .alerts-settings component template. Specified internally by the Event Log mechanism. Contains index settings which make sense to all logs by default.
  3. Log-level .alerts-{log.name}-app application-defined component template. Specified and versioned externally by the application (plugin) which defines the log. Contains index mappings and/or settings specific to this particular log. This is the place where you as application developer can override or extend the default framework mappings and settings.
  4. Log-level .alerts-{log.name}-user user-defined component template. Specified internally by the Event Log mechanism; it is empty and not versioned. By updating it, the user can override default mappings and settings.
  5. Log-level .alerts-{log.name}-user-{spaceId} user-defined space-aware component template. Specified internally by the Event Log mechanism; it is empty and not versioned. By updating it, the user can override default mappings and settings of the log in a certain Kibana space.
  6. Log-level .alerts-{log.name}-{spaceId}-template index template. Its version and most of its options can be specified externally by the application (plugin) which defines the log. This is the place where you as application developer can override user settings. However, mind that the Event Log mechanism has the last word and injects some hard defaults into the final index template to make sure it works as it should.

Template #6 overrides #5, which overrides #4, which overrides #3, etc. More on composing multiple templates in the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-template.html#multiple-component-templates
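
To make the composition order concrete, here is a minimal sketch of what the final index template could look like when created with the Elasticsearch client. The log name, template names, and the injected hard default are illustrative placeholders, not the actual rule_registry code:

import { Client } from '@elastic/elasticsearch';

const clusterClient = new Client({ node: 'http://localhost:9200' });

// In composed_of, later component templates take precedence over earlier ones,
// and the index template's own "template" section overrides all of them.
async function createLogIndexTemplate() {
  await clusterClient.indices.putIndexTemplate({
    name: '.alerts-security-events-default-template',
    body: {
      index_patterns: ['.alerts-security-events-default-*'],
      composed_of: [
        '.alerts-mappings',                      // 1. mechanism-level mappings
        '.alerts-settings',                      // 2. mechanism-level settings
        '.alerts-security-events-app',           // 3. application-defined, versioned by the plugin
        '.alerts-security-events-user',          // 4. user-defined, empty by default
        '.alerts-security-events-user-default',  // 5. user-defined, space-aware
      ],
      template: {
        // 6. hard defaults injected by the mechanism have the last word (placeholder setting)
        settings: { number_of_shards: 1 },
      },
      _meta: { version: 1 },
    },
  });
}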

Mechanism-level .alerts-mappings and .alerts-settings component templates are managed purely by the mechanism, bootstrapped on rule_registry plugin start, common to all logs defined via the consolidated implementation, and it's not possible for a client (application/plugin) to change them. There are only 2 of them in the cluster.

All the other log-level templates (3 component templates and 1 index template) are created for each log defined via the consolidated implementation.

As a developer defining a log (and thus its mappings and settings), you are able to specify templates #3 (application-defined component template) and optionally #6 (index template).

banderror commented 3 years ago

@marshallmain @madirey @peluja1012 @xcrzx @ecezalp @spong here's my 2 cents on our consolidation efforts and https://github.com/elastic/kibana/pull/101453 vs https://github.com/elastic/kibana/pull/102586

This is all just my subjective opinion, my understanding might be incomplete, and I don't have any strong opinions on anything below.

https://github.com/elastic/kibana/pull/101453

Pros:

Cons:

https://github.com/elastic/kibana/pull/102586

Pros:

Cons:

Random thoughts regarding a system of component templates and their versioning

I really liked the implementation in https://github.com/elastic/kibana/pull/102586 which puts a version in the _meta of each template => which gets propagated to the final index template => which gets propagated to the index:

_meta: {
  "versions": {
      ".alerts-security-solution-index-template": 1,
      ".alerts-security-solution-mappings": 1,
      ".alerts-ecs-mappings": 1,
      ".alerts-technical-mappings": 1
   }
}
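
As a rough illustration (not code from either PR; the helper and client setup are assumptions), those propagated versions could then be compared against the versions Kibana expects, to decide whether mappings need to be updated:

import { Client } from '@elastic/elasticsearch';

const clusterClient = new Client({ node: 'http://localhost:9200' });

// Read the versions stamped into the concrete write index and check whether
// any template is newer than what the index was created or last updated with.
async function indexNeedsUpdate(
  indexOrAlias: string,
  expectedVersions: Record<string, number>
): Promise<boolean> {
  const { body } = await clusterClient.indices.getMapping({ index: indexOrAlias });
  // The response is keyed by concrete index name, so take the first entry.
  const indexMappings: any = Object.values(body)[0];
  const currentVersions: Record<string, number> = indexMappings?.mappings?._meta?.versions ?? {};
  return Object.entries(expectedVersions).some(
    ([templateName, version]) => (currentVersions[templateName] ?? 0) < version
  );
}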

I also like the idea of lazy index creation. FWIW, in https://github.com/elastic/kibana/pull/101453 index bootstrapping is also done lazily. The difference is that in https://github.com/elastic/kibana/pull/102586 more parts are shared between spaces, so 2/3 of the bootstrapping happens outside of RuleDataClient (the client of a concrete log) and 1/3 inside it (index rollover or creation), whereas in https://github.com/elastic/kibana/pull/101453 most of the index bootstrapping is done by the client of a concrete log.

I'm wondering, however, what you think about standardizing templates similarly to how I did it in the draft (see https://github.com/elastic/kibana/issues/101016#issuecomment-857593741).

Random thoughts regarding the new agreement on namespacing strategy

Let me copy-paste this here:

Rationale and proposed UI for the {namespace} portion

The {namespace} portion of the index naming strategy is meant to provide flexibility for our users to segment the Alerts data in different ways, similar to how the data stream naming strategy has a {namespace} that can be used for different use cases. For example, users might want different ILM policies for the alerts created by some rules compared to others.

We generally don’t want to make a direct connection between this concept of {namespace} and Kibana spaces. Users can choose to include the Kibana space id in the {namespace}, but they might also choose some other means to segment the data.

Therefore, we’d like the {namespace} to be configurable on a per-rule basis via a simple text input widget. The value of the {namespace} is saved in the Rule object. In addition, there will be a Kibana advanced setting per space that defines the default value of this input widget.

In other words, users that would like to separate alerts into different indices by using the space IDs can configure the value of the Kibana advanced setting to be equal to the space ID. When creating a new rule in that space, the {namespace} option will be pre-filled with the space ID. They can still change it on every rule if they wish to.

Note about multi-space future: This should continue to work the same way even when multi-space rules become available. The namespace value will be stored on each rule, so it will continue to be tied to the space in which it was created. A user who shares that rule to a 2nd space, for example, would need to decide for themselves the best namespace value in that scenario. A warning could be shown at Rule sharing time indicating the index in which the alerts will be written and giving the user the opportunity to change it. Newly created rules would continue to read in the “default value” from the advanced settings in the current Kibana Space where the rule is being created from.

I think this agreement does not have much effect on either of the two approaches. The https://github.com/elastic/kibana/pull/101453 API does not force a developer to bind a log to a Kibana space. It provides a method for resolving a log which accepts a space id in its parameters:

export interface IEventLogResolver {
  resolve<TMap extends FieldMap>(
    definition: IEventLogDefinition<TMap>,
    spaceId: string
  ): Promise<IEventLog<Event<TMap>>>;
}

For its implementation, spaceId is just a string, and we could just rename it to namespace and pass this value from parameters of a rule instance.

Final thoughts

I could easily be biased towards https://github.com/elastic/kibana/pull/101453, but I feel like it provides a better API in general (although it may not be perfect and is missing certain methods, etc.).

But given the time left before 7.14 FF and overall fatigue from working on alerts-as-data/event-log/RAC stuff, I think as of now we should proceed with https://github.com/elastic/kibana/pull/102586, because "progress, simple perfection".

We need to ship something and move forward, and with https://github.com/elastic/kibana/pull/102586 there's a much better chance of finalizing this implementation sooner.

marshallmain commented 3 years ago

Thanks for summarizing, Georgii! I'm copy-pasting the note I sent to Georgii on Slack here to summarize my perspective, developed over the past 2 weeks.

At a very high level, there are 2 primary goals I have in mind for the eventual converged Alert Data client.

  1. Ease of understanding for developers who may not have worked with it before. Since this client is shared across multiple teams and doesn't have clear ownership right now, it's even more critical IMO that it's simple for developers to understand how the client works.
  2. Flexibility as the RAC initiative evolves. Since we're developing a client to be shared across teams, the client must be able to meet both teams' requirements (and ideally future teams' requirements as well) without requiring significant changes to accommodate them. To me, this implies that the client should act as a library of useful functions for the most part, allowing solutions to compose those functions as they see fit.

I spent 2 days or so comparing the RuleDataClient with the draft consolidated EventLogService to build a mental model of each implementation, assess the existing features, and start estimating the difficulty of adding the additional required features to each implementation. After those 2 days, I felt very confident in my mental model of the RuleDataClient, but I was still struggling to remember how the abstraction layers fit together in the EventLogService. I could follow the call stack when looking at a specific piece, but the number of calls into various classes meant I would quickly forget the details of what each class did and stored. As for features, your comment does an excellent job summing up all the key differences we were aware of at the time. When I started digging into the implementations, I realized that the namespacing strategy also differed between them. The EventLogService creates templates and indices at the same time and creates a separate template for each space, whereas the RuleDataClient does not create concrete indices until documents are written and the template creation process is decoupled from the index creation. This positions the RuleDataClient well to support the arbitrary namespaces that users can define for their RAC alerts since a single set of templates already serves all namespaces for a particular solution's alert indices. Due to the requirement for arbitrary namespaces, it looked to me like the EventLogService would have to be updated to decouple the template creation from index creation - in the process breaking the existing template/index versioning and rollover logic in the EventLogService.

So overall, there appeared to be 2 paths forward:

Since (1) there are no blockers preventing the RuleDataClient from being used as the basis for Security Solution rules in RAC - any enhancements are great but not required for shipping so we can more easily make enhancements in parallel with migrating rules, (2) the RuleDataClient is in use already by Observability, and (3) I was more confident in my understanding of the RuleDataClient implementation and thus felt that other devs new to the consolidated client would have an easier time working on it, I decided to try adding the versioning and rollover logic to the RuleDataClient.

marshallmain commented 3 years ago

Summarizing next steps after meeting with @banderror and @xcrzx:

Priorities:

Details:

This approach should adopt the declarative style from the event_log. The main difference is that it aims to be a lighter-weight implementation that encapsulates template bootstrapping details while providing some flexibility in log definitions. Rather than caching objects for repeated retrieval from the service, the RuleDataPluginService will simply create templates and return a Promise<RuleDataClient>. Solution plugins will be responsible for passing the RuleDataClient around to the appropriate places where it is needed.
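
A minimal sketch of that plugin-side wiring (the method name, option names, and types are placeholders for illustration, not the agreed API):

// Sketch: the service bootstraps the templates up front and hands back a
// Promise<RuleDataClient>; the solution plugin awaits it and passes the client on.
interface RuleDataClient {
  getWriter(options?: { namespace?: string }): { bulk(docs: unknown[]): Promise<void> };
}

interface RuleDataPluginService {
  initializeIndex(options: { registrationContext: string; datasetSuffix: string }): Promise<RuleDataClient>;
}

async function setupSolution(ruleDataService: RuleDataPluginService): Promise<RuleDataClient> {
  // Template bootstrapping happens inside the service; the plugin only awaits the result.
  const ruleDataClient = await ruleDataService.initializeIndex({
    registrationContext: 'security',
    datasetSuffix: 'alerts',
  });

  // The plugin is responsible for handing this client to rule executors, route handlers, etc.
  return ruleDataClient;
}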

banderror commented 3 years ago

I'd like to just add my 2 cents and try to provide a more high-level overview of our recent discussions with @marshallmain and @xcrzx.

We had a chat on the following topics:

@marshallmain perfectly described all the details; I just want to list again, at a higher level, what we're going to do in the next PR:

Again, for more details please refer to https://github.com/elastic/kibana/issues/101016#issuecomment-867830884.

banderror commented 3 years ago

@jasonrhodes @dgieselaar @smith please review ^^^^ and give us some feedback 🙂 Let me know if we should elaborate more on anything. The plan is to start working on the next PR soon, would be great to unblock https://github.com/elastic/kibana/pull/102586 sooner rather than later because it brings changes to the public API of RuleDataService and RuleDataClient.

weltenwort commented 3 years ago

The plan makes sense to me. I'm absolutely in favor of a declarative asset/template API which takes care of the "technical fields". Automatic type derivation sounds nice but a bit of duplication (mapping and io-ts type, for example) might be acceptable to unblock the follow-up efforts sooner IMO.

xcrzx commented 3 years ago

As discussed with @banderror, here are some weak spots in the current implementation that we could address in the consolidated implementation.

1. Index name creation & update logic

Currently, index name creation and update logic is split into two pieces.

rule_registry/server/rule_data_client/index.ts contains index name creation logic:

function createWriteTargetIfNeeded() {
  // ...
  const concreteIndexName = `${alias}-000001`;
  // ...
}

rule_registry/server/rule_data_plugin_service/utils.ts contains index name update logic:

function incrementIndexName(oldIndex: string) {
  const baseIndexString = oldIndex.slice(0, -6);
  const newIndexNumber = Number(oldIndex.slice(-6)) + 1;
  if (isNaN(newIndexNumber)) {
    return undefined;
  }
  return baseIndexString + String(newIndexNumber).padStart(6, '0');
}

incrementIndexName accepts index names created by createWriteTargetIfNeeded and returns undefined if the name doesn't match the pattern. However, the relationship between the two functions is not explicit, and it could lead to bugs if they get out of sync. A more robust option would be to have a single function that appends -000001 if the name doesn't contain a numeric suffix, or increments the number otherwise.

Also, it should be flexible and not rely on the last 6 digits, because there can be edge cases in general (https://github.com/elastic/kibana/pull/102586#issuecomment-875967599).
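
A minimal sketch of such a single function, under the assumptions above (not code from either PR):

// Returns the next write index name: appends "-000001" when there is no numeric
// suffix yet, otherwise increments the existing suffix, preserving its width.
function getNextIndexName(indexName: string): string {
  const match = indexName.match(/^(.*)-(\d+)$/);
  if (match == null) {
    return `${indexName}-000001`;
  }
  const [, base, suffix] = match;
  const next = String(Number(suffix) + 1).padStart(suffix.length, '0');
  return `${base}-${next}`;
}

// getNextIndexName('.alerts-security.alerts-default')        -> '.alerts-security.alerts-default-000001'
// getNextIndexName('.alerts-security.alerts-default-000009') -> '.alerts-security.alerts-default-000010'

As discussed further down in this thread, an alternative that avoids re-implementing the naming logic altogether is to ask Elasticsearch for the next name via the rollover API's dry_run option.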

2. Index mapping update logic

Index mapping update logic has several unclear spots. I've added my concerns as comments to the updateAliasWriteIndexMapping method below.

async function updateAliasWriteIndexMapping({ index, alias }: { index: string; alias: string }) {
  const clusterClient = await this.getClusterClient();

  const simulatedIndexMapping = await clusterClient.indices.simulateIndexTemplate({
    name: index,
  });
  const simulatedMapping = get(simulatedIndexMapping, ['body', 'template', 'mappings']);
  try {
    await clusterClient.indices.putMapping({
      index,
      body: simulatedMapping,
    });
    return;
  } catch (err) {
    if (err.meta?.body?.error?.type !== 'illegal_argument_exception') {
     /* 
     * This part is unclear. Why do we skip the rollover if we've caught 'illegal_argument_exception'?
     */
      this.options.logger.error(`Failed to PUT mapping for alias ${alias}: ${err.message}`);
      return;
    }
    const newIndexName = incrementIndexName(index);
    if (newIndexName == null) { 
      /* 
       * I think we should not fail here. If the index name had no "-000001" suffix, we should append it.
       */
      this.options.logger.error(`Failed to increment write index name for alias: ${alias}`);
      return;
    }
    try {
     /* 
      * We don't check response here. It could return "{ acknowledged: false }". 
      * Should we consider a rollover to be successful in that case? 
      */
      await clusterClient.indices.rollover({
        alias,
        new_index: newIndexName,
      });
    } catch (e) {
      if (e?.meta?.body?.error?.type !== 'resource_already_exists_exception') {
      /* 
       * This part is also unclear. We log only 'resource_already_exists_exception' but silence all other exceptions. 
       * It looks incorrect.
       */
        this.options.logger.error(`Failed to rollover index for alias ${alias}: ${e.message}`);
      }

       /* 
       * There could be many reasons for the first attempt to update mappings to fail (network, etc.). 
       * I think we should retry in case of failure. 
       */
    }
  }
}

3. Error handling during bootstrapping

Currently, index bootstrapping is considered successful even if some requests failed. There is no way to programmatically retrieve the outcome of index template initialization, as we only log errors to the console. See the initialization in security_solution/server/plugin.ts, for example:

const initializeRuleDataTemplatesPromise = initializeRuleDataTemplates().catch((err) => {
  this.logger!.error(err);
});

It could lead to a situation when index bootstrapping fails, but RuleDataClient doesn't know anything about it and writes documents into an index that doesn't have proper mappings. In the worst case, dynamic mappings would be applied to the indexed documents, leading to unexpected behavior, like broken aggregations in the application code.

A safer approach would be to disable all write operations if index bootstrapping failed. This should probably be done at the rule_registry library level.
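
A minimal sketch of that kind of write-gating (class and method names are illustrative, not the actual rule_registry API):

// Sketch: the writer holds on to the bootstrapping promise and refuses to index
// documents unless bootstrapping completed successfully.
class GuardedWriter {
  constructor(
    private readonly bootstrapped: Promise<void>,
    private readonly bulkIndex: (docs: unknown[]) => Promise<void>
  ) {}

  async write(docs: unknown[]): Promise<void> {
    try {
      await this.bootstrapped;
    } catch (e) {
      // Surface the failure instead of silently writing into an index
      // that may have ended up with dynamic mappings.
      throw new Error(`Writing is disabled because index bootstrapping failed: ${e}`);
    }
    await this.bulkIndex(docs);
  }
}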

banderror commented 3 years ago

4. Retry logic during bootstrapping

We should also add retry logic to the bootstrapping process.

I talked to the Elasticsearch team, and they recommended adding retries to component and index template updates or, IMO, wherever else we could get { acknowledged: false } from ES during index bootstrapping.
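
A minimal sketch of such a retry helper (an assumption of how it could look, not agreed-upon code):

// Sketch: retry an index-management call a few times when it throws or when
// Elasticsearch responds with { acknowledged: false }.
async function retryUntilAcknowledged(
  operation: () => Promise<{ body: { acknowledged: boolean } }>,
  attempts = 3
): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const { body } = await operation();
      if (body.acknowledged) {
        return;
      }
    } catch (e) {
      if (attempt === attempts) throw e;
    }
    // Simple linear backoff between attempts.
    await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
  }
  throw new Error('Operation was not acknowledged after retries');
}

// Usage sketch:
// await retryUntilAcknowledged(() =>
//   clusterClient.indices.putComponentTemplate({ name: '.alerts-settings', body: { template: {} } })
// );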

Here's the log of our discussion:


Q: In Kibana (specifically, RAC project), we started to use the new template APIs for bootstrapping our indices (PR for reference https://github.com/elastic/kibana/pull/102586). This code is not well tested and not production ready, but it’s going to be used from Security and Observability solutions in the upcoming releases, so I’d like to make sure that we implemented it correctly. Specifically, we started to use:

We make sure to wait for the responses from these endpoints before any other code will be able to write anything to the corresponding indices. So any writing happens only if these calls succeed.

In those responses, we normally get 200 OK with the body:

{ acknowledged: true }

I wasn’t able to find it in the docs or elsewhere, so I’d like to double-check two things.

The first thing is about acknowledged: true:

The second thing is about race conditions between index mgmt and indexing requests. Is it possible to get a race condition between successfully acknowledged template mgmt requests, and the following (potentially, immediately after that) attempt to write a document to an index which name matches the name of the updated index template?

When developing locally, we’ve got into some weird broken state where the concrete index ended up with dynamic mappings instead of mappings from the index template. Most likely, this was caused by our incorrectly implemented code + constant server reloads during development. However, I decided to double-check if any race conditions are possible in general.


A:

The first thing is about acknowledged: true:

  • The meaning - does it mean that the request is accepted, but work needs to be done asynchronously in the background?

That means that the change has been acknowledged by the master and accepted with a cluster state update

  • Can we get acknowledged: false in a 200 OK response?
  • What errors can we get there in general, can we refer to some spec with listed errors and response structure?

You can get "false" for the acknowledged, if, when publishing the new cluster state, the request times out before enough nodes have acknowledged and accepted the cluster state

there are no listed errors, so generally a good rule to consider would be to retry in that case

  • So basically, when creating initial index OR updating existing one OR rolling over alias, is it possible that the concrete index won’t get the most recent mappings and settings from the (just updated) templates?

if the template update response has "acknowledged: true", then retrieving those settings through something like the simulate API or creating a new index will use the just-updated template


Q: If this question even makes sense, how many nodes is enough though? Does acknowledged: true guarantee that any subsequent write request will be handled by a node that received the most recent cluster state with the new mappings etc?

Just want to make sure there’s no way to get a race condition, leading to some weird state like let’s say a node which doesn’t know about a newly created index, processes write request and creates an index with dynamic mappings instead of mappings specified in the templates.


A: You shouldn't have to worry about this assuming the request is acknowledged: true, because that means the master node has acknowledged it, and all those requests (index creation, mappings updates) go through the master node

banderror commented 3 years ago

We identified a few potential issues with index bootstrapping and described them in the 2 previous comments.

I'm planning to address them as part of this consolidation task. I can split this up into several PRs or do it in a single one; I'm not sure at this point which way is better.

@jasonrhodes @weltenwort @marshallmain @xcrzx

marshallmain commented 3 years ago

Thanks for the writeups @xcrzx and @banderror ! Those sound like good improvements to make. The Q&A with answers from the ES team is especially useful, thanks Georgii for contacting them and writing up the answers!

This part is unclear. Why do we skip the rollover if we've caught 'illegal_argument_exception'?

It's inverted, we skip the rollover if we catch anything except for illegal_argument_exception - that's the error returned by ES when the mapping update contains a conflicting field definition (e.g. a field changes types). We expect to get that error for some mapping changes we might make, and in those cases we want to continue on to rollover the index. Other errors are unexpected and should trigger a retry IMO.

We log only 'resource_already_exists_exception' but silence all other exceptions.

This is inverted as well, we silence resource_already_exists_exception and log all others.

I think we should not fail here. If the index name had no "-000001" suffix, we should append it.

I'm a little worried about appending to the index name in the case of failures without knowing what caused the failure. Appending might fix the case where the suffix gets dropped somehow, but it could also introduce other edge cases. If the suffix gets dropped somehow during the rollover process and we fail to find the suffix on the next check and start with "-000001" again, then "-000001" could already exist. In that case it would appear that the rollover process has completed successfully when really nothing has happened.

I think the most robust approach would be to go back to using the dry_run parameter on the rollover API to retrieve the next_index name directly from Elasticsearch and avoid any attempts to replicate the rollover naming logic ourselves. We pay a slight performance cost by making an additional call to ES, but we're only doing this once at startup time so it seems worth it to minimize bug potential here.
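
A minimal sketch of that dry_run approach, assuming an @elastic/elasticsearch client (not the final implementation):

import { Client } from '@elastic/elasticsearch';

const clusterClient = new Client({ node: 'http://localhost:9200' });

// Ask Elasticsearch what the next write index would be, without performing the
// rollover, instead of re-implementing the suffix-increment logic ourselves.
async function getNextWriteIndexName(alias: string): Promise<string> {
  const { body } = await clusterClient.indices.rollover({
    alias,
    dry_run: true,
  });
  return body.new_index;
}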

We don't check response here. It could return "{ acknowledged: false }". Should we consider a rollover to be successful in that case?

This is a great point and something we should definitely handle. "{ acknowledged: false }" seems like a failure to me.

There could be many reasons for the first attempt to update mappings to fail (network, etc.). I think we should retry in case of failure.

Adding retry logic for the mapping update process in the case of (unexpected) errors would definitely be a good improvement.

A safer approach would be to disable all write operations if index bootstrapping failed. This should probably be done at the rule_registry library level.

Definitely agree with this, we should not allow writes if the templates are not installed correctly.

xcrzx commented 3 years ago

It's inverted, we skip the rollover if we catch anything except for illegal_argument_exception ...

Sorry, my bad, didn't see that little !. Thank you for the clarification. I'll add it as a comment to the code.

If the suffix gets dropped somehow during the rollover process and we fail to find the suffix on the next check and start with "-000001" again, then "-000001" could already exist.

The current implementation doesn't handle this case either. It fails when the index name doesn't contain a numerical suffix. So yeah, going back to using the dry_run is probably the best option we have.

banderror commented 3 years ago

Adding @marshallmain's suggestions regarding index bootstrapping:

  1. There is currently no ILM policy associated with the concrete indices created by the RuleDataClient. We can start by associating all concrete indices with the default ILM policy created by the rule registry. However, we also need to add an additional index template per index alias in order to specify the rollover_alias for the concrete index (https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html#ilm-gs-alias-apply-policy). Without an ILM policy, the index will eventually become much larger than necessary and could hinder performance. See the sketch after this list.
  2. In the security solution, we'd like to add an additional alias to the alerts-as-data indices in order to make them backwards compatible with existing user defined queries and visualizations built on the .siem-signals naming scheme. To accomplish this, we'd add an additional parameter to RuleDataClient that allows callers to specify an optional "secondary alias" that would be added to the concrete indices.
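
A sketch of what such a per-alias index template could look like, wiring in the ILM policy and rollover_alias. The policy and alias names are placeholders, and an @elastic/elasticsearch client is assumed:

import { Client } from '@elastic/elasticsearch';

const clusterClient = new Client({ node: 'http://localhost:9200' });

// One index template per index alias, so every concrete index created under the
// alias picks up the ILM policy and knows which alias to roll over.
async function createAliasIndexTemplate() {
  await clusterClient.indices.putIndexTemplate({
    name: '.alerts-security.alerts-default-template',
    body: {
      index_patterns: ['.alerts-security.alerts-default-*'],
      template: {
        settings: {
          'index.lifecycle.name': '.alerts-default-ilm-policy',
          'index.lifecycle.rollover_alias': '.alerts-security.alerts-default',
        },
      },
    },
  });
}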

A note regarding ILM policy, index template and namespace:

banderror commented 3 years ago

Adding suggestions after syncing with Observability folks (cc @weltenwort, btw I don't know GitHub usernames of Kerry and Panagiota, so please feel free to tag them):

  1. Enforce allowed values for the datasetSuffix at the API level. In the index naming convention .alerts-{registrationContext}.{datasetSuffix}-{namespace}, datasetSuffix can only be alerts or events. It would make sense to restrict it in the RuleDataService interface to make it safer and more explicit for developers (see the sketch after this list).
  2. Provide different default values for namespace:
    • in getWriter set namespace = 'default' by default
    • in getReader maybe set namespace = '*' by default
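
A sketch of how these two points could surface in the interfaces (type and method names are assumptions, not the agreed API):

// Restrict the dataset suffix to the two allowed values at the type level.
type DatasetSuffix = 'alerts' | 'events';

interface IndexOptions {
  registrationContext: string; // e.g. 'security', 'observability.logs'
  datasetSuffix: DatasetSuffix;
}

interface IRuleDataClient {
  // Writers could default to the 'default' namespace...
  getWriter(options?: { namespace?: string }): unknown;
  // ...while readers could default to reading across all namespaces ('*').
  getReader(options?: { namespace?: string }): unknown;
}

interface IRuleDataService {
  initializeIndex(options: IndexOptions): Promise<IRuleDataClient>;
}
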
banderror commented 3 years ago

Just want to socialise here that something like https://github.com/elastic/kibana/issues/101541 (a higher-level abstraction on top of index management) is out of scope of this consolidation effort.

weltenwort commented 3 years ago

cc @Kerry350 @mgiota

mgiota commented 3 years ago

@banderror I took the liberty of adding one more ticket to your list for 7.15 (@jasonrhodes, can you confirm this is for 7.15?). This is a new bug that we spotted while working on https://github.com/elastic/kibana/pull/110788.

banderror commented 3 years ago

@mgiota Thank you, I can only support that 👍 👍

banderror commented 2 years ago

Hey everyone, I removed this ticket from the backlog of the Detection Rules area. We (@elastic/security-detections-response-rules) are not the owners anymore (however feel free to still ping us if you have any tech questions about the ticket).

Ownership of this epic and its sub-tasks now goes to the Detection Alerts area (Team:Detection Alerts label). Please ping @peluja1012 and @marshallmain if you have any questions.

marshallmain commented 2 years ago

Transferring again to @elastic/response-ops as they now own the rule registry implementation.