elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.61k stars 8.22k forks source link

[Platform][Alerting] Reducing the cost of change in data migrations #96291

Open gmmorris opened 3 years ago

gmmorris commented 3 years ago

Context

As the RAC initiative progresses, we're making decisions about the shape of the Alerts-as-data indices and various Saved Objects. These decisions are based on what we know now about the needs of Security, O11y and Stack Rules (including ML and Maps), but I think it's safe to assume we're going to get some thing wrong, and other things are likely to change (as new requirements come in, and we broaden the scope to include other features as part of 8.x).

As thing stand, the cost of change to the shape of data (both in system/hidden indices and SOs) is so high that we tend to think of these as prohibitive. This fear has historically made it hard for the Alerting team to make decisions about the correct generic approach, slowing us down.

I'd like to use this issue to discuss how we can change that.

Key Challenges

Below are the key challenges, making the cost of change higher than we'd like it to be:

  1. There is no OOTB method for migrating system/hidden indices. SIEM currently have a curl-based manual index migration, but there is no system in place to ensure this migration is called correctly as part of an upgrade (as we have with SO migrations). If we need to migrate a system/hidden index, the new should provide a platform level service for this.
  2. As Alerts-as-data grows in use, we expect many customers to accumulate huge data indices full of alerts. It doesn't seem reasonable to expect historical indices to be migrated when this schema changes, but neither can we afford to lock ourselves into a schema that was designed in 7.14 forever. How do we support incremental changes to these schema from minor to minor while still supporting historical data?
  3. There are no cross-type Saved Object migrations. For instance, each rule SavedObject has a taskId field that points at a task SO type. We need to pluck the API keys from the rule, and migrate it onto the task that is referenced by the taskId field. We have no way of doing this.

cc @kobelb @spong @tsg @alexh97 @mfinkle @elastic/kibana-core

elasticmachine commented 3 years ago

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

mfinkle commented 3 years ago

I can't comment on much here, but for Point 2, I do think we should think we could leverage Runtime Fields to define as much of the "schema" as possible to mitigate against inevitable Alert schema changes. We could keep the main body of the Alert in a jSON blob or something, and use Runtime Fields to define the specific fields pulling data from the blob. As the schema (in the blob) changes, we'd update the Runtime Field definitions.

For Point 3, we could see where cross-type SO migrations fall on the Kibana Core roadmap (if we've considered them yet)

gmmorris commented 3 years ago

I can't comment on much here, but for Point 2, I do think we should think we could leverage Runtime Fields to define as much of the "schema" as possible to mitigate against inevitable Alert schema changes. We could keep the main body of the Alert in a jSON blob or something, and use Runtime Fields to define the specific fields pulling data from the blob. As the schema (in the blob) changes, we'd update the Runtime Field definitions.

That's an interesting thought, which hadn't actually crossed my mind. @spong - How feasible would such an approach be given the kinds of aggregations, and searches you expect to run on the data in alerts-as-data? 🤔

gmmorris commented 3 years ago

For Point 3, we could see where cross-type SO migrations fall on the Kibana Core roadmap (if we've considered them yet)

🙏 I don't think this is needed until the 8.x timeframe. (am I right, @mikecote ?)

mikecote commented 3 years ago

For Point 3, we could see where cross-type SO migrations fall on the Kibana Core roadmap (if we've considered them yet)

🙏 I don't think this is needed until the 8.x timeframe. (am I right, @mikecote ?)

Correct, I don't think we'll need this before 8.x.

spong commented 3 years ago

Thank you so much for raising this @gmmorris! This has been something that's been a consistent pain point on the Detections side as we've continued to add more and more features since debuting back in 7.6. It was in preparation for GA (and a full backlog of features itching to change the mapping :) that lead us to create the Detection Alerts Migration API to at least provide some helper to our users in the event we would need to make a migration.

For the most part we've been able to just add the new mappings with the only annoyance being that a privileged user would need to visit the Security App to trigger a rollover for the new mappings to be available, which left us in an interesting place with exposing new features that might not be supported till the right user visited the app. This shouldn't be an issue with RAC though since we're using the kibana_system_user for index mgmt (and a rollover can be triggered on rule-registry start post-upgrade). We did have one incorrectly mapped field, but with where we were at we were able to get by without a migration. @tsg has recommended field aliases to get around similar issues, but we haven't had to go that route just yet.

Either way, a platform level service for performing migrations on upgrade would go a long way in both ensuring the flexibility is there if needed, and putting us in a position to quickly iterate towards the ideal implementation without the need to over-engineer it up front in order to support what we think we may need for the next two years without a breaking change.

That's an interesting thought, which hadn't actually crossed my mind. @spong - How feasible would such an approach be given the kinds of aggregations, and searches you expect to run on the data in alerts-as-data? 🤔

Frankly, we haven't leveraged too many aggregations on top of alerts just yet (only the Alerts Histogram/Last Alert component), so it hasn't really been an issue. That said, we'll probably start introducing them more and more as we start providing advanced alert triage flows with groupings. Either way, I do like the approach of using runtime fields as the trialing of new fields before they get added to the main schema. I'm working on a runtime field reference rule, so should hopefully know a bit more here soon with regards to what that may look like. (This will be crucial for users to add their own fields to alerts since we won't be supporting composable index templates.)

pgayvallet commented 3 years ago

There are no cross-type Saved Object migrations. For instance, each rule SavedObject has a taskId field that points at a task SO type. We need to pluck the API keys from the rule, and migrate it onto the task that is referenced by the taskId field. We have no way of doing this.

That's a different need than what is specified in https://github.com/elastic/kibana/issues/91143, right? #91143 is about migrating type A objects to type B, where what you're asking here is to be able to access arbitrary objects from type B during the migration of objects of type A, right?

gmmorris commented 3 years ago

There are no cross-type Saved Object migrations. For instance, each rule SavedObject has a taskId field that points at a task SO type. We need to pluck the API keys from the rule, and migrate it onto the task that is referenced by the taskId field. We have no way of doing this.

That's a different need than what is specified in #91143, right? #91143 is about migrating type A objects to type B, where what you're asking here is to be able to access arbitrary objects from type B during the migration of objects of type A, right?

Correct 😬 It's something that only came up in this issue, and AFAIK it's the only instance we've had of this need so far.

gmmorris commented 3 years ago

Thank you so much for raising this @gmmorris! This has been something that's been a consistent pain point on the Detections side as we've continued to add more and more features since debuting back in 7.6. It was in preparation for GA (and a full backlog of features itching to change the mapping :) that lead us to create the Detection Alerts Migration API to at least provide some helper to our users in the event we would need to make a migration.

Thanks for the context :)

For the most part we've been able to just add the new mappings with the only annoyance being that a privileged user would need to visit the Security App to trigger a rollover for the new mappings to be available,...

Could you expand on this @spong ? Did you actually migrate the historical data itself? Or just the mappings?

Judging by the docs, I'm guessing it's only the mappings:

While the process is neither destructive nor interferes with existing data, it may be resource-intensive. As such, it is recommended that you plan your migrations accordingly.

I don't think we'd have much of an option of migrating actual data (as the amount of data is going to be huge), but I do wonder if there are changes we will find hard to make i the future if we're limited to only mappings changes. 🤔 I don't have much experience at migrating large systems built on the stack... any thoughts here from past experience?

...which left us in an interesting place with exposing new features that might not be supported till the right user visited the app.

This is exactly why I feel we need to discuss formalising this kind of migration. Otherwise we basically break alerting until there's user intervention and we have no clear way to express this to the user.

This shouldn't be an issue with RAC though since we're using the kibana_system_user for index mgmt (and a rollover can be triggered on rule-registry start post-upgrade). We did have one incorrectly mapped field, but with where we were at we were able to get by without a migration. @tsg has recommended field aliases to get around similar issues, but we haven't had to go that route just yet.

Either way, a platform level service for performing migrations on upgrade would go a long way in both ensuring the flexibility is there if needed, and putting us in a position to quickly iterate towards the ideal implementation without the need to over-engineer it up front in order to support what we think we may need for the next two years without a breaking change.

++

Any chance you could enumerate specific features you'd expect this platform service to provide?

That's an interesting thought, which hadn't actually crossed my mind. @spong - How feasible would such an approach be given the kinds of aggregations, and searches you expect to run on the data in alerts-as-data? 🤔

Frankly, we haven't leveraged too many aggregations on top of alerts just yet (only the Alerts Histogram/Last Alert component), so it hasn't really been an issue. That said, we'll probably start introducing them more and more as we start providing advanced alert triage flows with groupings. Either way, I do like the approach of using runtime fields as the trialing of new fields before they get added to the main schema. I'm working on a runtime field reference rule, so should hopefully know a bit more here soon with regards to what that may look like. (This will be crucial for users to add their own fields to alerts since we won't be supporting composable index templates.)

Brilliant, that's a good start. Is it worth adding an action item on the Alerts-as-data issue to identify parts of the schema that we should/could express as runtime field instead of on the mappings? 🤔 Or would you rather leave that as something to do at a later stage (assuming we can modify mappings after the fact as part of upgrades)?

rudolf commented 3 years ago

Runtime fields and aliases can be helpful ways to reduce the impact of this problem, but for any domain there's always inevitably something new you learn that requires you to model your data in a completely different way where just changing the mappings isn't enough, you also need to change your documents.

Ultimately we want to help developers shift this complexity to the migration system so that their business logic is free from if/else clauses dealing with all the past modelling mistakes.

To solve (1) and (2) I've been thinking of a slightly different migration algorithm that's eventually consistent in order to give us the performance over large data sets: https://github.com/elastic/kibana/issues/96626

For (3) the naive solution is to allow performing an async read for every transformed document, but this would be very inefficient. The more performant solution is what we documented in https://github.com/elastic/kibana/issues/34996 which would basically allow the migration to collect some data before it starts and then use that data for transforming every document. This means you need to be able to fit all the data you need in memory but this is probably sufficient for most of the migrations we're doing.

pgayvallet commented 3 years ago

@rudolf should we reopen https://github.com/elastic/kibana/issues/34996 then (or is it too soon in the discussion)?

gmmorris commented 3 years ago

Runtime fields and aliases can be helpful ways to reduce the impact of this problem, but for any domain there's always inevitably something new you learn that requires you to model your data in a completely different way where just changing the mappings isn't enough, you also need to change your documents.

Ultimately we want to help developers shift this complexity to the migration system so that their business logic is free from if/else clauses dealing with all the past modelling mistakes.

To solve (1) and (2) I've been thinking of a slightly different migration algorithm that's eventually consistent in order to give us the performance over large data sets: #96626

I tend to agree with all of that. :)

Regarding https://github.com/elastic/kibana/issues/96626: I think it's a great idea. Given the problems we've seen these past few weeks with the growing task and action SOs, this would reduce the likelihood of breaking a customer upgrade. One thing to note though - this issues is not, predominantly, about SOs, as Alerts-as-data are hidden/system ECS indices, rather than SO... so there's both the aspect of eventually consistent migrations in SOs, and formalizing migrations of freeform / ECS indices.