elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.66k stars 8.1k forks source link

Improve data plugin handling of `resolve` #111196

Closed jportner closed 7 months ago

jportner commented 2 years ago

Overview

To support #100489, the data plugin needs to be able to use SavedObjectsClient.resolve to load ~index patterns~ data views. This way, "deep link" URLs that depend on an index pattern ID (such as a global filter?) will not break after the 8.0 release.

The issue to track that work is in #108335.

The current plan is to change all of the data plugin's usage of SavedObjectsClient.get under the hood to use SavedObjectsClient.resolve instead. If a conflict outcome occurs, it will throw an error. Otherwise, consumers will be unaffected and continue operating as expected.

Eventually we intend to make additional changes so that the outcome is passed down to consumers, and they can handle it accordingly -- that is not tracked in an issue yet, it will need to happen sometime in the 8.x timeframe.

What's the problem?

As mentioned above:

The current plan is to change all of the data plugin's usage of SavedObjectsClient.get under the hood to use SavedObjectsClient.resolve instead.

This means that all downstream consumers of the data plugin will be using resolve more often than needed. I don't have a very clear picture of the exact impact, but a cursory search reveals that 62 plugins depend on the data plugin.

I'll copy what I wrote in #111191:

As mentioned in the Sharing Saved Objects developer guide (FAQ 6), saved objects should only retrieved with SavedObjectsClient.resolve when loaded through a deep link URL. There are several reasons for this:

  1. resolve is less performant
  2. resolve is a temporary stop-gap to get us over this conversion hump without breaking existing deep link URLs, but we eventually want to get rid of it. To that end, we are trying to minimize usage of resolve, and we are collecting usage data on how often it's used / what the different outcomes are.
  3. We want to measure how often users are running into aliasMatch and conflict outcomes versus the exactMatch outcome; this usage data gets diluted if plugins call resolve in a situation where the outcome will always be `exactMatch1.
  4. Conversely, consumers can run into undesired aliasMatch and conflict outcomes more often than necessary if they use resolve in non-deep-link situations.

What can we do about it?

Ultimately the "correct solution" to minimize usage of resolve is to add a separate resolve API to the data plugin's IndexPatternsService. Doing this would mean that we would need to identify any downstream consumers of the data plugin who rely on index pattern IDs in deep link URLs, and change them to use resolve instead.

We have two options to move forward:

  1. ~Continue with the current plan for now and implement the "correct solution" later in 8.x~
  2. ~Implement the "correct solution" now, before the 8.0 release~

Edit for clarity (see https://github.com/elastic/kibana/issues/111196#issuecomment-914315002):

Edit 2: we will proceed as planned with Option 1, see https://github.com/elastic/kibana/issues/111196#issuecomment-929639351 for details

Option 1

~Option 2~


I'm afraid that I don't have a full idea of the impact of approach (1), and I don't have a good enough understanding of the data plugin's consumers to really have a full idea of the work involved in implementing the "correct solution".

I am nervous that 62 plugins depend on the data plugin, and I am afraid that the current approach may cause usage of SavedObjectsClient.resolve to explode by an order of magnitude.

elasticmachine commented 2 years ago

Pinging @elastic/kibana-security (Team:Security)

elasticmachine commented 2 years ago

Pinging @elastic/kibana-app-services (Team:AppServices)

kobelb commented 2 years ago

Thank you for raising this concern @jportner, it's an important one. Whatever code either directly or transitively relies on SavedObjectsClient.resolve has to handle the possibility of a conflict occurring, which adds complexity for consumers. As a result, we really should limit its usage to where absolutely necessary.

If we were to go with option 1, would the plan be to have the IndexPatternsService.get use SavedObjectsClient.resolve internally by 8.0. Until these saved-objects are truly shareable, there won't be the potential for a conflict, so for 8.0 we can hide the fact that there's a potential conflict. Then, in an effort to make index-patterns shareable, we'd add an IndexPatternsService.resolve method and use that where appropriate and switch IndexPatternsService.get to use SavedObjectsClient.get?

jportner commented 2 years ago

Yeah, unfortunately as soon as objects are made share-capable in 8.0, you can experience a conflict outcome from a resolve. My biggest fear is that with option 1, we open ourselves up to a scenario like this:

A user has some 3rd party integration (or Terraform or something else) that is designed to create/recreate saved objects using our HTTP APIs. They upgrade to 8.0, and their integration runs, and a bunch of index patterns get created with their old IDs. This means that whenever a ~index pattern~ data view loads anywhere in the Kibana client, it throws an error. So that's... probably most Kibana apps that are affected.

If we did go with option 2, data views would only break where they are included in a deep link. So we shield our users from that impact.

kobelb commented 2 years ago

Gotcha, thanks for correcting my understanding @jportner. Option 1 does sound like a very bad user experience and something that we would absolutely want to avoid.

With option 2, would anything be breaking or would the user just be shown a dialog alerting them that there was a conflict and allow them to proceed?

ppisljar commented 2 years ago

I think we should mention in the issue that the same applies to saved queries saved object as well.

It feels to me option 2 is a huge effort and definitely requires careful design and planning. To just briefly describe what we are looking at: Every consumer of index patterns would need to start using indexPatterns.resolve (instead of .get) and correclty handling the response. Eventually this information needs to propagate all the way up to the application, as it responsible for managing the url.

Lets look at our dashboard application for example:

What I am trying to say is that with 62 plugins depending on data and our deep dependency trees at the moment its impossible to estimate the complexity and amount of work we are looking towards so planning it for 8.0 seems unrealistic to me.

timroes commented 2 years ago

We have two options to move forward:

Continue with the current plan for now and implement the "correct solution" later in 8.x Implement the "correct solution" now, before the 8.0 release

My understanding was, that we discussed exactly that in a meeting on August 11th and decided to not implement the "correct solution" until we rediscuss whether it's really needed sometime in 8.x. Could we clarify if something has changed since then, to make us rethink this decision?

Yeah, unfortunately as soon as objects are made share-capable in 8.0, you can experience a conflict outcome from a resolve.

I am a bit confused by that. My understanding (from whenever I asked that question), was that a conflict can never naturally occur and only if users will mess with saved objects. If this is a state that weill be naturally appearable from doing things within the UI, I think this def has more severity than so far. Or can it still only happen via API (as in the scenario described above)?

I am also not entirely sure, how will adding a indexPatterns.resolve API solve the described problems? This API itself will call resolve again and "forward" the result. Then apps go and call this method instead, so we're still calling resolve very often (i.e. not reducing the performance drawback it might bring)?

Is my understanding correct that we only wouldn't need to call it in those places, where we are 100% sure the index pattern ID ran through the saved object migration? Which for a lot of the core applications is never true, due to the fact we're handling state within the URL - which we eventually want to get rid of, but are def still quiet far away from that.

jportner commented 2 years ago

I realize what I wrote may have been a bit confusing, so let me start by clarifying:

Eventually we intend to make additional changes so that the outcome is passed down to consumers, and they can handle it accordingly -- that is not tracked in an issue yet, it will need to happen sometime in the 8.x timeframe. ... Ultimately the "correct solution" to minimize usage of resolve is to add a separate resolve API to the data plugin's IndexPatternsService. Doing this would mean that we would need to identify any downstream consumers of the data plugin who rely on index pattern IDs in deep link URLs, and change them to use resolve instead.

We have two options to move forward:

  1. Continue with the current plan for now and implement the "correct solution" later in 8.x
  2. Implement the "correct solution" now, before the 8.0 release

I started by mentioning that we are not attempting to pass the resolve outcome to consumers. I did not mean to suggest that we should attempt to do that in Option 2. I just wanted to point out that will be an improvement we need to make down the road.

So let me restate, here is what I was attempting to suggest:

Option 1

Option 2


Gotcha, thanks for correcting my understanding @jportner. Option 1 does sound like a very bad user experience and something that we would absolutely want to avoid.

With option 2, would anything be breaking or would the user just be shown a dialog alerting them that there was a conflict and allow them to proceed?

@kobelb Option 1 and Option 2 both break the user experience in 8.0 in the case of a "conflict" outcome, because they both throw an error. It's up to consumers to deal with that error.

I think we should mention in the issue that the same applies to saved queries saved object as well.

It feels to me option 2 is a huge effort and definitely requires careful design and planning. To just briefly describe what we are looking at: Every consumer of index patterns would need to start using indexPatterns.resolve (instead of .get) and correclty handling the response. Eventually this information needs to propagate all the way up to the application, as it responsible for managing the url.

@ppisljar I'm sorry, I understand that I wasn't clear. That isn't what I meant to suggest.

All I'm saying is that in option 2, we identify the consumers that need to use resolve, and change them to use it. That way we reduce the reliance on resolve. That would just be a one-line code change for each consumer.

All of the outcome handling can be tackled later in 8.x, in my opinion.

Lets look at our dashboard application for example:

  • Index patterns are user inside filters, so filters would need to use resolve and provide a way to let the application know about different outcomes so app can react correctly
  • index patterns are used inside most of the panels for example inside visualization, where dependency tree gets deep: visualize embeddable -> specific visualization -> expressions -> specific expression function -> search source -> search strategy

What I am trying to say is that with 62 plugins depending on data and our deep dependency trees at the moment its impossible to estimate the complexity and amount of work we are looking towards so planning it for 8.0 seems unrealistic to me.

Thank you for the example, that really helps frame things. With my clarified suggestion on Option 2, do you think that's more realistic to potentially do for the 8.0 release?

My understanding was, that we discussed exactly that in a meeting on August 11th and decided to not implement the "correct solution" until we rediscuss whether it's really needed sometime in 8.x. Could we clarify if something has changed since then, to make us rethink this decision?

@timroes sorry about the confusion. Correct me if I am misremembering, but I was under the impression that on the August 11th meeting, we decided to defer exposing the resolve outcome to consumers. I 100% agree with that and I was not trying to suggest we tackle that before the 8.0 release. However, I did not walk away from that meeting with the impression that all consumers of the data plugin were going to transparently use resolve under the hood. If that's what was stated and agreed upon by the rest of the group, I apologize for missing that. As soon as I discovered the app services team's actual implementation plan for 8.0, I created this issue.

I am a bit confused by that. My understanding (from whenever I asked that question), was that a conflict can never naturally occur and only if users will mess with saved objects. If this is a state that weill be naturally appearable from doing things within the UI, I think this def has more severity than so far. Or can it still only happen via API (as in the scenario described above)?

Yes, a "conflict" should not occur during normal Kibana usage when you create, share, import, or copy saved objects. However, any integration that creates/recreates saved objects with hardcoded IDs can inadvertently trigger a legacy URL conflict scenario. For example, when the fleet plugin installs an integration and creates "Kibana assets" (dashboards, visualizations, etc), it does this today. Same with beats that use the legacy import API. We are attempting to make sure that we cover all of our bases for such 1st party integrations, but that leaves 3rd party integrations unaccounted for. We have usage data indicating that there is a non-negligible amount of 3rd party API usage that could result in conflict scenarios.

So, in my opinion, we need to treat these conflict scenarios as real possibilities and handle them in the best way we possibly can for the 8.0 release.

I am also not entirely sure, how will adding a indexPatterns.resolve API solve the described problems? This API itself will call resolve again and "forward" the result. Then apps go and call this method instead, so we're still calling resolve very often (i.e. not reducing the performance drawback it might bring)?

What I am trying to suggest is that the data plugin should expose two separate APIs, get and resolve. Only consumers that load index patterns with an ID from a deep link URL should attempt to use resolve. All other consumers should continue using get.

Assuming that we do have a legacy URL conflict scenario:

Is my understanding correct that we only wouldn't need to call it in those places, where we are 100% sure the index pattern ID ran through the saved object migration? Which for a lot of the core applications is never true, due to the fact we're handling state within the URL - which we eventually want to get rid of, but are def still quiet far away from that.

Can you clarify how many consumers rely on deep link URLs to load index pattern IDs? Out of those 62 plugin consumers, it would still be a minority, right?

petrklapka commented 2 years ago

Sorry if this is a naive suggestion from someone who's never developed on Kibana, but... Why not embed something in the ID string that identifies the "new" generation of IDs as new? If I understand correctly from speaking with Peter P. and Stacey, the ID's are just strings. Why not embed some meta data in the ID, like an ID version (Example: "IDRev:2.0.0:ID:TheActualIDString"). The resolve function would then be immediately able to tell if this was a new or old ID and there would be no chance for a conflict. The resolve function could key off of this meta data to improve performance, since it would immediately be able to tell an old ID from a new one, and in the case of old IDs, skip looking for them in whatever collection/repository is used to hold the new IDs. I would imagine the ID generator would need to be modified to do this, but other touch points should be minimal.

jportner commented 2 years ago

Sorry if this is a naive suggestion from someone who's never developed on Kibana, but... Why not embed something in the ID string that identifies the "new" generation of IDs as new? If I understand correctly from speaking with Peter P. and Stacey, the ID's are just strings. Why not embed some meta data in the ID, like an ID version (Example: "IDRev:2.0.0:ID:TheActualIDString"). The resolve function would then be immediately able to tell if this was a new or old ID and there would be no chance for a conflict.

@petrklapka we do deterministically generate new object IDs. However, the new ID isn't the problem here, it's the old ID.

Let's say that you have an object "123", and then you upgrade Kibana to 8.0, and the object's ID gets changed to "456". During the upgrade, Kibana also creates an alias for "456 -> 123". So, after upgrade, if you try to get(123), the object is not found -- that object no longer exists. That's why we need to use resolve(123) instead; if you do that, Kibana checks for an alias, finds that you are actually looking for object "456" instead, and returns it.

However, let's say that after the upgrade, you (re)create object "123". Now, you have two objects: "123" and "456". This is what I have been calling a "conflict scenario"; it enables the "conflict" outcome.

Now when you resolve(123), Kibana finds two objects, and it doesn't know which one you are actually looking for. That is the "conflict" outcome.

The resolve API actually returns the "most correct" match in this situation (which in this case is object "123", not object "456"). But we also think it's important in these situations to surface some sort of UI to the user, so they are aware that, hey, this might not actually be the object you're looking for, there's another one that matches this URL. The thing is that with App Services concerns, fetching these saved objects is abstracted away from the application itself, so we don't have an opportunity to put a nice UI around it. That's why stakeholders decided to throw an error in this situation instead.

The resolve outcomes are detailed in the docs here along with lots of other info about why the object IDs need to be changed, etc.


That said:

We have controls in place to prevent conflict scenarios from occurring in the first place when you 1. import, 2. copy, or 3. share objects. We currently have no such guardrails fo the direct SOC create/bulkCreate APIs. Aside from tampering with raw ES documents, using create/bulkCreate directly with "old" object IDs is the only way that a conflict scenario can occur. This should not happen organically but it could with an integration.

After speaking with tech leadership some more I think there is more we can do to try to prevent this scenario at create/bulkCreate time. That still leaves us with the other problems I outlined above RE: overusing resolve, but it would reduce or eliminate the biggest one, which is the conflict/error situation that we hope to avoid.

So I'm going to submit an RFC for that. In the meantime, I think we can put a pin in this topic and revisit after the RFC is accepted or rejected :+1:

ppisljar commented 2 years ago

@jportner i guess what @petrklapka is suggesting is that the new objects would look like new-456 instead of 456 and when you (re)create object 123 it will actually get id new-123 thus conflict is not possible. .resolve can also know that if id does not start with new- it needs to look in alias.

and thanks for clarifying we don't need to do anything different for 8.0, regarding 8.x it still seems a big task but we can figure out later what/how exactly to do it.

stacey-gammon commented 2 years ago

thanks for clarifying we don't need to do anything different for 8.0,

I may be misunderstanding this comment. Is the App Services team still planning to switch over usages of .get to .resolve where an un-migrated id may be encountered (e.g. URLs)?

Eliminating the need to handle conflict scenarios makes it seem like the team can use resolve, even if it will be called with migrated ids as well as un-migrated ids. Not ideal, but at least no errors will be encountered and we can slowly figure out how to switch over to .get?

Perhaps this question can wait for the RFC, however, I'm wondering how we will know when it's safe to switch calls from .resolve to .get. Will we be able to track how many old ids are passed through a .resolve call? Will we be able to know where the .resolve call originated from so we can flip them to .get on a case by case basis?

jportner commented 2 years ago

So I'm going to submit an RFC for that. In the meantime, I think we can put a pin in this topic and revisit after the RFC is accepted or rejected 👍

The RFC was approved and I created an issue for the work, which I plan to begin this week: #113335

Given that, our concerns regarding conflict scenarios are largely alleviated, since they should not be able to happen organically either by using the Kibana app or using its HTTP APIs.

I think we should leave this issue open to track that work needs to be done, but it's safe to say that we can proceed as planned with Option 1 as outlined in this issue's description 🎉

and thanks for clarifying we don't need to do anything different for 8.0, regarding 8.x it still seems a big task but we can figure out later what/how exactly to do it.

@ppisljar Right, we just need to make sure that any method the data plugin exposes to consumers uses .resolve under the hood before the 8.0 release. Other work to try to reduce our usage of .resolve can wait until later.

I may be misunderstanding this comment. Is the App Services team still planning to switch over usages of .get to .resolve where an un-migrated id may be encountered (e.g. URLs)?

Eliminating the need to handle conflict scenarios makes it seem like the team can use resolve, even if it will be called with migrated ids as well as un-migrated ids. Not ideal, but at least no errors will be encountered and we can slowly figure out how to switch over to .get?

@stacey-gammon Agreed, this was my intent as well.

Perhaps this question can wait for the RFC, however, I'm wondering how we will know when it's safe to switch calls from .resolve to .get. Will we be able to track how many old ids are passed through a .resolve call? Will we be able to know where the .resolve call originated from so we can flip them to .get on a case by case basis?

I mentioned in the RFC, but will restate here for the general public 😄 we are collecting usage data on .resolve, we cannot trace with such fine detail but we will be able to tell 1. how often resolve is being used, and 2. how often users are experiencing each outcome.