[APM] ML jobs are limited to a single space

elastic / kibana

Your window into the Elastic Stack

https://www.elastic.co/products/kibana

Other

19.66k stars 8.23k forks source link

[APM] ML jobs are limited to a single space #119975

Open sorenlouv opened 2 years ago

sorenlouv commented 2 years ago

Problem ML jobs created through APM is currently only displayed in the space where they were created.

This means that even though the user has the exact same 5 services across all their spaces, they need to create ML jobs in every one of them. This both leads to a bad user experience and a performance overhead since we need duplicate ML jobs.

Expectation ML jobs should be space agnostic.

elasticmachine commented 2 years ago

Pinging @elastic/apm-ui (Team:apm)

jgowdyelastic commented 2 years ago

The module setup endpoint can take the boolean argument applyToAllSpaces. If set to true, all jobs created by the module will be put in the * space

sophiec20 commented 2 years ago

I am not entirely sure that ML jobs should be space agnostic. Spaces are used to segment access to certain data sources. Results from ML jobs contain a subset of the data that has been analysed. It is reasonable to think that a customer who wishes to segment (and restrict access to) their APM data in-app, will also want to segment (and restrict access to) their ML results. If this is the case, then the ML job should be space aware.

sorenlouv commented 2 years ago

Thanks for chiming in @sophiec20 . Nothing is decided yet, and things will most likely change as we start working on service groups (can't link to design doc since it's in a private repo but can send on slack). Either way, my current thoughts on this:

Ideally I think the ML jobs should follow the APM index pattern settings. So by default all spaces will read from traces-apm*,metrics-apm*. In that case I don't see the need for separate ML jobs.

However, if the user changes the index patterns for a space, eg traces-apm-custom-region*,metrics-apm-custom-region* I can see that it could make sense to have a separate ML job here.

It is currently not possible to modify index pattern settings per space but it is something I expect us to add in the future.

It is reasonable to think that a customer who wishes to segment (and restrict access to) their APM data in-app

Afaik spaces cannot be used to restrict access to APM data in a secure manner since the user will still be able access the raw data via Dev Tools or Discover (or by requesting ES directly). To restrict access admins will need to use document level security (which won't apply to ML jobs afaict) or ingest data to separate data stream namespaces and then restrict access on an index level.

jportner commented 2 years ago

@peteharverson asked for my 2 cents on this--

It sounds reasonable to take some steps to avoid duplication of ML jobs. Keep in mind that the end user must have write access to the ML privileges in * All spaces to be able to create an ML jobs in all spaces.

The "Share saved objects to spaces" flyout contains a space selector that checks for the user's privileges and presents them with an option to select spaces.

ml-job

The Security plugin exposes an endpoint to check these privileges. If the user is not authorized to share the object to All Spaces, that option is disabled and a tooltip icon is displayed indicating why.

It sounds like APM could benefit from a component that behaves similarly -- perhaps it could just allow the user to select "This space" or "All spaces" for simplicity? Fleet will have a similar need in the future WRT "package assets" (dashboards, visualizations, etc. that are installed with integrations)

I don't have all of the context on this, but based on what I know, here's what I propose:

When creating an ML Job through APM:

Display a control component that allows the user to choose between "All spaces" or "This space"
- "All spaces" will be the default option
- If the user does not have the required privileges, fall back to "This space"
In the future, when APM indices can be customized on a per-space basis, if that has been customized, change the default option for this control to "This space" instead (and possibly show some sort of warning/callout below it)

If that sounds desirable, the Platform Security team can work on exposing a reusable control component for you to consume.

Related: #49647

sorenlouv commented 2 years ago

@jportner Thanks for chiming in!

It sounds like APM could benefit from a component that behaves similarly -- perhaps it could just allow the user to select "This space" or "All spaces" for simplicity?

I don't think this will provide a good user experience. Imagine the user has the following three spaces with separate index settings:

Space 1: apm-foo-*
Space 2: apm-foo-*
Space 3: apm-bar-*

If an ML job is created in Space 1 and "All spaces" is selected, it will also apply to Space 3 which has a totally different data set. We will be showing ML anomalies detected on apm-foo-* but overlay them on apm-bar-*.

If the user instead creates an ML job in Space 1 and selects "This space" the job will be isolated to Space 1 and the user will have to create an identical job for Space 2 even though the data set is the same. Thus wasting resources, costing them more money, and requiring them to create the same job twice.

Proposal What I propose is that ML jobs are always space agnostic. So when an ML job is created in Space 1 it will be available in Space 2 and Space 3. When the ML job is created it will contain metadata about indicies it is create for. Eg. the ML job created in Space 1 will contain {metadata: {indicies: "apm-foo-*" }}. When retrieving ML jobs only jobs that match the index settings for the current space will be displayed. So the ML job created in Space 1 will be displayed in Space 2 but not in Space 3.

The advantages of this is that implementation details are abstracted away from the user. A job created in one space is automatically available in another space if they have the same index settings, and not in spaces with different index settings.

Let me know if we should zoom on this.

jportner commented 2 years ago

Nit: let's please try not to use the term "space agnostic", that means something else entirely when it comes to Saved Objects (those are specific object types that exist outside of spaces), and using that term here might confuse other folks who read this thread. I'd prefer the term "shared to all spaces" 👍

The advantages of this is that implementation details are abstracted away from the user. A job created in one space is automatically available in another space if they have the same index settings, and not in spaces with different index settings.

That's an interesting idea.

A few things stick out to me:

This means that users would only be able to create ML jobs for APM if they have "global access" to the ML feature, is that what you really want?
If you had already created a job, then you changed the index setting for the current space to some other value, then the job would just disappear from your view. That wouldn't be a great user experience either. Maybe it would be better to provide a filter in the UI, that way the default presentation just shows them the relevant ML jobs, but the user can still find other ML jobs in this space that don't match the configured index (and maybe you render some warning in the UI for it).
Do you envision that this would also show jobs in a subset of indices that match the given pattern? for example, if the current space is configured for index apm-foo-123, would you expect to see an ML job that was created for apm-foo-*? AFAIK it wouldn't be trivial to implement a pattern-matching filter in the basic Saved Objects client, as I don't believe we support wildcard queries yet.

Ultimately I think that "APM ML jobs are always shared to all spaces" is a non-starter, our Product team has indicated that the multi-tenant use case is important and we want to use Spaces to be able to truly isolate tenants from one another.

Alternatively the workflow could could be something like this:

Click the button to create an ML job
Kibana first searches for ML jobs in other spaces that exactly match the configured index
If we find one, we prompt the user to "Share" that existing ML job to the current space instead of creating a new one
Otherwise, we allow the user to create the job only in the current space

This has a few advantages:

No need to ask the user to select "This space" vs "All spaces"
All of this would happen with the end user's credentials, so if there is an ML job in another space that they can't access (as in the multi-tenant use case), they just won't see that another job exists, and they can still create an ML job in the current space.
The "share to space" decision happens in each space where a user with the most context can decide what to do, we aren't arbitrarily sharing ML jobs to other spaces when they are created

WDYT?

sorenlouv commented 2 years ago

This means that users would only be able to create ML jobs for APM if they have "global access" to the ML feature, is that what you really want?

That's a good point. We might have to do that to reduce complexity on our side. However, I hope we can avoid it. Ideally users without global access should still be able to create ML jobs for the spaces they have access to. In that case the ML job would be isolated to those spaces.

If you had already created a job, then you changed the index setting for the current space to some other value, then the job would just disappear from your view.

Changing the index setting should restart the ML jobs with the new settings.

Maybe it would be better to provide a filter in the UI, that way the default presentation just shows them the relevant ML jobs, but the user can still find other ML jobs in this space that don't match the configured index

In general I think we need to try harder not to push complexity onto users. In this case we should just show the relevant ML anomalies - I don't see the benefit in showing a drop down with invalid results.

Do you envision that this would also show jobs in a subset of indices that match the given pattern?

No, for exactly the reasons you mention. It wouldn't be trivial to implement and we should use exact matches only.

Ultimately I think that "APM ML jobs are always shared to all spaces" is a non-starter, our Product team has indicated that the multi-tenant use case is important and we want to use Spaces to be able to truly isolate tenants from one another.

I would love to hear more about this, and what the thinking is around solution teams where the underlying data is inherently space-agnostic (I think I'm using the right term here). In our case APM data lives in data streams which is not space aware, so we have to decide what dimension to partition data by. In our case we were going to simply use the index setting for making APM data space aware. Thus, everything else in APM should follow this model. If we have multiple competing dimensions by which there is space awareness (index settings, ML jobs etc) it can easily become very confusing for the end user.

Alternatively the workflow could could be something like this:

Click the button to create an ML job

Kibana first searches for ML jobs in other spaces that exactly match the configured index

If we find one, we prompt the user to "Share" that existing ML job to the current space instead of creating a new one

Otherwise, we allow the user to create the job only in the current space

I think that could work! Even better if the user is not asked to decide between sharing and creating a new job. This again feels like we are bleeding implementation details through to them. Instead I think, if a matching job exists, it should be shared with the current space when they click "Create".

jportner commented 2 years ago

I would love to hear more about this, and what the thinking is around solution teams where the underlying data is inherently space-agnostic (I think I'm using the right term here). In our case APM data lives in data streams which is not space aware, so we have to decide what dimension to partition data by.

So, Kibana's current authorization model demands that operators assign ES index privileges, ES cluster privileges, and Kibana privileges separately. I can appreciate that this isn't ideal for solutions, though. We have an open issue for supporting "composite features" in Kibana that automatically include the appropriate ES privileges: #96598

In Fleet's case, we are implementing some changes so that you cannot assign the Fleet privilege to a user in specific spaces, you can only assign it in * All spaces (#118001). But Fleet is an odd case and we expect this to be the exception, not the rule. We made that decision fully realizing that this prevents multi-tenant / MSP operators from allowing their users to use Fleet.

Changing the index setting should restart the ML jobs with the new settings.

If the ML job existed in other spaces, though, would it change the settings there too? Or would you expect the user to go back to those other space(s) and create a new ML job?

This idea of tying the ML job to the space's index setting sounds hairy, tbh.

In general I think we need to try harder not to push complexity onto users.

100% agreed! We can always add additional customization in the future if we decide we need it. But it's much harder to take away a feature once we've released it.

I think that could work! Even better if the user is not asked to decide between sharing and creating a new job. This again feels like we are bleeding implementation details through to them. Instead I think, if a matching job exists, it should be shared with the current space when they click "Create".

Sounds reasonable, and that would be easy to do.

These APM ML jobs almost feel like they should be "managed" saved objects, in that they really aren't intended to be changed by the end user -- is that right? Fleet (again) has a similar need, though I can't find an open issue for it, I think it's buried in one of their issues.

peteharverson commented 2 years ago

These APM ML jobs almost feel like they should be "managed" saved objects, in that they really aren't intended to be changed by the end user -- is that right? Fleet (again) has a similar need, though I can't find an open issue for it, I think it's buried in one of their issues.

Note for 8.1 we plan to add a 'Managed' badge for anomaly detection jobs to indicate when they have been deployed and managed by elastic - https://github.com/elastic/kibana/issues/120631. This will include adding warnings when the user edits, deletes, stops or starts these type of jobs.