elastic / kibana

[ML] Add cache for module recognize endpoint #136190

Open · jgowdyelastic opened this issue 1 year ago

jgowdyelastic commented 1 year ago

Calling the module recognize endpoint ml/modules/recognize can be expensive, as it runs all queries in all manifest files over the supplied index pattern. To speed this up, we could cache the results in a saved object and return them on subsequent calls. We could periodically refresh these cached results in a background task and/or delete them if a long time has passed since they were created.

The security alerts UI has an ML menu which calls recognize every time it is opened, so using cached results would greatly reduce the load on Elasticsearch if this menu is opened frequently.
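
To make the idea concrete, here is a minimal sketch of what such a caching layer could look like. It is not the actual implementation: `runRecognize`, the module shape, and the TTL value are all illustrative, and in Kibana the entries would live in a saved object (refreshed by a background task) rather than in process memory.

```typescript
// Illustrative sketch only: caches recognize results per index pattern with a TTL.
// `runRecognize` stands in for the real call to ml/modules/recognize.

interface RecognizeModule {
  id: string;
  title: string;
}

type RecognizeFn = (indexPattern: string) => Promise<RecognizeModule[]>;

interface CacheEntry {
  results: RecognizeModule[];
  createdAt: number;
}

// Arbitrary TTL; a real implementation could instead refresh entries in a background task.
const CACHE_TTL_MS = 15 * 60 * 1000;

export function createCachedRecognize(runRecognize: RecognizeFn) {
  // In Kibana this map would be replaced by a saved object so results survive restarts.
  const cache = new Map<string, CacheEntry>();

  return async function cachedRecognize(indexPattern: string): Promise<RecognizeModule[]> {
    const cached = cache.get(indexPattern);
    if (cached !== undefined && Date.now() - cached.createdAt < CACHE_TTL_MS) {
      // Fresh enough: skip running every manifest query against the index pattern.
      return cached.results;
    }
    const results = await runRecognize(indexPattern);
    cache.set(indexPattern, { results, createdAt: Date.now() });
    return results;
  };
}
```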

elasticmachine commented 1 year ago

Pinging @elastic/ml-ui (:ml)

mjraa commented 1 year ago

Hi @jgowdyelastic,

In #119635 you mentioned:

When the menu is opened we need to determine which jobs are applicable to the indices used by the security alerts.

May I ask why this is really needed? Is it just for improved user experience? These queries can be quite expensive, which is why I am wondering whether they are really necessary.

We could periodically refresh these cached results in a background task and/or delete them if a long time has passed since they were created.

Can they at least be executed only when a user clicks a button to "check if there is matching data"? These queries can be quite expensive, depending on the data volume. This gets worse because the same data view is shared across different pages in the Security app (not sure if this is a bug or by design; relevant: #136188).

I am really trying to understand why this is even needed. The only downside I can see to removing the queries from the manifest files is that a user will not be able to import them directly in Kibana. For organizations that have the goal of having everything as code, this is not really an issue :). In fact, it helps overcome the challenges with updating/overriding ML-based detection rules (#58720).

It would be great if we could disable this behavior.

jgowdyelastic commented 1 year ago

Hi @mjraa, without running the check to see which jobs are applicable to the data, the user would have no idea which jobs they can use. The security modules contain 47 jobs, and only a small subset of those might be applicable to the user's data.

It's worth pointing out that I am only referring to the request to ml/modules/recognize that is sent when the ML job settings menu is opened. I am not sure why an initial request is sent to ml/modules/recognize when the page is first loaded. @elastic/security-solution would have a better idea why this initial request is sent.

mjraa commented 1 year ago

Hi @jgowdyelastic,

It is understandable from a user experience perspective, but I am afraid Elastic is not considering the downside of executing expensive queries. This can cause big problems, like the one reported in #119635.

Why not:

  • have a button for the user to click to do the checking (maybe the fields needed can be exposed in the UI?)
  • add a time range filter, since the documentation mentions that machine learning jobs look back and analyze two weeks of historical data
  • cache the response during the interval specified in the time range filter

I just feel these types of optimizations are important when considering large clusters. Another example: before #119635, the queries were hitting the frozen tier, which is intended for data that is rarely queried.

In our case at least, we are fine with checking the documentation and the repo to understand whether an ML job is relevant, if that is what it takes to avoid these queries (by removing them from the manifest files). Being able to disable this behavior would be ideal, though.

But please consider this: the behavior may be OK for small clusters, but not for large ones, especially when other issues (#136188) make things even worse.

jgowdyelastic commented 1 year ago

Hi @mjraa

I appreciate that the current behaviour is not ideal for large clusters, and we should make changes to improve performance.

  • have a button for the user to click to do the checking (maybe the fields needed can be exposed in the UI?)
  • add a time range filter, since the documentation mentions that machine learning jobs look back and analyze two weeks of historical data
  • cache the response during the interval specified in the time range filter

These suggestions all seem reasonable to me. As part of this work we can look at incorporating these optimisations.
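
To illustrate how the suggestions could fit together, here is a rough sketch (names and signatures are illustrative, not the actual Kibana ML client API): the check only runs when the user asks for it, the query is bounded by a time range, and the cached result is reused while it still covers the requested range.

```typescript
// Illustrative sketch only: run recognize on demand, bounded by a time range,
// and reuse the cached result while it still covers the requested range.

interface TimeRange {
  from: number; // epoch ms
  to: number; // epoch ms
}

// Stand-in for a recognize call that filters the manifest queries by time range.
type RecognizeWithRangeFn = (indexPattern: string, range: TimeRange) => Promise<string[]>;

interface RangedEntry {
  moduleIds: string[];
  range: TimeRange;
}

export function createOnDemandRecognize(runRecognize: RecognizeWithRangeFn) {
  const cache = new Map<string, RangedEntry>();

  // Intended to be called from an explicit "check for matching data" button,
  // rather than every time the ML menu is opened.
  return async function checkMatchingModules(
    indexPattern: string,
    range: TimeRange
  ): Promise<string[]> {
    const cached = cache.get(indexPattern);
    if (cached !== undefined && cached.range.from <= range.from && cached.range.to >= range.to) {
      // The cached result already covers the requested interval; no new queries needed.
      return cached.moduleIds;
    }
    const moduleIds = await runRecognize(indexPattern, range);
    cache.set(indexPattern, { moduleIds, range });
    return moduleIds;
  };
}
```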