elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Enrich processor: allow scheduling of policy executions #50071

Open tahaderouiche opened 4 years ago

tahaderouiche commented 4 years ago

Before an enrich processor can be used, an enrich policy must be executed. When executed, an enrich policy uses enrich data from the policy’s source indices to create a streamlined system index called the enrich index. The policy is executed manually by running PUT /_enrich/policy/my-policy/_execute, giving the user control over when the new data becomes part of the enrich index.

In cases where the policy’s source indices are constantly changing, the policy execution can also be scheduled.

Having the ability to schedule the execution (say daily or hourly) natively in Elasticsearch would make it more approachable and would benefit the case of constantly changing source indices.

elasticmachine commented 4 years ago

Pinging @elastic/es-core-features (:Core/Features/Ingest)

philippkahr commented 3 years ago

I would like to add my two cents here. Let's assume the following example with currencies and exchange rates.

My enrich processor matches on pair and sets exchange.rate. I have the following indices:

index: exchange-rate
  contains {pair : EURO-DOLLAR and exchange.rate: 1.1}
  is updated every 30 minutes using Filebeat.
index: latest-exchange-rate
  contains: {pair : .... } as above
  is updated every hour using transforms.
  it only includes the latest values for `pair`
index: currencies
  contains { pair: EURO-DOLLAR }
  uses an ingest pipeline with an enrich processor to populate the `exchange.rate` value.

My enrich policy, append-exchange-rate:

{
  "match": {
    "indices": "latest-exchange-rate",
    "match_field": "pair",
    "enrich_fields": ["exchange.rate"]
  }
}

Since I am using transforms to dynamically build a small index, I would expect the enrich processor to pick up the change to the original index and do the POST /_enrich/policy/append-exchange-rate/_execute automagically.

Personally this would definitely be needed for indices that have frequent changes and/or transforms that run often. I know that enrich policy building is costly.

Scheduling the policy update alone is definitely a nice feature, but there might be a window where the source data has already been refreshed but the policy update has not run yet. During that window, documents are populated with old and possibly wrong data.
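The window described above can be illustrated with a toy model. This is only a sketch and not Elasticsearch code: the class `EnrichPolicyModel` and its methods are hypothetical names, and plain dictionaries stand in for the source index and the enrich index. It assumes, as the issue description says, that executing a policy builds a separate enrich index from the source data.

```python
"""Toy model of the staleness window; not Elasticsearch code."""


class EnrichPolicyModel:
    """Models an enrich policy as a snapshot of its source index."""

    def __init__(self, source):
        self.source = source    # live source index, e.g. latest-exchange-rate
        self.enrich_index = {}  # snapshot that the enrich processor reads

    def execute(self):
        # Executing the policy copies the current source data into the snapshot.
        self.enrich_index = dict(self.source)

    def enrich(self, doc):
        # The processor looks up the snapshot, never the live source.
        enriched = dict(doc)
        rate = self.enrich_index.get(doc["pair"])
        if rate is not None:
            enriched["exchange.rate"] = rate
        return enriched


source = {"EURO-DOLLAR": 1.1}
policy = EnrichPolicyModel(source)
policy.execute()

source["EURO-DOLLAR"] = 1.2                     # the transform refreshes the source
stale = policy.enrich({"pair": "EURO-DOLLAR"})  # ingested before re-execution
policy.execute()                                # the scheduled execution runs
fresh = policy.enrich({"pair": "EURO-DOLLAR"})

print(stale["exchange.rate"], fresh["exchange.rate"])  # prints: 1.1 1.2
```

In this model, a shorter schedule interval narrows the window but does not close it. Only an execution triggered by the source change itself would.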

ar-mi commented 2 years ago

Hey! Has any work already been done on this?

puppylpg commented 2 years ago

It's 2022 now :smile_cat: and the issue is still open.

smnschneider commented 1 year ago

Would be great to get this feature!

leandrojmp commented 1 year ago

Hello, is there any update on this feature? Is this still being considered?

timor-raiman commented 1 year ago

+1 Alternatively - #58925

Rick25-dev commented 1 year ago

+1

Rick25-dev commented 1 year ago

+1

kossde commented 1 year ago

I was able to work around this issue by creating a watcher that performs an HTTP call to the cluster on a scheduled interval and re-executes the enrich policy.

Initially when I did this, I was running into trouble with the execution failing periodically, but that ended up being related to the auto_expand_replicas setting of the .enrich indices and our high disk utilization on one node. To get around that, I created an index template for .enrich indices that turns off auto_expand_replicas and sets the replica count to 1. The auto-execution now works like a charm!

leandrojmp commented 1 year ago

@kossde Hello, can you share the watcher JSON configuration that you used?

smnschneider commented 1 year ago

@leandrojmp: In the meantime I use a watcher like this. This one is for use with ECE, but it can easily be changed for use with different deployment methods.

{
  "trigger": {
    "schedule": {
      "interval": "1d"
    }
  },
  "condition" : {
    "always" : {}
  },
  "actions": {
    "webhook-execute_enrich_policy": {
      "webhook": {
        "scheme": "https",
        "host": "1.2.3.4",
        "port": 9243,
        "method": "PUT",
        "path": "/_enrich/policy/<enrich-policy>/_execute",
        "params": {
            "wait_for_completion": "false"
        },
        "headers": {
          "X-found-cluster": "<cluster-id>"
        },
        "auth": {
          "basic": {
            "username": "<enrich_executer>",
            "password": "<enrich_executer_password>"
          }
        }
      }
    }
  }
}

leandrojmp commented 1 year ago

Thanks @smnschneider!

It is pretty similar to the one I was testing, but could not use in production yet because it would require a restart of the nodes to apply the http certificate configuration.

In the end I'm using a simple script on crontab.
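The script itself was not shared in this thread. A minimal sketch of what such a cron-driven script might look like follows. The base URL, username, and password are placeholders rather than values from this thread, and the file name `execute_enrich.py` is hypothetical. The request uses POST, as in the earlier comment; the thread also shows PUT against the same path.

```python
#!/usr/bin/env python3
"""Sketch: re-execute an enrich policy from cron. Host and credentials are placeholders."""
import base64
import sys
import urllib.parse
import urllib.request

# Assumptions, not values from this thread: adjust for your cluster.
BASE_URL = "https://localhost:9200"
USERNAME = "enrich_executer"
PASSWORD = "changeme"


def build_execute_request(base_url, policy, username, password):
    """Build (but do not send) the request that executes an enrich policy."""
    path = "/_enrich/policy/{}/_execute".format(urllib.parse.quote(policy, safe=""))
    # wait_for_completion=false returns immediately, as in the watcher above.
    url = base_url.rstrip("/") + path + "?wait_for_completion=false"
    token = base64.b64encode("{}:{}".format(username, password).encode()).decode()
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": "Basic " + token}
    )


def execute_policy(policy):
    """Send the request and return the HTTP status code."""
    request = build_execute_request(BASE_URL, policy, USERNAME, PASSWORD)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: execute_enrich.py <policy-name>
    print(execute_policy(sys.argv[1]))
```

A crontab entry such as `0 * * * * /usr/bin/python3 /path/to/execute_enrich.py append-exchange-rate` (path hypothetical) would then run it hourly. If the cluster uses a certificate that the machine does not already trust, `urlopen` would also need an SSL context that trusts that CA.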

kossde commented 1 year ago

The watcher I wrote ended up being very similar as well. We now have several different enrich policies running in our environment, many of which are periodically executed via watcher as new values come into the indices. These policies are so useful… it eludes me as to why there isn’t an easier way to auto-execute them.

Anyway, I feel personally that there needs to be a way to apply trusted CA updates without whole cluster reboots. Maybe they could add an option to reapply the elasticsearch yaml or, at least parts of it, without forcing a full service restart. Surely it can’t be that difficult to set up a sort of configuration that lets us split the yaml into multiple files, some of which can be reloaded upon demand..?

kossde commented 11 months ago

On version 7.x I was able to do it by increasing the priority of the index template. In 8.x, though, the template refuses to apply. I was attempting to do the same thing you are by decreasing the number of replica shards. I don’t have a solution to this, though, as even when I do get the template to take, it reverts back to placing shards on all nodes as soon as the enrich policy re-executes.

On Wed, Sep 20, 2023 at 1:50 AM bil151515 wrote:

Hello @kossde, how do you manage to apply an index template to the enrich indices? I tried a template like this, but it won't apply to .enrich-*

{
  "template": {
    "settings": {
      "index": {
        "lifecycle": { "name": "enrichment" },
        "routing": { "allocation": { "include": { "_tier_preference": "data_content" } } },
        "auto_expand_replicas": "false",
        "number_of_replicas": "1"
      }
    },
    "aliases": {},
    "mappings": {}
  }
}


elasticsearchmachine commented 7 months ago

Pinging @elastic/es-data-management (Team:Data Management)

matabar commented 7 months ago

+1

carlopuri commented 7 months ago

+1 I have several situations where being able to schedule automatic policy execution would eliminate a lot of management work currently done by a human (me...). I have several policies to run and maintain, and having to run this task manually overcomplicates a simple, well-designed process like index enrichment. Please add this feature.

dnegrescu commented 7 months ago

+1

clement-fouque commented 6 months ago

Scheduling would be a great feature. Additionally, we could incorporate a continuous execution function, similar to the one in transforms. Although there may be some compromises, this would be particularly useful for small datasets.
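One way to picture continuous execution is a poller that re-executes the policy only when the source index appears to have changed. The sketch below is hypothetical decision logic, not an Elasticsearch or transform API: `Fingerprint` and `should_execute` are made-up names, and the fingerprint (document count plus newest timestamp) is an assumption about what a client could cheaply observe.

```python
"""Sketch: re-execute a policy only when its source looks changed (hypothetical logic)."""
from typing import NamedTuple, Optional


class Fingerprint(NamedTuple):
    doc_count: int         # number of documents in the source index
    latest_timestamp: str  # newest value of a timestamp field, ISO 8601


def should_execute(previous: Optional[Fingerprint], current: Fingerprint) -> bool:
    """Return True when the policy has never run or the source changed."""
    return previous is None or previous != current


# Example run over three polling cycles.
seen = None
decisions = []
for observed in [
    Fingerprint(10, "2023-09-20T01:00:00Z"),
    Fingerprint(10, "2023-09-20T01:00:00Z"),  # unchanged: skip
    Fingerprint(11, "2023-09-20T02:00:00Z"),  # new document: execute
]:
    if should_execute(seen, observed):
        decisions.append("execute")
        seen = observed
    else:
        decisions.append("skip")

print(decisions)  # prints: ['execute', 'skip', 'execute']
```

One compromise of this approach is that an in-place update that changes neither the count nor the newest timestamp would go unnoticed.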

The ability of ESQL to enrich at query time further emphasizes the need for this feature.

This suggestion could be associated with the partial update requests mentioned in the following issues:

supu2 commented 4 months ago

+1 Pinging @elastic/es-data-management (Team:Data Management)

Requium commented 4 months ago

+1

dominicbirch commented 4 months ago

+1

morgan-atwood commented 3 weeks ago

Is there any roadmap for when or if this feature will be available? This could help solve some headaches we're having with managing watchers just to update all our enrich policies. All our source indices are being updated constantly, and having the enrich policy refresh its index from the source every 1h/1d/1w would be very helpful.

webbersharhan commented 3 weeks ago

+1