elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Watcher: Possible areas of future work #34546

Closed spinscale closed 6 months ago

spinscale commented 6 years ago

Fix tests

Right now there are just way too many watch tests broken and issues open regarding that. They need to be triaged and fixed.

Upgrade from 5.x may break when using stored scripts

Due to a change in how stored scripts are parsed in 6.x, a watch stored in 5.x may cease to work after the 6.x upgrade. The issue is described in #33058. This comment describes the result of a discussion: basically doing some form of read-repair. However, this approach requires a proper upgrade strategy when going to 7.x.

Storing Execution State as part of the Watch status

This is one of the most requested features. Right now every watch execution is stateless - the watch is loaded, executed, a watch history entry is created, and the status is stored together with the watch. However, if a watch execution wants to know its previous states (for example, to detect flapping of the condition and only trigger an alert if the last three conditions were true), then one could query the watch history and extract this from there - if it is stored there.

It would be really useful if there was an ability to store some kind of small state within the watch status. This state could contain arbitrary data structures (keys, values, maps, lists), but should not exceed a certain size to keep it reasonably small.

One open question is when this state should be set - before or after the condition is executed? This could be circumvented by simply setting it within any script. So you could have a script condition that does something like this:

"condition" : {
  "script" : """
  def last_total = ctx.state.last_total;
  def last_total_higher = last_total != null && last_total > ctx.payload.hits.total;
  ctx.state.last_total = ctx.payload.hits.total;
  return ctx.payload.hits.total > 0 && last_total_higher;
  """
}

When the watch status is stored, we could simply serialize the ctx.state object, store it in the watch status as well, and of course load it on the next execution.
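The round-trip described above could be sketched as follows. This is a hypothetical illustration, not Watcher code: the class, method names, and the entry-count cap are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: after execution, the (bounded) ctx.state map is stored
// with the watch status; before the next execution it is handed back to the
// script as ctx.state. MAX_ENTRIES and all names here are assumptions.
final class StoredWatchState {
    static final int MAX_ENTRIES = 64; // keep the state reasonably small

    private Map<String, Object> state = new HashMap<>();

    // called after execution: persist the state with the watch status
    void store(Map<String, Object> ctxState) {
        if (ctxState.size() > MAX_ENTRIES) {
            throw new IllegalArgumentException("watch state too large: " + ctxState.size() + " entries");
        }
        state = new HashMap<>(ctxState);
    }

    // called before the next execution: expose the stored state as ctx.state
    Map<String, Object> load() {
        return new HashMap<>(state);
    }
}
```

With this in place, the script condition above would find last_total under ctx.state on its second run.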

Prevent execution hotspots

Good getting started issue: this is mainly about thread pool/execution management and not so much about Watcher itself.

If you are using the interval-based schedule and you have 1800 watches running every 5m, in theory you have a window of 300 seconds, meaning you would only need to run 6 watches per second, which is not much. In practice, however, we currently load all watches at roughly the same time and count the 5 minutes from loading, so an execution hotspot is created that can result in watches not being executed because the thread pool queue size has been exceeded.

We could find fancy algorithms to even out the execution, but maybe something simpler, like catching the rejected executions and feeding them back into the trigger service to be picked up on the next execution, is already enough.
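One cheap way to avoid the hotspot in the first place could be sketched like this: derive a stable per-watch start offset from the watch id, so that 1800 watches on a 5m interval spread out to roughly 6 executions per second instead of all firing at load time. The class and method names are illustrative, not existing Watcher code.

```java
// Sketch: spread interval-triggered watches evenly over their interval by
// deriving a stable start offset from the watch id, instead of counting the
// interval from the shared load time.
final class TriggerJitter {
    // Returns a stable offset in [0, intervalMillis) for the given watch id.
    static long startOffsetMillis(String watchId, long intervalMillis) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod((long) watchId.hashCode(), intervalMillis);
    }
}
```

Because the offset only depends on the watch id, it survives restarts and node reassignments without any coordination.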

Silence a watch for some time

Good getting started issue.

The ability to silence a watch for, say, 20 minutes is a frequently requested feature. With the current infrastructure we could implement this using the existing Ack Watch API. One could add an expiry date or time, which then gets stored as a throttle period in the watch status.

More details can be found at #34038
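The expiry-based check described above could be sketched as follows. Field and method names are illustrative; this is not the Ack Watch API.

```java
import java.time.Instant;

// Sketch: a silenced watch skips execution until the stored expiry passes.
// The silencedUntil timestamp stands in for the throttle period proposed above.
final class WatchSilence {
    static boolean shouldExecute(Instant now, Instant silencedUntil) {
        // no silence set, or the silence window has elapsed
        return silencedUntil == null || !now.isBefore(silencedUntil);
    }
}
```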

Allow black/whitelisting of URLs the HTTP client used for inputs/actions can access

Good getting started issue.

Watcher uses its own HTTP client that can currently send any data anywhere. Users may want to restrict this by configuring an explicit blacklist, thus reducing the possible endpoints - especially if regular users are allowed to add watches.

More details can be found at #29937
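A minimal host-based check could look like the sketch below; a real implementation would likely also need whitelists, ports, and path patterns. All names here are illustrative.

```java
import java.net.URI;
import java.util.Set;

// Sketch: reject HTTP requests whose target host is on a configured blacklist
// before the Watcher HTTP client sends anything.
final class HttpEndpointFilter {
    private final Set<String> blockedHosts;

    HttpEndpointFilter(Set<String> blockedHosts) {
        this.blockedHosts = blockedHosts;
    }

    boolean isAllowed(URI uri) {
        // URIs without a host (e.g. relative URIs) are rejected outright
        return uri.getHost() != null && !blockedHosts.contains(uri.getHost());
    }
}
```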

Add support for HTTP basic auth passwords in keystore

Good getting started issue.

For slack/pagerduty/email we can store the password of an account in the keystore. For HTTP basic auth, however, we can only store it in a watch. Maybe we need something like

private static final Setting.AffixSetting<SecureString> SETTING_URL_SECURE =
        Setting.affixKeySetting("xpack.notification.http.", "password",
                (key) -> SecureSetting.secureString(key, null));

This would allow configuring a setting like xpack.notification.http.foo.password. Then all the watch would need to do is refer to this password in some way, i.e. ::es_keystore::foo instead of the actual password.

Issue at #30686

Use ILM for history indices

Good getting started issue.

Right now the cleanup of watch history indices is done by monitoring, as the code already existed there. Luckily we will soon have ILM in core, allowing us to reuse it and fix this properly. See #34073

Improve watch parsing

Parsing of watches is in general too lenient and needs to be improved. See #34073, #31852, #29746, #35189

Stop strict parsing of JSON structures sent to 3rd parties

Whenever a 3rd party provider updates their API and adds new fields, we are playing catch-up with them. Slack in particular is known for adding new features over time. As the watch API parsing is strict and parses all fields, new fields in the upstream API have to be added to the Watcher parsing code as well.

I think we should loosen this restriction and just take the JSON as-is and send it over to 3rd parties - ensuring that it is valid JSON seems sufficient to me and will save us development cycles over time.

Related issues are #34073, #31032, #30090

Improve watch index mappings

Metadata fields are searchable. This can result in conflicts. See #30542

The watch history should probably use index time sorting. See #30069 - OTOH this could slow down indexing and increase watch execution times.

UI bugfixes

There are some outstanding issues with the UI that require fixing and/or discussion.

  1. A watch can accidentally be marked with an error status despite having executed fine #22209
  2. Watcher UI cannot be opened due to watches without a 'to' field in a slack message #20970
  3. Watcher UI: Comment field is never filled up, should be removed? #24001

Prevent executions on single shard due to hashing function

This issue can only happen in certain configurations, but one could end up with watches being executed on one shard only (or on as many shards as there are primary shards, if that number is > 1). See #32636

Monitoring integration

The watches in monitoring need to be examined and fixed. See #33649 and #33051

Run action for each element in an array

Good getting started issue. Requires some sync with PM first.

Allow running an action for each element of an array; for example, for each hit of a search response you may want to create a separate slack message.

An action would look like this

"actions": {
  "logme": {
    "foreach" : "ctx.payload.hits.hits"
    "logging": {
      "text": "Id is: {{_id}}"
    }
  }
}

I started a branch at some point but never had the time to keep working on it and add proper tests.

SMTP CA configuration needs to be aligned with other TLS configuration

Configuring TLS for SMTP is different because it uses the javax.mail classes. We should align this with the regular security configuration, based on settings. See #30090

Allow to read cluster state metadata in watch payload

A recently added feature in Elasticsearch allows storing user-defined cluster metadata, see #33325. It might be interesting to expose this in the watch payload.

Allow to read output of earlier executed actions

The list of actions is executed sequentially. Many customers have a use-case where they want to create a ticket in an external system as the first action and then send a link to that created ticket in a second, email action. However, the current actions run independently from each other, and one cannot access the data generated by any earlier action. We should take a look at whether we can allow this by having something like a ctx.payload_actions field.

One potential issue: the order of the actions is not guaranteed, as this is not a list but a key-value mapping. So the order depends on the JSON client inserting the watch.
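The idea could be sketched as below: each action's output is recorded under its id so later actions can read it, with iteration order following the insertion order of the parsed map (which, as noted, is client-dependent). The field name ctx.payload_actions comes from the proposal above; the class and methods are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: accumulate each executed action's output under a
// ctx.payload_actions-like map so later actions can read it.
final class ActionOutputs {
    // LinkedHashMap preserves the insertion order of the parsed actions
    private final Map<String, Object> outputs = new LinkedHashMap<>();

    // record the output of an action after it ran
    void record(String actionId, Object output) {
        outputs.put(actionId, output);
    }

    // a later action reads an earlier action's output, or null if absent
    Object get(String actionId) {
        return outputs.get(actionId);
    }
}
```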

Have a story around watch deployments

This is also an ongoing discussion item for us. Right now we always assume that watches are just deployed into the cluster and that's it - but we should take a step back and think about how watches can be managed before they hit the cluster. Maybe integrating with Terraform is interesting; maybe something else is out there that could help. Integrating with an SCM, plus templating, are also things that need to be kept in mind. Sometimes you only want to change a part of your query and that's it, similar to what we do with our monitoring watches, where we (very basically) just change the cluster uuid to query for when several clusters are being monitored.

Reduce heap usage of cron schedules

The data structure used to store the information about when a watch should run is quite a memory hog due to using lots of TreeSet<Integer> structures. There are other cron data structures that use only bitsets and require roughly 3% of what the existing one does. However, those do not support all the features (like "run every second Friday of the month"), so the amount of memory saved might make us go with a hybrid approach here.
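The bitset-backed alternative could be sketched as one BitSet per cron field (e.g. 60 bits for the minutes field) instead of a TreeSet<Integer>. This is illustrative, not the Watcher implementation.

```java
import java.util.BitSet;

// Sketch: a bitset-backed cron field. A BitSet of 60 bits covers the minutes
// field in a few words of memory, versus a TreeSet node per allowed value.
final class BitsetCronField {
    private final BitSet allowed;

    BitsetCronField(int size, int... values) {
        allowed = new BitSet(size);
        for (int v : values) allowed.set(v);
    }

    boolean matches(int value) {
        return allowed.get(value);
    }

    // next allowed value at or after 'from', or -1 if none remain this cycle
    int next(int from) {
        return allowed.nextSetBit(from);
    }
}
```

Features like "every second Friday of the month" do not map onto a fixed-size bitset, which is why a hybrid approach is mentioned above.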

Track long running/stuck watches

This is lowest priority. Right now the user has no indication of long-running watches, except by using the watcher stats API and checking the stack traces manually. If a watch exposed itself as a Task, then the Tasks API could be used. This might also be useful for tracking long-running watches instead of writing our own code.

Parse only watch status in IndexingListener

This is lowest priority. Currently a watch is parsed twice when being stored: once on the coordinating node receiving the watch, and once in the WatcherIndexingListener in order to be scheduled. At that point we basically only need to access the watch status to check whether it is active, and that's it. There is no need to parse the whole watch again and, for example, try to compile its scripts.

Prevent storing of unbounded watch records

See #35997

Refactor out duplicate classes with TextTemplates

We currently have a ton of classes that have a TextTemplate-backed version of all of a class's methods, plus another version with the actual string/int accessors. Much of this can be cleaned up and deduplicated so that text templates are only used at render time.

elasticmachine commented 6 years ago

Pinging @elastic/es-core-infra

nithril commented 5 years ago

Not sure it is the right place, btw my wishlist is:

lucabelluccini commented 4 years ago

Foreach&Transform inconsistency in actions

When a transform is specified within an action, the foreach will access the original context instead of the result of the transformation. The workaround is to make the transform store the result inside ctx instead of using return <something>.

It is inconsistent because if there is no foreach, the action works on the result of the transform.

POST _watcher/watch/_execute
{
  "watch": {
    "trigger": {
      "schedule": {
        "interval": "5m"
      }
    },
    "input": {
      "simple": {
        "mydocs": [
          "a",
          "b",
          "c",
          "d",
          "e"
        ]
      }
    },
    "condition": {
      "always": {}
    },
    "actions": {
      "logging": {
        "transform": {
          "script": "return [ 'mydocs': ctx.payload.mydocs.stream().map(e -> e + 'transformed').collect(Collectors.toList()) ];"
        },
        "foreach": "ctx.payload.mydocs",
        "logging": {
          "text": "{{ctx.payload}}"
        }
      }
    }
  }
}

Result:

    "actions" : [
        {
          "id" : "logging",
          "type" : "logging",
          "status" : "success",
          "transform" : {
            "type" : "script",
            "status" : "success",
            "payload" : {
              "mydocs" : [
                "atransformed",
                "btransformed",
                "ctransformed",
                "dtransformed",
                "etransformed"
              ]
            }
          },
          "number_of_actions_executed" : 5,
          "foreach" : [
            {
              "logging" : {
                "logged_text" : "{_value=a}"
              }
            },
            {
              "logging" : {
                "logged_text" : "{_value=b}"
              }
            },
            {
              "logging" : {
                "logged_text" : "{_value=c}"
              }
            },
            {
              "logging" : {
                "logged_text" : "{_value=d}"
              }
            },
            {
              "logging" : {
                "logged_text" : "{_value=e}"
              }
            }
          ],
          "max_iterations" : 100
        }
      ]

Workaround:

    "actions": {
      "logging": {
        "transform": {
          "script": "ctx.payload.mydocs = ctx.payload.mydocs.stream().map(e -> e + 'transformed').collect(Collectors.toList());"
        },
        "foreach": "ctx.payload.mydocs",
        "logging": {
          "text": "{{ctx.payload}}"
        }
      }
    }

After checking the code, I think what happens is the following:

  1. A local variable payload is instantiated and initialised with ctx.payload() before any transform at https://github.com/elastic/elasticsearch/blob/8109a2cbfd5bdd418d6f7197e309f5cd360a2bc0/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/watcher/actions/ActionWrapper.java#L142
  2. If the transform is successful, the result of it ends up in payload at https://github.com/elastic/elasticsearch/blob/8109a2cbfd5bdd418d6f7197e309f5cd360a2bc0/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/watcher/actions/ActionWrapper.java#L153
  3. If a foreach path is specified, the execution works on the original ctx instead of the payload variable at https://github.com/elastic/elasticsearch/blob/8109a2cbfd5bdd418d6f7197e309f5cd360a2bc0/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/watcher/actions/ActionWrapper.java#L173
Bernhard-Fluehmann commented 3 years ago

@spinscale Please make the executing elasticsearch node name available as a ctx variable. This would be helpful for debugging.

As well it would be nice to have a dedicated watcher node role to control the nodes on which watches are allowed to run.

spinscale commented 3 years ago

Please see https://www.elastic.co/blog/distributed-watch-execution-elasticsearch-6.0 - you can use shard allocation filtering for dedicated watch execution nodes, and the node_id is stored in the watch history. Hope this helps!

Bernhard-Fluehmann commented 3 years ago

@spinscale Thank you for the shard allocation hint. This is very helpful.

The problem with node_id is that I would need it in the context, i.e. to use it in the content sent by the watch. Is there any solution available for this? I have tried to get it directly with Painless, e.g. reading the hostname system variable, but without success.

Bernhard-Fluehmann commented 3 years ago

@spinscale One more thing. I have tried to use metadata variables in search inputs to define the index and to set params, without success. Am I doing something wrong, or is this not supported? The goal is to have all variables of a watch defined in metadata and use them anywhere in the watch. This would greatly simplify watch design and improve readability.

dakrone commented 6 months ago

This has been open for quite a while, and we haven't made much progress on this due to focus in other areas. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.