elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Watcher executes on a node after the .watches shard has been moved to a different node #105933

Closed. masseyke closed this issue 1 week ago.

masseyke commented 4 months ago

Elasticsearch Version

8.8.2 (but likely others as well)

Installed Plugins

No response

Java Version

bundled

OS Version

n/a

Problem Description

This happens only rarely, but I have seen evidence of it twice in the same cluster a few days apart. As part of normal shard reallocation, a .watches shard gets moved off of a node (I'll call it nodeA) and onto another node (nodeB). This ought to mean that nodeA stops running watches. And we see in nodeA's logs:

paused watch execution, reason [no local watcher shards found], cancelled [0] queued tasks

But if we search .watcher-history-*, we see that now both nodeA and nodeB are executing the same watch, at nearly the same time on the same schedule. So instead of getting executed once every 10 minutes (for example), the watch gets executed twice every 10 minutes. Aside from the message above, I haven't seen anything relevant in the logs. Restarting the nodes solves the problem.

Steps to Reproduce

Unknown

Logs (if relevant)

No response

elasticsearchmachine commented 4 months ago

Pinging @elastic/es-data-management (Team:Data Management)

masseyke commented 4 months ago

Also, my initial thought was that since watcher history records are written asynchronously, there could just be a little lag. But the duplicate executions go on for days, until the cluster is restarted, and nodeA's history records carry the same timestamps as the corresponding records from nodeB.

masseyke commented 2 months ago

I still don't understand this, but I've been able to manually reproduce it by just artificially slowing down the pause logic. I'm writing the steps here since it's late on a Friday and I don't want to forget:

Relevant curl commands are below.

Create the watch:

curl -X PUT "localhost:9200/_watcher/watch/my-watch?pretty" -H 'Content-Type: application/json' -d'
{
  "trigger" : {
    "schedule" : { "cron" : "0 0/1 * * * ?" }
  },
  "input" : {
    "search" : {
      "request" : {
        "indices" : [
          "logstash*"
        ],
        "body" : {
          "query" : {
            "bool" : {
              "must" : {
                "match": {
                   "response": 404
                }
              },
              "filter" : {
                "range": {
                  "@timestamp": {
                    "from": "{{ctx.trigger.scheduled_time}}||-5m",
                    "to": "{{ctx.trigger.triggered_time}}"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition" : {
    "compare" : { "ctx.payload.hits.total" : { "gt" : 0 }}
  },
  "actions" : {
    "email_admin" : {
      "email" : {
        "to" : "admin@domain.host.com",
        "subject" : "404 recently encountered"
      }
    }
  }
}
'

Move the shards:

curl -X POST "localhost:9200/_cluster/reroute?metric=none&pretty" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {                      
      "move": {
        "index": ".watches", "shard": 0,
        "from_node": "node-2", "to_node": "node-3"
      }
    }
  ]
}
'

Check watcher history:

curl "localhost:9200/.watcher-history-*/_search?size=100" -H 'Content-Type: application/json' -d'
{
  "sort": [
    { "@timestamp": "desc"}
  ]
}
' | python3 -mjson.tool | grep node

Also kind of interesting -- several times after the problem began, a cluster state change came through and WatcherLifecycleService::pauseExecution was called. If it had gone on to call watcherService.pauseExecution(reason), it would most likely have stopped the errant watch on node-2. But since there are no allocation ids left at that point, clearAllocationIds() returns false, so watcherService.pauseExecution(reason) is never called.
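
As a rough, self-contained sketch of that guard (a paraphrase of the behavior described above, not the actual WatcherLifecycleService code; the allocation-id field and class name here are stand-ins):

import java.util.ArrayList;
import java.util.List;

// Paraphrase of the guard described above: the pause is only propagated if there
// were still allocation ids to clear. Once the ids are empty (as after the first
// pause), later cluster state changes never reach watcherService.pauseExecution(reason).
public class PauseGuardSketch {

    private static final List<String> previousShardAllocationIds = new ArrayList<>();

    static boolean clearAllocationIds() {
        boolean hadIds = previousShardAllocationIds.isEmpty() == false;
        previousShardAllocationIds.clear();
        return hadIds;
    }

    static void pauseExecution(String reason) {
        if (clearAllocationIds()) {
            System.out.println("watcherService.pauseExecution(" + reason + ")");
        } else {
            System.out.println("no allocation ids to clear, pause not propagated");
        }
    }

    public static void main(String[] args) {
        previousShardAllocationIds.add("alloc-id-1");
        pauseExecution("no local watcher shards found"); // propagated
        pauseExecution("cluster state update");          // skipped: ids already cleared
    }
}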

So it seems to be a race condition in the pause logic. Something seems to be calling TickerScheduleTriggerEngine::add after TickerScheduleTriggerEngine::pauseExecution runs.

masseyke commented 2 months ago

Here's how the race condition works. Say watcher is paused on this machine because the .watches shards have been allocated off of it (this happens more frequently due to the new shard allocator in 8.6.0). TickerScheduleTriggerEngine.pauseExecution() calls schedules.clear(). But while the pause was running, a regularly scheduled execution of the watch was already in flight. After the pause completes, we get that execution's result back and, thanks to WatcherIndexingListener, we make sure to add the watch to the trigger service. TickerScheduleTriggerEngine.add sees that the watch is not in schedules, so it helpfully puts it back in there for us. So now the watch is running on this machine and on whatever machine the shard got allocated to!
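
A minimal, self-contained sketch of that interleaving (the map and the two threads here are simplified stand-ins for the real trigger engine, pause path, and indexing listener, which are much more involved):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class PauseRaceSketch {

    // stand-in for TickerScheduleTriggerEngine's schedules map
    private static final Map<String, String> schedules = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        schedules.put("my-watch", "0 0/1 * * * ?"); // the watch is scheduled locally

        CountDownLatch cleared = new CountDownLatch(1);

        // the pause path: the .watches shard has moved away, so drop all local schedules
        Thread pause = new Thread(() -> {
            schedules.clear();
            System.out.println("paused, schedules = " + schedules); // {}
            cleared.countDown();
        });

        // the indexing listener: an execution that started before the pause finishes,
        // and the watch gets re-registered with the trigger engine
        Thread reAdd = new Thread(() -> {
            try {
                cleared.await(); // force the problematic ordering: add() lands after the pause
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            schedules.putIfAbsent("my-watch", "0 0/1 * * * ?");
        });

        pause.start();
        reAdd.start();
        pause.join();
        reAdd.join();

        // non-empty again: this node keeps firing the watch even though the shard
        // (and a second copy of the schedule) now lives on another node
        System.out.println("after the race, schedules = " + schedules);
    }
}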

masseyke commented 1 month ago

I have tried to automate this race condition into a test, with no luck so far.

First I tried writing a test based on AbstractWatcherIntegrationTestCase, before realizing that the class I believe has the race condition (TickerScheduleTriggerEngine) is mocked out in those tests with ScheduleTriggerEngineMock.

Then I briefly tried writing an ESIntegTestCase, before realizing that there was probably a reason so much is mocked in AbstractWatcherIntegrationTestCase -- I was unable to get the server to start up with Watcher running TickerScheduleTriggerEngine.

So I wrote an ESRestTestCase, based on SmokeTestWatcherTestSuiteIT. I brought up a 5-node cluster and created a watch that executes every 10ms (which required a bit of a hack, since we normally prevent intervals that short). I then relocate one of the two .watches shards constantly -- as soon as I see that a relocation has finished, I move the shard again. I do this a few dozen times, all while the watch keeps firing. Then I look at the watch history and wait until the most recent 10 entries have all run on the same node (if we hit the race condition, I'd expect the watch to be running on 2 or more nodes). Unfortunately the test succeeds every single time. I also artificially slowed down the watch, hoping to make the race condition more likely. No luck.
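
For reference, the shard-shuffling part of that test boils down to something like the following, written here against the low-level REST client so it can run standalone (the node names, port, iteration count, and the _cat/shards polling approach are placeholders, not the actual test code):

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

// Sketch of the shard-shuffling loop: bounce a .watches shard between two nodes
// while the watch keeps firing, then inspect .watcher-history-* separately to see
// whether more than one node executed it.
public class WatchesShardShuffler {

    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            String[] nodes = {"node-2", "node-3"};
            for (int i = 0; i < 30; i++) {
                String from = nodes[i % 2];
                String to = nodes[(i + 1) % 2];
                Request reroute = new Request("POST", "/_cluster/reroute");
                reroute.addParameter("metric", "none");
                reroute.setJsonEntity("{ \"commands\": [ { \"move\": { \"index\": \".watches\", \"shard\": 0, "
                    + "\"from_node\": \"" + from + "\", \"to_node\": \"" + to + "\" } } ] }");
                client.performRequest(reroute);
                waitForRelocationToFinish(client);
            }
        }
    }

    // Poll _cat/shards until the .watches shards are no longer relocating.
    private static void waitForRelocationToFinish(RestClient client) throws Exception {
        while (true) {
            Response response = client.performRequest(new Request("GET", "/_cat/shards/.watches"));
            String body = EntityUtils.toString(response.getEntity());
            if (body.contains("RELOCATING") == false) {
                return;
            }
            Thread.sleep(100);
        }
    }
}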