elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

AutoFollowCoordinator keeps running after master election #90159

Open idegtiarenko opened 2 years ago

idegtiarenko commented 2 years ago

Elasticsearch Version

all versions with auto follow feature

Problem Description

Master failover is not handled in AutoFollowCoordinator. After a new master is elected, the old master keeps polling the leader cluster for new indices that match the auto-follow patterns.
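To illustrate the behavior this issue asks for, here is a minimal sketch of a coordinator that stops polling once the local node loses the master role. It is a simplified model written in Python, not the Elasticsearch implementation: the class name `AutoFollowCoordinatorSketch` and the methods `on_cluster_state_changed` and `poll_leader` are hypothetical and exist only for this example.

```python
class AutoFollowCoordinatorSketch:
    """Hypothetical model of a master-only poller; not the real Elasticsearch class."""

    def __init__(self):
        self.polling = False      # True only while the local node is the elected master
        self.polls_issued = 0     # number of poll requests sent to the leader cluster

    def on_cluster_state_changed(self, local_node_is_master: bool) -> None:
        """React to a cluster state update by starting or stopping polling."""
        if local_node_is_master and not self.polling:
            self.polling = True
        elif not local_node_is_master and self.polling:
            # The step the issue reports as missing: stop on master failover.
            self.polling = False

    def poll_leader(self) -> bool:
        """Issue one poll if allowed; return whether a poll was actually sent."""
        if not self.polling:
            return False
        self.polls_issued += 1
        return True


coordinator = AutoFollowCoordinatorSketch()
coordinator.on_cluster_state_changed(local_node_is_master=True)
coordinator.poll_leader()                      # sent: node is master
coordinator.on_cluster_state_changed(local_node_is_master=False)
coordinator.poll_leader()                      # skipped: node is no longer master
print(coordinator.polls_issued)
```

With the bug described here, the second `poll_leader()` call would still reach the leader cluster, because nothing clears the polling state when the node is demoted.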

Steps to Reproduce

Expected result

Actual result

Logs (if relevant)

Multiple repeated entries like the one below appear on a follower-cluster node that is no longer the elected master:

Error occured while cleaning followed leader indices
org.elasticsearch.cluster.NotMasterException: no longer master, failing [update_auto_follow_metadata]

The leader cluster then has multiple cluster state polling tasks running at the same time:

cluster:monitor/state                          mo5JrIs7Q9SXmV2gULkJ3Q:461449263 -                                transport  1663660577559 07:56:17  13.5s        10.46.88.208 instance-0000000001
cluster:monitor/state                          mxxRAHr_Tiik33tXfancsw:322295710 -                                transport  1663660578646 07:56:18  12.4s        10.46.88.207 instance-0000000000
cluster:monitor/state                          mo5JrIs7Q9SXmV2gULkJ3Q:461449305 mxxRAHr_Tiik33tXfancsw:322295710 transport  1663660578647 07:56:18  12.4s        10.46.88.208 instance-0000000001
cluster:monitor/state                          mo5JrIs7Q9SXmV2gULkJ3Q:461449306 -                                transport  1663660578716 07:56:18  12.3s        10.46.88.208 instance-0000000001
cluster:monitor/state                          mo5JrIs7Q9SXmV2gULkJ3Q:461449313 -                                transport  1663660579175 07:56:19  11.9s        10.46.88.208 instance-0000000001
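A rough way to spot this condition is to count the top-level `cluster:monitor/state` tasks per node in a listing like the one above. The sketch below assumes the columns are action, task id, parent task id, type, start time, timestamp, running time, IP, and node name, in that order; the function name `count_top_level_tasks` is made up for this example.

```python
from collections import Counter

# The task listing from this report, with column padding collapsed.
TASKS = """\
cluster:monitor/state mo5JrIs7Q9SXmV2gULkJ3Q:461449263 - transport 1663660577559 07:56:17 13.5s 10.46.88.208 instance-0000000001
cluster:monitor/state mxxRAHr_Tiik33tXfancsw:322295710 - transport 1663660578646 07:56:18 12.4s 10.46.88.207 instance-0000000000
cluster:monitor/state mo5JrIs7Q9SXmV2gULkJ3Q:461449305 mxxRAHr_Tiik33tXfancsw:322295710 transport 1663660578647 07:56:18 12.4s 10.46.88.208 instance-0000000001
cluster:monitor/state mo5JrIs7Q9SXmV2gULkJ3Q:461449306 - transport 1663660578716 07:56:18 12.3s 10.46.88.208 instance-0000000001
cluster:monitor/state mo5JrIs7Q9SXmV2gULkJ3Q:461449313 - transport 1663660579175 07:56:19 11.9s 10.46.88.208 instance-0000000001
"""


def count_top_level_tasks(listing: str, action: str = "cluster:monitor/state") -> Counter:
    """Count tasks of the given action that have no parent task, grouped by node name."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 9 or fields[0] != action:
            continue  # skip blank, malformed, or unrelated lines
        parent_task_id, node_name = fields[2], fields[8]
        if parent_task_id == "-":  # "-" marks a task without a parent
            counts[node_name] += 1
    return counts


for node, count in count_top_level_tasks(TASKS).items():
    print(node, count)
```

For the listing in this report, the count is 3 for `instance-0000000001` and 1 for `instance-0000000000`. How many concurrent tasks are normal depends on the number of remote clusters and patterns, so any threshold is a judgment call.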

If a new matching index is created in the leader cluster, duplicate PutFollowAction requests are issued (one from the old master and one from the newly elected master). One of them fails and records the following failure under recent_auto_follow_errors in the GET /_ccr/stats response:

      {
        "leader_index": "leader_cluster:my-index-1",
        "timestamp": 1662034484876,
        "auto_follow_exception": {
          "type": "snapshot_restore_exception",
          "reason": "[_ccr_leader_cluster:_latest_/_latest_] cannot restore index [my-index-1] because an open index with same name already exists in the cluster. Either close or delete the existing index or restore the index under a different name by providing a rename pattern and replacement name"
        }
      }
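Entries of this kind can be picked out of the stats response programmatically. The sketch below assumes the caller has already extracted the `recent_auto_follow_errors` list from the response; the helper name `duplicate_follow_errors` is hypothetical, and matching on the "already exists" text in the reason is a heuristic, not a documented contract.

```python
def duplicate_follow_errors(recent_errors):
    """Return leader indices whose auto-follow failed because the index already exists."""
    hits = []
    for entry in recent_errors:
        exc = entry.get("auto_follow_exception", {})
        # Heuristic: this exception type plus "already exists" in the reason
        # matches the duplicate-follow failure shown in this report.
        if (exc.get("type") == "snapshot_restore_exception"
                and "already exists" in exc.get("reason", "")):
            hits.append(entry["leader_index"])
    return hits


# Sample entry taken from this report; the reason string is shortened here.
sample = [
    {
        "leader_index": "leader_cluster:my-index-1",
        "timestamp": 1662034484876,
        "auto_follow_exception": {
            "type": "snapshot_restore_exception",
            "reason": "cannot restore index [my-index-1] because an open index "
                      "with same name already exists in the cluster",
        },
    }
]

print(duplicate_follow_errors(sample))
```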

Workaround

Restart the old master node.

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)