DataONEorg / DataONE_Operations

Operations documentation for DataONE infrastructure

Remove dead node from solr #16

Closed datadavev closed 11 months ago

datadavev commented 11 months ago

Despite removal of the UNM nodes and all associated configuration, a "gone" node entry remains in the solr cloud. This causes spurious error messages to appear in the solr logs and, although it does not appear to have other impacts, it should be cleaned up.

Listing the nodes can be done with the CLUSTERSTATUS command, e.g.:

http://localhost:8983/solr/admin/collections?wt=json&action=CLUSTERSTATUS

# Times out on production cn

or using zkcli:

/var/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd list

...
DATA:
       {"event_core":{
           "shards":{"shard1":{
               "range":"80000000-7fffffff",
               "state":"active",
               "replicas":{
                 "core_node1":{
                   "core":"event_core_shard1_replica2",
                   "base_url":"http://10.10.1.3:8983/solr",
                   "node_name":"10.10.1.3:8983_solr",
                   "state":"down"},
                 "core_node2":{
                   "core":"event_core_shard1_replica1",
                   "base_url":"http://207.71.230.213:8983/solr",
                   "node_name":"207.71.230.213:8983_solr",
                   "state":"active"},
                 "core_node3":{
                   "core":"event_core_shard1_replica3",
                   "base_url":"http://128.111.85.180:8983/solr",
                   "node_name":"128.111.85.180:8983_solr",
                   "state":"active",
                   "leader":"true"}}}},
           "replicationFactor":"3",
           "router":{"name":"compositeId"},
           "maxShardsPerNode":"1",
           "autoAddReplicas":"false"}}
   /collections/event_core/leader_initiated_recovery (1)
    /collections/event_core/leader_initiated_recovery/shard1 (0)
   /collections/event_core/leader_elect (1)
    /collections/event_core/leader_elect/shard1 (1)
     /collections/event_core/leader_elect/shard1/election (2)
      /collections/event_core/leader_elect/shard1/election/111090172149301248-core_node3-n_0000001320 (0)
      /collections/event_core/leader_elect/shard1/election/111090245812551680-core_node2-n_0000001321 (0)
  /collections/search_core (4)
  DATA:
      {"configName":"search_core"}
   /collections/search_core/leaders (1)
    /collections/search_core/leaders/shard1 (0)
    DATA:
        {
          "core":"search_core_shard1_replica1",
          "base_url":"http://128.111.85.180:8983/solr",
          "node_name":"128.111.85.180:8983_solr"}
   /collections/search_core/state.json (0)
   DATA:
       {"search_core":{
           "shards":{"shard1":{
               "range":"80000000-7fffffff",
               "state":"active",
               "replicas":{
                 "core_node1":{
                   "core":"search_core_shard1_replica3",
                   "base_url":"http://10.10.1.3:8983/solr",
                   "node_name":"10.10.1.3:8983_solr",
                   "state":"down"},
                 "core_node2":{
                   "core":"search_core_shard1_replica2",
                   "base_url":"http://207.71.230.213:8983/solr",
                   "node_name":"207.71.230.213:8983_solr",
                   "state":"active"},
                 "core_node3":{
                   "core":"search_core_shard1_replica1",
                   "base_url":"http://128.111.85.180:8983/solr",
                   "node_name":"128.111.85.180:8983_solr",
                   "state":"active",
                   "leader":"true"}}}},
           "replicationFactor":"3",
           "router":{"name":"compositeId"},
           "maxShardsPerNode":"1",
           "autoAddReplicas":"false"}}
   /collections/search_core/leader_initiated_recovery (1)
    /collections/search_core/leader_initiated_recovery/shard1 (0)
   /collections/search_core/leader_elect (1)
    /collections/search_core/leader_elect/shard1 (1)
     /collections/search_core/leader_elect/shard1/election (2)
      /collections/search_core/leader_elect/shard1/election/111090172149301248-core_node3-n_0000001260 (0)
      /collections/search_core/leader_elect/shard1/election/111090245812551680-core_node2-n_0000001261 (0)
 /clusterstate.json (0)
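
For scripted checks, the same replica states can be pulled from the CLUSTERSTATUS response in Python. A minimal sketch (not part of the original notes; assumes the requests library, solr answering on localhost:8983, and the usual CLUSTERSTATUS response layout):

import requests

# Query CLUSTERSTATUS and print the state of every replica so dead
# ("down"/"gone") entries stand out.
resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={"wt": "json", "action": "CLUSTERSTATUS"},
    timeout=30,
)
resp.raise_for_status()
for cname, coll in resp.json()["cluster"]["collections"].items():
    for sname, shard in coll["shards"].items():
        for rname, replica in shard["replicas"].items():
            print(cname, sname, rname, replica["node_name"], replica["state"])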

Removing a replica is done with the DELETEREPLICA command.

http://localhost:8983/solr/admin/collections?wt=json
&action=DELETEREPLICA
&collection=event_core
&shard=shard1
&replica=core_node1
&deleteInstanceDir=false
&deleteDataDir=false
&deleteIndex=false

This operation worked on stage (with parameter adjustment; there the gone node was node2), but it times out on production.
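
For reference, the same DELETEREPLICA request as a small Python sketch (illustration only; assumes the requests library, and the replica parameter must be adjusted to the dead core_node entry for the environment in question):

import requests

# Issue the DELETEREPLICA call shown above via the Collections API.
params = {
    "wt": "json",
    "action": "DELETEREPLICA",
    "collection": "event_core",
    "shard": "shard1",
    "replica": "core_node1",        # on stage the gone replica was a different core_node
    "deleteInstanceDir": "false",
    "deleteDataDir": "false",
    "deleteIndex": "false",
}
resp = requests.get("http://localhost:8983/solr/admin/collections",
                    params=params, timeout=120)
print(resp.status_code, resp.text)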

datadavev commented 11 months ago

One suggested fix for the CLUSTERSTATUS timeout is to completely shut down solr and zookeeper on all nodes, then restart.

Solr takes a couple of minutes to come back up after this process, so it needs to be scheduled for a convenient time on production.
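
A small polling loop (an assumption, not something from this issue) can confirm when solr is answering again after the restart:

import time
import requests

# Poll the Collections API LIST action until Solr responds again after
# the restart; assumes Solr on localhost:8983.
url = "http://localhost:8983/solr/admin/collections"
while True:
    try:
        resp = requests.get(url, params={"wt": "json", "action": "LIST"}, timeout=5)
        if resp.ok:
            print("solr is back, collections:", resp.json().get("collections"))
            break
    except requests.RequestException:
        pass
    time.sleep(10)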

datadavev commented 11 months ago

Was able to adjust the nodes using zkcli.sh as follows:

/var/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd getfile /collections/search_core/state.json search_core_state.json

Edit search_core_state.json to remove the dead replica entry, then:

/var/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd putfile /collections/search_core/state.json search_core_state.json

The same steps were repeated for the event core. This seems to have been successful: the cloud view in the admin UI is much more responsive and shows the correct configuration.
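
The equivalent edit could also be done programmatically with kazoo rather than the zkcli getfile/putfile round trip. A sketch only, using the znode paths and replica names from the listing above:

import json
import kazoo.client

# Sketch: remove the dead replica entry from a collection's state.json
# directly in ZooKeeper (same effect as the getfile/edit/putfile steps).
zk = kazoo.client.KazooClient(hosts="localhost:2181")
zk.start()
try:
    path = "/collections/search_core/state.json"
    data, _stat = zk.get(path)
    state = json.loads(data)
    # core_node1 is the "gone" UNM replica in the listing above.
    state["search_core"]["shards"]["shard1"]["replicas"].pop("core_node1", None)
    zk.set(path, json.dumps(state).encode("utf-8"))
finally:
    zk.stop()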

datadavev commented 11 months ago

The CLUSTERSTATUS timeout issue still remains. The likely procedure for repairing it is:

  1. Shut down solr on both ucsb and orc
  2. Delete all the entries under /overseer/collection-queue-work/ in zk
  3. Start up solr

A small Python script for deleting the zk entries:

import kazoo.client

def main():
    # Connect to the local ZooKeeper and walk the queued collection-API
    # work items under /overseer/collection-queue-work/.
    host = "127.0.0.1:2181"
    root = "/overseer/collection-queue-work/"
    zk = kazoo.client.KazooClient(hosts=host)
    zk.start()
    try:
        for key in zk.get_children(root):
            if key.startswith("qn-"):
                _path = f"{root}{key}"
                print(f"deleting {_path}")
                # Uncomment to actually delete the queue entry:
                #zk.delete(_path)
    finally:
        zk.stop()

if __name__ == "__main__":
    main()

datadavev commented 11 months ago

After purging the pending CLUSTERSTATUS requests from zk and restarting solr and zk on each node, normal operations were restored. It was necessary to kill the zk process on orc, as it was unresponsive to shutdown requests.