An example of running it manually a few minutes ago, after an outage reconciled itself with only one monit restart, showed that 4 of the non-restarted nodes (those not mentioned in `alreadyLeaders`) had lost their leader status. This is an issue because of the extra future load placed on the already under-resourced Solr nodes that took the leader status from those nodes:
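For reference, the manual run uses the Solr Collections API `REBALANCELEADERS` action. The exact command is not in the captured output; a call along these lines, where the host, collection name, and throttling parameters are placeholders, produces the response saved below:

```sh
# Hypothetical manual invocation; "mycollection", the host, and the
# maxAtOnce/maxWaitSeconds values are placeholders, not taken from this deployment.
curl -s "http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=mycollection&maxAtOnce=1&maxWaitSeconds=60&wt=json" \
  -o /tmp/rebalanceleaders-output.json
```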
```
# cat /tmp/rebalanceleaders-output.json
{
  "responseHeader":{
    "status":0,
    "QTime":433},
  "alreadyLeaders":[
    "core_node3",[
      "status","success",
      "msg","Already leader",
      "shard","shard3",
      "nodeName","aws-sc3b.ala:8983_solr"],
    "core_node5",[
      "status","success",
      "msg","Already leader",
      "shard","shard5",
      "nodeName","aws-sc5b.ala:8983_solr"],
    "core_node6",[
      "status","success",
      "msg","Already leader",
      "shard","shard6",
      "nodeName","aws-sc6b.ala:8983_solr"],
    "core_node8",[
      "status","success",
      "msg","Already leader",
      "shard","shard8",
      "nodeName","aws-sc8b.ala:8983_solr"]]}
```
The Solr cloud becomes unbalanced even without nodes being restarted by monit. Another mechanism, such as cron, is needed to call the Solr rebalance script regularly so that leader responsibilities, and the load that comes with them, stay evenly spread across all nodes.
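A minimal sketch of such a cron job, where the schedule, host, collection name, and output path are illustrative placeholders rather than values from this deployment:

```sh
# Hypothetical /etc/cron.d/solr-rebalance entry: call REBALANCELEADERS hourly as
# the solr user and keep the latest response for inspection.
0 * * * * solr curl -s "http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=mycollection&maxAtOnce=1&maxWaitSeconds=60&wt=json" -o /tmp/rebalanceleaders-output.json
```

In practice the cron entry could just as well invoke the existing rebalance script that produced the output above, so any logging or alerting already wrapped around it is preserved.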