
Unassigned shards #7363

Closed arctica closed 10 years ago

arctica commented 10 years ago

I have set up an ELK system to monitor some web server access logs and other stuff. I started out with 1 web server running logstash-forwarder and 1 server running logstash and elasticsearch. Then I added 2 more web servers, also running logstash-forwarder. All seemed fine.

But as soon as I added another server with logstash + elasticsearch, I started to see unassigned shards, specifically replicas. I have the default of 5 shards and 1 replica.

Every day a few replica shards don't seem to get assigned to the other node. Logstash is version 1.4.2-1-2c0f5a1 and Elasticsearch is version 1.3.1; both were installed using the official Debian repositories from elasticsearch.org.

I was not able to spot any errors or other messages related to shard failures in the log on any of the ES nodes. Unassigned shards not only happen for logstash indices but also for marvel ones.

Right now the cluster is operational, but as soon as I get a hardware failure on one of the nodes I'll encounter data loss.

I tried setting replicas to 0 and the cluster went green. Setting it back to 1 leaves the exact same shards unassigned once again and the cluster stays yellow. I use the HQ plugin to monitor cluster state.
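
(For reference, toggling the replica count like that can be done through the index settings API; a minimal example, assuming the default port and that the change should apply to all indices, with 0 or 1 as the value:)

curl -XPUT "http://localhost:9200/_all/_settings" -d'
{
  "index": {
    "number_of_replicas": 1
  }
}'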

What could be the cause of the unassigned shards and how can I prevent this from happening? Via Google I have found quite a few people with the same issue and various approaches to solving it, but nothing conclusive. Neither restarting any of the nodes nor bringing the whole cluster down and back up helped.

clintongormley commented 10 years ago

What settings do you have in your config/elasticsearch.yml files, and what is returned by:

curl localhost:9200/_cluster/settings
curl localhost:9200/_settings
arctica commented 10 years ago

The config/elasticsearch.yml file can be seen in this gist. The only changes were the cluster name, node name, discovery hosts, and the last two lines for tuning.

# curl localhost:9200/_cluster/settings
{"persistent":{},"transient":{}}

# curl localhost:9200/_settings (gist as it is a bit longer)

An interesting thing I noticed is that shards 0-3 have their primary on server A and a replica on server B, while shard 4 has its primary on server B but no replica on server A. This seems to be the case for all logstash indices created recently.
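
(The shard placement described above can be read off the cat shards output; assuming the default port, filtering for a single index keeps it short:)

curl -XGET "http://localhost:9200/_cat/shards?v" | grep logstash-2014.08.24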

clintongormley commented 10 years ago

Hi @arctica

I don't see any bad settings. The reason the replicas are not assigned should be logged, and we need to see those logs in order to debug further. You can also increase the recovery logging with this request:

PUT /_cluster/settings
{
  "transient": {
    "logger.indices.recovery": "DEBUG"
  }
}
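
Once you have what you need, the same endpoint can be used to turn the logging back down by setting the logger back to the default INFO level:

PUT /_cluster/settings
{
  "transient": {
    "logger.indices.recovery": "INFO"
  }
}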
arctica commented 10 years ago

I changed the setting as you described and restarted server B to trigger recovery. I could not find any messages about failed recoveries. I saw a lot of messages like

... [logstash-2014.08.24][2] recovery completed from ...

But none for [4], which is unassigned. For [0] to [3] I also saw this message on server A:

... delaying recovery of [logstash-2014.08.24][2] ...

In theory I should see the delaying message for the shard in question on server B and the completed message on server A, but there is just nothing :(

A bit off topic, but once I brought server B back up I saw a lot of network and disk activity during recovery. Is it normal that B fetches a lot of (all?) the data from server A? The recovery also took a bit longer than expected (15-20 min?).

clintongormley commented 10 years ago

If your shards on B have diverged from A (which happens over time) then they need to copy all segments across, which is normal. The delay happening on the source node is also normal, as too many recoveries happening at the same time would kill I/O.
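
The throttles behind that delay are the usual recovery limits; if they ever need tuning they can be changed dynamically, for example (the values here are only illustrative):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2,
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}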

You can check on the progress of recovery with:

GET /_cat/recovery?v

So have your nodes recovered now, or what is happening?

arctica commented 10 years ago

Hi Clinton,

The cluster went into the same state (yellow) as before, with a bunch of replicas unassigned. Here is the output of /_cat/recovery?v:

logstash-2014.08.24 0     37510 replica done  A B n/a        n/a      185   97.8%         378937306 100.0%        
logstash-2014.08.24 0     167   gateway done  A A n/a        n/a      0     0.0%          0         0.0%          
logstash-2014.08.24 1     27762 replica done  A B n/a        n/a      173   100.0%        379386605 100.0%        
logstash-2014.08.24 1     159   gateway done  A A n/a        n/a      0     0.0%          0         0.0%          
logstash-2014.08.24 2     22048 replica done  A B n/a        n/a      173   99.4%         377834450 100.0%        
logstash-2014.08.24 2     171   gateway done  A A n/a        n/a      0     0.0%          0         0.0%          
logstash-2014.08.24 3     40162 replica done  A B n/a        n/a      182   100.0%        382053862 100.0%        
logstash-2014.08.24 3     164   gateway done  A A n/a        n/a      0     0.0%          0         0.0%          
logstash-2014.08.24 4     235   gateway done  B B n/a        n/a      187   100.0%        376405639 100.0%         

Note how there are 2 entries for shards 0, 1, 2 and 3, but only one for shard 4. That is consistent with the absence of any log entries for that replica; it's as if the replica does not exist at all.

Since logstash creates an index every day, shouldn't old shards stay in sync? They are not modified once the day is over, and transferring all that data around when restarting a node seems a bit wasteful, especially once I reach hundreds of gigabytes of logs. Even with gigabit ethernet that's going to take a good while.

clintongormley commented 10 years ago

Since logstash creates an index every day, shouldn't old shards stay in sync? They are not modified once the day is over, and transferring all that data around when restarting a node seems a bit wasteful, especially once I reach hundreds of gigabytes of logs. Even with gigabit ethernet that's going to take a good while.

Documents are indexed on the primary and on the replica independently, and refreshes happen at different times, which is why segments diverge. #6069 is about improving that.
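
You can see that divergence for yourself by comparing the segment lists of the primary and the replica of one of the fully replicated shards, e.g. via the segments API on the index in question:

curl -XGET "http://localhost:9200/logstash-2014.08.24/_segments?pretty"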

Note how there are 2 entries for shards 0, 1, 2 and 3, but only one for shard 4. That is consistent with the absence of any log entries for that replica; it's as if the replica does not exist at all.

That is very weird! Could you also provide me the output of this command:

curl -XGET "http://localhost:9200/_cluster/state/routing_table,routing_nodes/logstash-2014.08.24"
arctica commented 10 years ago

Great to hear that recovery is being made faster. It is not a big concern for me right now, but it would probably cause issues later on as our data grows in size.

Here is the output of the command you asked for, in a gist.

Shard 4 has a working primary, but the replica has null values for both "node" as well as "relocating_node". To me it seems as if that replica never got assigned in the first place, and even recovery does not attempt to assign it.

clintongormley commented 10 years ago

Ah OK - so it is just unassigned, not missing. What's the output of this command:

 curl -XGET "http://localhost:9200/_cluster/health/?level=indices"
arctica commented 10 years ago

Please find the output in this gist. Indeed the replica is unassigned, not missing. You can also clearly see that this is always the case for one shard per logstash index/day.

clintongormley commented 10 years ago

@arctica another question - how much free space do you have on your disks?

arctica commented 10 years ago

26GB of data so far, with 1.7TB still free on each node, or roughly 2% usage. There is also plenty of RAM free and no swapping going on.
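
(In case it helps, the per-node disk usage as Elasticsearch sees it can be pulled from the cat allocation endpoint, assuming the default port:)

curl -XGET "http://localhost:9200/_cat/allocation?v"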

clintongormley commented 10 years ago

OK, could you try this command and post the output:

curl -XPOST "http://localhost:9200/_cluster/reroute?explain" -d'
{
  "commands": [
    {
      "allocate": {
        "index": "logstash-2014.08.24",
        "shard": 4,
        "node": "N_HxeTu8StmMR_6sTk2faQ"
      }
    }
  ]
}'
arctica commented 10 years ago

Clinton: I think you pointed at the solution, great! Here is the output as a gist (warning: it's quite big).

Basically, node A is running ES version 1.3.1, and by the time I set up node B a week or so later, version 1.3.2 had been released, so B ended up on 1.3.2. For whatever reason, shard 4 always ends up with its primary on B while the others have their primaries on A.

Replicating from 1.3.1 (A) -> 1.3.2 (B) works fine while the other way does not. I guess that makes sense.

I will bring the cluster down, upgrade A to 1.3.2 and start the cluster back up. In theory that should fix the problem. I'll report back once that is finished.
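
(For anyone doing a similar full-cluster restart: a common precaution is to disable shard allocation before taking the nodes down and re-enable it once they have rejoined, for example:)

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

and afterwards:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}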

arctica commented 10 years ago

Sorry but I think the last gist got cut off, probably because it was too big. The relevant part is:

  {"explanations": [
    {
      "command": "allocate",
      "parameters": {
        "index": "logstash-2014.08.24",
        "shard": 4,
        "node": "N_HxeTu8StmMR_6sTk2faQ",
        "allow_primary": false
      },
      "decisions": [
        {
          "decider": "same_shard",
          "decision": "YES",
          "explanation": "shard is not allocated to same node or host"
        },
        {
          "decider": "filter",
          "decision": "YES",
          "explanation": "node passes include\/exclude\/require filters"
        },
        {
          "decider": "replica_after_primary_active",
          "decision": "YES",
          "explanation": "primary is already active"
        },
        {
          "decider": "throttling",
          "decision": "YES",
          "explanation": "below shard recovery limit of [2]"
        },
        {
          "decider": "enable",
          "decision": "YES",
          "explanation": "allocation disabling is ignored"
        },
        {
          "decider": "disable",
          "decision": "YES",
          "explanation": "allocation disabling is ignored"
        },
        {
          "decider": "awareness",
          "decision": "YES",
          "explanation": "no allocation awareness enabled"
        },
        {
          "decider": "shards_limit",
          "decision": "YES",
          "explanation": "total shard limit disabled: [-1] <= 0"
        },
        {
          "decider": "node_version",
          "decision": "NO",
          "explanation": "target node version [1.3.1] is older than source node version [1.3.2]"
        },
        {
          "decider": "disk_threshold",
          "decision": "YES",
          "explanation": "disk usages unavailable"
        },
        {
          "decider": "snapshot_in_progress",
          "decision": "YES",
          "explanation": "shard not primary or relocation disabled"
        }
      ]
    }
  ]}
clintongormley commented 10 years ago

OK - so you've got mixed versions of ES by the looks of it. You need to upgrade all of your nodes to the same version.

    {
      "decider": "node_version",
      "decision": "NO",
      "explanation": "target node version [1.3.1] is older than source node version [1.3.2]"
    },
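
To confirm which version each node is running (assuming the default port; the version column of the cat nodes API should show it, otherwise the _nodes info API reports it per node):

curl -XGET "http://localhost:9200/_cat/nodes?v&h=name,version"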
arctica commented 10 years ago

Upgraded server A to 1.3.2, restarted cluster and we're back to green :)

Also recovery was way faster this time around. Maybe the slow recovery before was also caused by the version mismatch?

Anyway, I am glad the issue is resolved and would like to thank you, Clinton, for your help! Much appreciated.

I think the visibility of problems like this one needs to be improved, as I feel a lot of time could have been saved if the problem had been more obvious. Feel free to use this issue as a feature request, or we can close it and make a new one.