elastic / elasticsearch-py

Official Python client for Elasticsearch
https://ela.st/es-python
Apache License 2.0

log unsuccessful shards in failed scrolls #1261

Open tommyzli opened 4 years ago

tommyzli commented 4 years ago

Elasticsearch version (bin/elasticsearch --version): 7.6.1

elasticsearch-py version (elasticsearch.__versionstr__): 7.5.1

Description of the problem including expected versus actual behavior:

The scan() helper function only logs the number of successful vs failed shards. It would be helpful to also log the shards that failed, so I can quickly jump onto the node and grab the appropriate server logs. That data is a part of the response, but gets thrown away by the client.

Steps to reproduce: A call to scan(client, query, raise_on_error=True) fails and throws ScanError("Scroll request has only succeeded on 9 (+0 skiped) shards out of 10.")

Proposed error: ScanError("Scroll request has only succeeded on 9 (+0 skipped) shards out of 10. First failure: node 'foo', shard 'bar', reason 'reason'")
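For illustration, a rough sketch (not the library's actual code) of how the failure details could be pulled out of the "_shards" section of a scroll response and appended to the message, assuming the "failures" structure shown in the example responses below; format_scan_error is a hypothetical helper name.

def format_scan_error(shards):
    # shards is the "_shards" dict from a raw scroll response
    msg = "Scroll request has only succeeded on %d (+%d skipped) shards out of %d." % (
        shards["successful"],
        shards.get("skipped", 0),
        shards["total"],
    )
    failures = shards.get("failures") or []
    if failures:
        first = failures[0]
        reason = first.get("reason", {})
        # e.g. "First failure: shard -1, index None, reason 'node [...] is not available'"
        msg += " First failure: shard %r, index %r, reason %r" % (
            first.get("shard"),
            first.get("index"),
            reason.get("reason"),
        )
    return msg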

bartier commented 4 years ago

@tommyzli Unfortunately I could not find any information about which shards/nodes were unsuccessful in the scroll API response. I may be forgetting something, but only the successful/total shard counts are present in the raw response.

The only case where I could see the scroll API return information about nodes that did not respond successfully is when the initial _scroll_id request succeeded on all shards (for example 3/3) and, while consuming the scroll API, shards became unavailable because a node was unreachable (2/3 primary shards available in the example below). Is that what you are referring to?

1) Request _scroll_id with all shards available (3/3 in example)

POST /twitter/_search?scroll=1m&pretty HTTP/1.1
{
    "size": 3,
    "query": {
        "match_all" : {}
    }
}
# Response 3/3 shards
{
  "_scroll_id" : "FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxRjQTM2SVhNQk1hMC15cHkyd2o4egAAAAAAAAAYFkZzZ2tJN2JrVEMtc1RUbGcxcWl2TmcUelRyNklYTUJQQUNTcE1WendnVXcAAAAAAAAA_BYzYXp1WXJ1LVRwV1JSd004dlV2YmNRFFlldjZJWE1CY3ZEeDdNb253aDR4AAAAAAAAAAgWdnNSSTJPSVpRd0dJbUxvN3RZX3I2QQ==",
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 25,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [...]
  }
}

2) Some shards become unavailable when consuming the _scroll_id

POST /_search/scroll?pretty HTTP/1.1
{
    "scroll" : "1m", 
    "scroll_id" : "FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxRjQTM2SVhNQk1hMC15cHkyd2o4egAAAAAAAAAYFkZzZ2tJN2JrVEMtc1RUbGcxcWl2TmcUelRyNklYTUJQQUNTcE1WendnVXcAAAAAAAAA_BYzYXp1WXJ1LVRwV1JSd004dlV2YmNRFFlldjZJWE1CY3ZEeDdNb253aDR4AAAAAAAAAAgWdnNSSTJPSVpRd0dJbUxvN3RZX3I2QQ==" 
}
# Response shows node unreachable, then 2/3 shards available
{
  "_scroll_id" : "FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxRjQTM2SVhNQk1hMC15cHkyd2o4egAAAAAAAAAYFkZzZ2tJN2JrVEMtc1RUbGcxcWl2TmcUelRyNklYTUJQQUNTcE1WendnVXcAAAAAAAAA_BYzYXp1WXJ1LVRwV1JSd004dlV2YmNRFFlldjZJWE1CY3ZEeDdNb253aDR4AAAAAAAAAAgWdnNSSTJPSVpRd0dJbUxvN3RZX3I2QQ==",
  "took" : 6,
  "timed_out" : false,
  "terminated_early" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 1,
    "failures" : [
      {
        "shard" : -1,
        "index" : null,
        "reason" : {
          "type" : "illegal_state_exception",
          "reason" : "node [FsgkI7bkTC-sTTlg1qivNg] is not available"
        }
      }
    ]
  },
  "hits" : {
    "total" : {
      "value" : 15,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [...]
  }
}
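For completeness, a rough Python-client equivalent of the two raw requests above, assuming an elasticsearch-py 7.x client, a local cluster at http://localhost:9200, and an index named "twitter"; it simply prints any shard failures reported on each scroll page instead of discarding them.

from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

# 1) Initial search request that opens the scroll context
resp = client.search(index="twitter", scroll="1m", size=3, body={"query": {"match_all": {}}})
scroll_id = resp["_scroll_id"]

while resp["hits"]["hits"]:
    shards = resp["_shards"]
    if shards["failed"]:
        # 2) Surface the "failures" list instead of throwing it away
        print("shard failures:", shards.get("failures", []))
    resp = client.scroll(scroll_id=scroll_id, scroll="1m")
    scroll_id = resp["_scroll_id"]

client.clear_scroll(scroll_id=scroll_id)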
tommyzli commented 4 years ago

@bartier yeah, the case I saw was that a shard failed after already scrolling through a few pages. I'm thinking the code should check if error messages were included in the response and log them if so.

Amirilw commented 3 years ago

Did this ever get resolved? I'm running into the same issue.

Amirilw commented 3 years ago

Ok, after debugging this issue for a few days, splitting shards, and adding nodes, we found out that the main issue was the JVM heap size.

It was using the default of 1 GB instead of 32 GB like the rest of the nodes.

When we first saw it: we started seeing the issue after new nodes joined the cluster with the same hardware spec and Elasticsearch config.

Debugging: the Python log didn't give us any useful information about the issue, just the error about the shards. Our monitoring system didn't report any problems because RAM consumption appeared to be in normal ranges, but after investigating we saw that RAM consumption was off for the new nodes (disk I/O, utilization, and CPU were as expected).

Cluster version: 7.8.0. Python client versions: 5.4.x / 7.8 / 7.13.

Solution:

Configured the heap size under the JVM options to 32 GB of RAM and reloaded the Elasticsearch service.
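For reference, a minimal sketch of the settings involved, assuming the default jvm.options layout shipped with Elasticsearch 7.x; exact paths and values depend on the installation.

# config/jvm.options (or a file under config/jvm.options.d/ on 7.7+)
# The shipped default is 1g; set min and max heap to the same value.
-Xms32g
-Xmx32g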