elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Improve information about _stats index_failures #80802

Open GlenRSmith opened 2 years ago

GlenRSmith commented 2 years ago

As is, when exceptions occur in InternalEngine.index, they bubble up and, by way of a postIndex hook, two things happen:

- the operation is counted as a failure in the shard's indexing stats (the index_failed value reported by _stats), and
- the exception is logged, but only at trace level.
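
For reference, a quick sketch of where that counter is visible from the outside (my-index is just a placeholder index name):

GET /my-index/_stats/indexing?filter_path=indices.*.total.indexing.index_failed

The response contains nothing but the count of failed operations; it says nothing about why they failed.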

This makes it very challenging to investigate the cause of indexing failures. I think even the most surgical setting of trace logging (I think the relevant logger would be org.elasticsearch.index.shard.IndexShard?) in a production environment will result in pretty massive amounts of logging.
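
For what it's worth, a sketch of that surgical route: the logger can be enabled dynamically via the cluster settings API (and reverted by setting the value back to null), though, as noted above, the volume at TRACE is likely to be very large:

PUT /_cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.index.shard.IndexShard": "TRACE"
  }
}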

I've been able to figure out that, for example, an org.elasticsearch.index.engine.VersionConflictEngineException will contribute to this count when, e.g., trying to update a document with a lagging version number, but only by suspecting that to be the case and testing it in isolation.
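
As an illustration, a minimal way to reproduce that case with external versioning (index name and payload are placeholders; the second request comes back as a 409 version_conflict_engine_exception and shows up in index_failed):

PUT /my-index/_doc/1?version=5&version_type=external
{ "test_data": "first write" }

PUT /my-index/_doc/1?version=3&version_type=external
{ "test_data": "stale write" }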

I'm not really sure how, exactly, I would prefer to see this improved. One approach would be to add granularity to those stats; that seems like a fairly high bar to clear to justify, since it would be disruptive to the client-facing API. Changes to logging seem more palatable in that regard, and the lowest-hanging approach might be to promote the logger.trace call in IndexShard.index to, at a minimum, logger.debug. Another approach would be to add logging at each of the places where a root-cause failure occurs. Of course, that would fan out the changes needed and make it more difficult to be certain all relevant scenarios are covered.

(I would contend that, regardless of any effort to address the general request I'm making, the point-by-point places where relevant exceptions are raised should generate log entries, arguably even at warn level.)

elasticmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

GlenRSmith commented 2 years ago

Another error that does increment the count is

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "DocValuesField \"test_data\" is too large, must be <= 32766"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "DocValuesField \"test_data\" is too large, must be <= 32766"
  },
  "status": 400
}
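
That failure is the 32766-byte per-term limit cited in the error message. As an aside, one way to keep oversized values from failing the whole document is ignore_above in the mapping (field name and limit here are only illustrative):

PUT /my-index
{
  "mappings": {
    "properties": {
      "test_data": {
        "type": "keyword",
        "ignore_above": 8191
      }
    }
  }
}

With that, values longer than the limit are simply skipped for that field instead of rejecting the document, so they no longer land in index_failed.
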
DaveCTurner commented 2 years ago

The thinking here is that errors due to client requests don't really have a place in the server log - the folks with access to the server logs are often not the right people to address any problems with requests from clients, and failed indexing is rarely of concern to the server admin. It certainly doesn't warrant a WARN-level log entry for every such exception. Instead, we return the exception in the HTTP response and expect the client to handle it appropriately by logging it or retrying or raising an alert etc.

(edit: I see some value in logging these exceptions at DEBUG rather than TRACE, but no higher)
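
To make "handle it appropriately" a bit more concrete, here is a sketch from the client side (index name and document are placeholders). A _bulk request returns a per-item result for every operation:

POST /_bulk
{ "index": { "_index": "my-index", "_id": "1" } }
{ "test_data": "some value" }

The response carries a top-level "errors" flag and, for each failed item, an "error" object of the same shape as the illegal_argument_exception above; inspecting and logging those is the client's job rather than the server's.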

consulthys commented 1 year ago

I encountered a similar issue where all monitoring indices suffer from indexing failures (see the last column in the screenshot below), while none of the "business" indices has this problem. So here the "client" is Metricbeat monitoring the ES nodes, but it's not clear from the logs why this is happening.

GET _cat/shards?v&s=iif:desc&h=index,shard,docs,id,iif

[screenshot showing the index_failed column omitted]

DaveCTurner commented 5 months ago

Sorry for the delay in responding here @consulthys. We wouldn't expect Elasticsearch to report these things in its logs, but they should be reported by Metricbeat (if they're not benign anyway -- which I hope they are, given that they seem to happen millions of times per day).

consulthys commented 5 months ago

Thanks @DaveCTurner! Following our investigation, we've filed a new enhancement request for Elasticsearch to add a separate counter for version conflicts, as well as another enhancement request for Metricbeat that fixes the way elasticsearch.shard documents are sent to ES.