consulthys opened 7 months ago
Pinging @elastic/es-distributed (Team:Distributed)
We discussed this as a team and agreed that we are aligned on two potential improvements.
Currently, as outlined in the ticket, we only have a single counter keeping track of the number of failures. We would support adding additional counters for specific exceptions that make sense to track (e.g. mapping errors, version conflicts, etc.). Our expectation is that these counters would be node-level. It is probably challenging to have shard-level counters without substantially increasing the amount of information collected.
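As a point of reference, the existing node-level counter is exposed through the Node Stats API, and any new per-exception counters would presumably sit alongside it (the request below is the standard API; the exact placement of new fields is still to be decided):

```
GET /_nodes/stats/indices/indexing
```

The aggregate failure count currently appears under each node's `indices.indexing.index_failed` field in that response.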
Currently there is no dedicated logger that can be enabled to surface indexing errors. It is possible to get them by enabling trace logging on certain loggers; however, these loggers will also surface additional unrelated logs. We think it would be useful to add a specific logger which only logs indexing errors, for users who need to temporarily diagnose issues. The logger would include shard-level labels to help diagnose specific shards.
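As a rough sketch of the workflow this would enable (the logger name below is just a placeholder, not an actual or proposed logger), log levels can already be changed at runtime through the cluster settings API, so a dedicated indexing-failure logger could be toggled on temporarily in the same way:

```
# Hypothetical logger name, for illustration only
PUT /_cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.index.indexing.failures": "DEBUG"
  }
}
```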
We expect that, as various Elasticsearch teams/engineers encounter scenarios where tracking this information would be helpful, it will be added incrementally.
Description
The `indexing.index_failed` counter returned from the Index Stats and Node Stats APIs currently counts how many indexing failures are happening at the index level and the node level, respectively. A big shortcoming of this counter is that it lumps together many different "kinds" of indexing failures, some of which are important, and some of which aren't.

Digging through forums and discussion boards, we found only vague indications of what goes into this counter, namely "version conflicts and Lucene analysis chain errors". Lucene analysis chain errors can be very important issues that the user might want to know about, since they relate to document-level problems raised by Lucene. However, version conflicts might come in at a much higher rate and may not be relevant at all, as we'll see in the two cases below; when that's the case, they tend to swamp the Lucene ones.
What is needed?
It would be great if another counter took care of counting version conflicts and let the `indexing.index_failed` counter really be about only those "Lucene analysis chain errors". Ideally, if the latter can be further broken down into specific error categories, even better, but that's another effort. For now, we'd like to just focus on splitting version conflicts from the rest. So instead of getting this:
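(For illustration, this is roughly what the relevant `indexing` section of the Index Stats API response looks like today; the response is abridged and the values are made up.)

```json
{
  "indexing": {
    "index_total": 1846290,
    "index_time_in_millis": 723451,
    "index_current": 2,
    "index_failed": 53284,
    "delete_total": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0
  }
}
```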
We could get this:
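(Again just a sketch, not a committed API shape; the new field name below is purely illustrative.)

```json
{
  "indexing": {
    "index_total": 1846290,
    "index_time_in_millis": 723451,
    "index_current": 2,
    "index_failed": 53284,
    "index_failed_due_to_version_conflict": 53163,
    "delete_total": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0
  }
}
```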
As a bonus, we could even distinguish between general version conflicts and version conflicts on create operations, since they are somewhat different in nature. Adding those two new fields would be backward compatible, as the `index_failed` field would be left unchanged.

Why is it needed?
There are numerous cases where this could be useful, but the two main ones that would greatly benefit from this change are explained below.
Case 1: Put if absent (op_type = create)
One use case where it's perfectly ok to have version conflicts is when indexing data with `op_type=create` in order to avoid indexing a document that already exists. When running such a workload, version conflicts pile up as a constant stream of indexing failures. Does the user have real Lucene analysis chain errors in there? At that indexing rate, there's no way to know, unfortunately.
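To make this concrete, here is a minimal sketch of the pattern (index name and documents are made up):

```
PUT /my-index/_create/1
{ "message": "first attempt" }

PUT /my-index/_create/1
{ "message": "second attempt" }
```

The second request is rejected with a 409 `version_conflict_engine_exception`. That rejection is a perfectly normal outcome of the put-if-absent pattern, yet it is counted in `index_failed` just like a genuine document-level error.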
Case 2: Plain usage of Stack Monitoring
Another case where this problem is even more painful is when monitoring the cluster with Metricbeat. When doing so, the monitoring indexes are constantly suffering from indexing failures (link to issue) at a steady rate.
It took a while to discover what those indexing errors were, but the gist of it is that the `elasticsearch` module in Metricbeat sends `elasticsearch.shard` documents with an ID that is built using the cluster `state_uuid` value, which doesn't change at the same cadence as the Metricbeat collection rounds, and that ultimately causes version conflicts. There's nothing important about this issue, as it makes sense to not index the same data twice (i.e. the cluster state hasn't changed); however, it's still counted as a noisy indexing failure. Metricbeat should probably only send `elasticsearch.shard` documents if the `state_uuid` value has changed, but that's another issue.

Some counter arguments
Indexing failures are client-side issues, let the user deal with them (i.e. enable logs and let them figure it out)
Agreed, but the problem here is that "client" encompasses a very wide range of different types of client components. It's not only about client-side applications that are under the control of the user, but also about any of the Beats, Logstash, Elastic Agent, Kibana, etc. For ESS users, those "clients" even run in ES Cloud, where the user has no leverage at all.
Very few users complain about this, so why bother?
Fair enough, though that might also be because the `index_failed` metric is not shown anywhere in Stack Monitoring charts or anywhere else in Kibana. So unless the user looks at `_cat/indices` and specifically requests the `iif` field, she would never be aware of it.
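For completeness, this is roughly how that column has to be requested explicitly today (`iif` being the cat-API shorthand for `indexing.index_failed`; `GET _cat/indices?help` lists the exact headers available in a given version):

```
GET _cat/indices?v&h=index,docs.count,indexing.index_failed
```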