consulthys opened 7 months ago
Pinging @elastic/es-distributed (Team:Distributed)
We discussed this as a team and agreed that we are aligned on two potential improvements.
Currently, as outlined in the ticket, we only have a single counter keeping track of the number of failures. We would support adding additional counters for specific exceptions that make sense to track (e.g. mapping errors, version conflicts, etc.). Our expectation is that these counters would be node-level. It is probably challenging to have shard-level counters without substantially increasing the amount of information collected.
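As a point of reference, the existing node-level counter is exposed through the Node Stats API, and any new per-exception counters would presumably sit alongside it (the request below is the standard API; the exact placement of new fields is still to be decided):

```
GET /_nodes/stats/indices/indexing
```

The aggregate failure count currently appears under each node's `indices.indexing.index_failed` field in that response.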
Currently there is no dedicated logger that can be enabled to surface indexing errors. It is possible to get them by enabling trace logging on certain loggers; however, these loggers will also surface additional unrelated logs. We think it would be useful to add a specific logger which only logs indexing errors, for users who need to temporarily diagnose issues. The logger would include shard-level labels to help diagnose specific shards.
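As a rough sketch of the workflow this would enable (the logger name below is just a placeholder, not an actual or proposed logger), log levels can already be changed at runtime through the cluster settings API, so a dedicated indexing-failure logger could be toggled on temporarily in the same way:

```
# Hypothetical logger name, for illustration only
PUT /_cluster/settings
{
  "persistent": {
    "logger.org.elasticsearch.index.indexing.failures": "DEBUG"
  }
}
```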
We expect that, as various Elasticsearch teams/engineers encounter scenarios where tracking this information would be helpful, it will be added incrementally.
Description
The `indexing.index_failed` counter returned from the Index Stats and Node Stats APIs currently counts how many indexing failures are happening at the index level and the node level, respectively. A big shortcoming of this counter is that it lumps together many different "kinds" of indexing failures, some of which are important, and some of which aren't.

Digging through forums and discussion boards, we found only vague indications of what goes into this counter, namely "version conflicts and Lucene analysis chain errors". Lucene analysis chain errors can be very important issues that the user might want to know about, since they relate to document-level problems raised by Lucene. However, version conflicts might come in at a much higher rate and may not be relevant at all, as we'll see in the two cases below; when that's the case, they tend to swamp the Lucene ones.
What is needed?
It would be great if another counter took care of counting version conflicts and let the `indexing.index_failed` counter really be about only those "Lucene analysis chain errors". Ideally, if the latter can be further broken down into specific error categories, even better, but that's another effort. For now, we'd like to just focus on splitting version conflicts from the rest. So instead of getting this:
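(For illustration, this is roughly what the relevant `indexing` section of the Index Stats API response looks like today; the response is abridged and the values are made up.)

```json
{
  "indexing": {
    "index_total": 1846290,
    "index_time_in_millis": 723451,
    "index_current": 2,
    "index_failed": 53284,
    "delete_total": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0
  }
}
```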
We could get this:
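(Again just a sketch, not a committed API shape; the new field name below is purely illustrative.)

```json
{
  "indexing": {
    "index_total": 1846290,
    "index_time_in_millis": 723451,
    "index_current": 2,
    "index_failed": 53284,
    "index_failed_due_to_version_conflict": 53163,
    "delete_total": 0,
    "noop_update_total": 0,
    "is_throttled": false,
    "throttle_time_in_millis": 0
  }
}
```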
As a bonus, we could even distinguish between general version conflicts and version conflicts on create operations, since they are somewhat different in nature. Adding those two new fields would be backward compatible, as the `index_failed` field would be left unchanged.

Why is it needed?
There are numerous cases where this could be useful, but the two main ones that would greatly benefit from this change are explained below.
Case 1: Put if absent (op_type = create)
One use case where it's perfectly ok to have version conflicts is when indexing data with `op_type=create` in order to avoid indexing a document that already exists. When running such a workload, version conflicts pile up as a constant stream of indexing failures. Does the user have real Lucene analysis chain errors in there? At that indexing rate, there's no way to know, unfortunately.
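To make this concrete, here is a minimal sketch of the pattern (index name and documents are made up):

```
PUT /my-index/_create/1
{ "message": "first attempt" }

PUT /my-index/_create/1
{ "message": "second attempt" }
```

The second request is rejected with a 409 `version_conflict_engine_exception`. That rejection is a perfectly normal outcome of the put-if-absent pattern, yet it is counted in `index_failed` just like a genuine document-level error.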
Case 2: Plain usage of Stack Monitoring
Another case where this problem is even more painful is when monitoring the cluster with Metricbeat. When doing so, the monitoring indexes are constantly suffering from indexing failures (link to issue) at a steady rate.
It took a while to discover what those indexing errors were, but the gist of it is that the `elasticsearch` module in Metricbeat sends `elasticsearch.shard` documents with an ID that is built using the cluster `state_uuid` value, which doesn't change at the same cadence as the Metricbeat collection rounds, and that ultimately causes version conflicts. There's nothing important about this issue, as it makes sense to not index the same data twice (i.e. the cluster state hasn't changed); however, it's still counted as a noisy indexing failure. Metricbeat should probably only send `elasticsearch.shard` documents if the `state_uuid` value has changed, but that's another issue.

Some counter arguments
Indexing failures are client-side issues, let the user deal with them (i.e. enable logs and let them figure it out)
Agreed, but the problem here is that "client" encompasses a very wide range of different types of client components. It's not only about client-side applications that are under the control of the user, but also about any of the Beats, Logstash, Elastic Agent, Kibana, etc. For ESS users, those "clients" even run in ES Cloud, where the user has no leverage at all.
Very few users complain about this, so why bother?
Fair enough, though that might also be because the `index_failed` metric is not shown anywhere in Stack Monitoring charts or anywhere else in Kibana. So unless the user looks at `_cat/indices` and specifically requests the `iif` field, she would never be aware of it.
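For completeness, this is roughly how that column has to be requested explicitly today (`iif` being the cat-API shorthand for `indexing.index_failed`; `GET _cat/indices?help` lists the exact headers available in a given version):

```
GET _cat/indices?v&h=index,docs.count,indexing.index_failed
```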