_ignored meta field index stats

flash1293 commented 6 months ago

Description

The recently introduced _ignored meta field (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ignored-field.html) is helpful to detect issues with data ingestion.

However, it requires read permissions and potentially expensive search queries to get basic statistics.

Adding information about _ignored field usage to the index stats would allow to monitor it in a cheap way.

The API could look like this:

GET my-index/_stats

{
  "_all": {
    "total": {
      "ignored_fields": {   
        "degraded_docs": 123
     }
  }
}

Implementation notes: https://github.com/elastic/elasticsearch/issues/108092#issuecomment-2116045095

elasticsearchmachine commented 6 months ago

Pinging @elastic/es-search (Team:Search)

felixbarny commented 6 months ago

it requires read permissions

Could you expand on why this is problematic?

potentially expensive search queries to get basic statistics

@martijnvg or @javanna could you help us understand how expensive it is to do an exists query on the _ignored field (which now has doc_values)? Are there optimizations/short-circuits when an entire index is matched?

Check how much space is used for storing it

The field usage API may be useful for that.

Are you interested in the number of ignored fields for an entire data stream or index? Or do you want the ability to filter based on a time range?

Is it for internal telemetry purposes or to show this information to users? One option would be to just track a counter metric via APM. That won't show you the total number/ratio of _ignored docs but you could plot that number over time.

If you want to show the ratio of ignored docs in the dataset quality page, which supports selecting the time range, maybe just doing a simple exists query on the _ignored field is your best bet.

flash1293 commented 6 months ago

Could you expand on why this is problematic?

This is very specific about the potential use of collecting telemetry on this from Kibana - giving full read access to all indices to all Kibana background tasks is problematic as it violates the least-privilege concept.

Are you interested in the number of ignored fields for an entire data stream or index? Or do you want the ability to filter based on a time range?

@salvatore-campagna brought up the storage angle, that's why I opened this issue - he can comment better on it.

Is it for internal telemetry purposes or to show this information to users?

My use case is internal telemetry purposes, but after talking to @salvatore-campagna we thought that there might be other potential use cases.

One option would be to just track a counter metric via APM. That won't show you the total number/ratio of _ignored docs but you could plot that number over time.

You mean it won't show the total number because document deletes/updates are not accounted for properly?

Tracking a counter metric sounds like it would solve the telemetry use case, as this is less about perfect fidelity and more about being able to track the rough number over time as a success metric.

felixbarny commented 6 months ago

You mean it won't show the total number because document deletes/updates are not accounted for properly?

When tracking a counter, the retention of the data itself and the metrics are decoupled. You may retain the metrics for a month but the logs are only retained for a week. I don't think this is necessarily an issue but it means you can't answer questions like how many _ignored fields are set in this particular data stream right now. But you can still track the rate of ingested documents that have ignored fields over time. And I think that's the relevant thing.

flash1293 commented 6 months ago

@felixbarny If we do this, we could also have two counters (one for healthy docs and one for degraded docs) to get a ratio for "health of incoming docs", right?

felixbarny commented 6 months ago

Yes, and we should probably also have counters for documents that enter the failure store and documents that get rejected.

flash1293 commented 6 months ago

Sounds like a good solution! If I'm understanding correctly, these numbers would only be held in memory, so they would be reset on restart, but that wouldn't matter to us so much as we are mostly worried about the ratio.

When using APM counter metrics for this, how would we report these numbers as telemetry? Is this something Elasticsearch is doing internally or would it still be about exposing the values via an endpoint / part of an existing endpoint and Kibana leverages it?

felixbarny commented 6 months ago

Yes, the counters are reset on restart and every ES node has their own counter. But that shouldn't matter too much. We should be able to do rate aggregations on these counters and also calculate percentages over time. Depending on how fine-grained we want to track the data, we could add dimensions for the index/data stream name. However, there's a risk that the index name is high cardinality, which would impact the memory overhead in ES, and the storage requirements, and the query latency when aggregating lots of time series. We talked about only tracking metrics per data stream for managed data streams, which would include our integration data streams but not custom ones.

felixbarny commented 6 months ago

When using APM counter metrics for this, how would we report these numbers as telemetry? Is this something Elasticsearch is doing internally or would it still be about exposing the values via an endpoint / part of an existing endpoint and Kibana leverages it?

I believe we report these by default for serverless but don't do it for ESS or on-prem. Enabling for ESS seems feasible, though. If we want something that works for on-prem, I don't think there's an alternative to doing it in Kibana telemetry. We can also combine the two approaches.

flash1293 commented 6 months ago

We talked about only tracking metrics per data stream for managed data streams, which would include our integration data streams but not custom ones.

Tracking this for non-managed data streams would be important I think.

A detailed time series with a per-minute resolution or something would be less important than a more granular view per day or so if that helps (this is also what we would do on Kibana side)

salvatore-campagna commented 6 months ago

The reason why I mentioned the storage aspect of this is that, for _ignored, as for other metadata fields, understanding disk usage might also be important to reason about costs that we propagate to users. This is because , at least with Serverless disk usage pricing model, the amount of data we index in metadata fields matters.

For the ignored field specific case anyway we have that:

data in transit: data that is sent in a field which is going to be ignored later, because it does not match the field type, is accounted for, since the bytes are counted as part of the total number of bytes of the original document.
data at rest: the only difference here is that instead of storing the field value bytes as part of the field they belong to, the field name of the malformed field ends up in the _ignored field. So I would say that is also accounted for because we are not storing the original bytes in the intended field (due to malformed field) but just the field name (note that size(ignored_field_name) will normally be different from size(ignored_field_value)). Note also that when using synthetic source we actually also store the value in a stored field for malformed fields so that later on we are able to reconstruct it.

So for _ignored we probably need to count documents having at least one ignored field and total number of documents (which we already have).

Anyway, in general for metadata fields, including _ignored, we might want to track the number of bytes we store. This might be useful for other fields we use under the hood like stored fields used for ignored values or to support synthetic source. I would go as far as just tracking the number of bytes stored per field including regular and metadata fields but that would result in a lot of time series to store as a result of the cardinality of all fields (one counter per field). For this reason, to avoid explosion in the number of time series we need to store it might make sense we only track bytes stored for metadata fields. Having that we could at least compare storage required for metadata versus overall storage. What do you think?

Also knowing how many bytes we store in stored fields used for synthetic source fallback solution might be useful for us and for our customers to have a kind of measure of "synthetic source effectiveness".

I see measuring storage for metadata fields as a way to measure storage overhead...which might be useful to decide on pricing (for us) but also for our customers to decide on things like adoption of synthetic source, or adoption of a different index mode.

flash1293 commented 6 months ago

Thanks for the context @salvatore-campagna

To summarize:

APM could be used for ES-internal telemetry, but it wouldn't be available on on-prem and it might not be feasible to track it with high granularity (per datastream / index)
We could build a separate thing to track it per index as a "side effect" if we are anyway tracking storage for these fields separately

salvatore-campagna commented 6 months ago

I was wondering if this can be done at Lucene segment level just by storing an (additional) number, or a counter to be more precise. Thinking about this we have the following:

if the _ignored field does not exist in the segment it means there are 0 documents with ignored fields
if the _ignored field exist we just use a per segment counter (we dont need to do anything specific just count how many documents have the _ignored field when flush happens)
when merging two segments we just need to sum those counters up

The stats api would just go there and fetch a number eventually summing up values if an index has multiple segments.

Drawbacks:

Additional Lucene format to maintain
Backward compatibility (just return 0 count for old indices/segments?)
Dealing with deletes and no-ops to keep the counter up-to-date is challenging
Documents still in memory and not flushed to Lucene segments would not contribute to the count

salvatore-campagna commented 6 months ago

@jpountz any idea about this?

I think we should not do this unless there is actually a better and more efficient way (with respect to running the aggregation).

jpountz commented 6 months ago

@salvatore-campagna If I read the source code correctly, the _ignored field has an inverted index, so it already provides us with index statistics (see the org.apache.lucene.index.Terms class):

getDocCount(): number of documents with one or more ignored fields
getSumDocFreq(): number of ignored fields

As you probably guessed, these statistics ignore deletes and documents that are only in the IndexWriter buffer (not flushed yet).

salvatore-campagna commented 6 months ago

This will be enough just to have a counter for documents so that we can efficiently count the ratio of documents having the _ignored field and not for counting bytes used. I believe that is enough anyway.

flash1293 commented 6 months ago

Thanks Adrien and Salvatore - I agree that this would be enough, updated the description.

salvatore-campagna commented 5 months ago

@flash1293

What about a response like this?

"_all" : {
   "primaries" : {
     "docs" : {
       "count" : 4,
       "deleted" : 0,
       "total_size_in_bytes" : 18868,
       "docs_with_ignored_fields" : 2
     }
   },
   "total" : {
     "docs" : {
       "count" : 4,
       "deleted" : 0,
       "total_size_in_bytes" : 18868,
       "docs_with_ignored_fields" : 2
     }
   }
 },

This way you have values to calculate percentage close to each other.

flash1293 commented 5 months ago

This looks great @salvatore-campagna !

salvatore-campagna commented 5 months ago

Note that this won't work for indices restored from source only snapshots too...since they are missing doc values and inverted index (the inverted index is required to count documents including the _ignored field). It would work, anyway, after reindexing. See https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-source-only-repository.html#snapshots-source-only-repository.

elasticsearchmachine commented 5 months ago

Pinging @elastic/es-storage-engine (Team:StorageEngine)

elastic / elasticsearch

_ignored meta field index stats #108092

Description