**Open** · flash1293 opened this issue 6 months ago
Pinging @elastic/es-search (Team:Search)
> it requires read permissions
Could you expand on why this is problematic?
> potentially expensive search queries to get basic statistics
@martijnvg or @javanna could you help us understand how expensive it is to do an exists query on the _ignored field (which now has doc_values)? Are there optimizations/short-circuits when an entire index is matched?
Check how much space is used for storing it
The field usage API may be useful for that.
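As a reference point, the field usage stats API mentioned above can be queried per index. A minimal sketch (the index name is a placeholder, and note this API reports how often a field's data structures are accessed during searches, not how many bytes they occupy on disk):

```
GET /my-index-000001/_field_usage_stats?fields=_ignored
```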
Are you interested in the number of ignored fields for an entire data stream or index? Or do you want the ability to filter based on a time range?
Is it for internal telemetry purposes or to show this information to users? One option would be to just track a counter metric via APM. That won't show you the total number/ratio of _ignored docs but you could plot that number over time.
If you want to show the ratio of ignored docs in the dataset quality page, which supports selecting the time range, maybe just doing a simple exists query on the _ignored field is your best bet.
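A minimal sketch of such an exists query, assuming a hypothetical data stream name and a 24-hour window (`size: 0` avoids fetching documents, and `track_total_hits: true` forces an exact count rather than the default lower bound of 10,000):

```
GET /my-data-stream/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        { "exists": { "field": "_ignored" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ]
    }
  }
}
```

The ratio of ignored documents would then be the returned `hits.total.value` divided by the document count over the same filter without the `exists` clause.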
> Could you expand on why this is problematic?
This is very specific about the potential use of collecting telemetry on this from Kibana - giving full read access to all indices to all Kibana background tasks is problematic as it violates the least-privilege concept.
> Are you interested in the number of ignored fields for an entire data stream or index? Or do you want the ability to filter based on a time range?
@salvatore-campagna brought up the storage angle, that's why I opened this issue - he can comment better on it.
> Is it for internal telemetry purposes or to show this information to users?
My use case is internal telemetry purposes, but after talking to @salvatore-campagna we thought that there might be other potential use cases.
> One option would be to just track a counter metric via APM. That won't show you the total number/ratio of _ignored docs but you could plot that number over time.
You mean it won't show the total number because document deletes/updates are not accounted for properly?
Tracking a counter metric sounds like it would solve the telemetry use case, as this is less about perfect fidelity and more about being able to track the rough number over time as a success metric.
> You mean it won't show the total number because document deletes/updates are not accounted for properly?
When tracking a counter, the retention of the data itself and the metrics are decoupled. You may retain the metrics for a month but the logs are only retained for a week. I don't think this is necessarily an issue but it means you can't answer questions like how many _ignored fields are set in this particular data stream right now. But you can still track the rate of ingested documents that have ignored fields over time. And I think that's the relevant thing.
@felixbarny If we do this, we could also have two counters (one for healthy docs and one for degraded docs) to get a ratio for "health of incoming docs", right?
Yes, and we should probably also have counters for documents that enter the failure store and documents that get rejected.
Sounds like a good solution! If I'm understanding correctly, these numbers would only be held in memory, so they would be reset on restart, but that wouldn't matter to us so much as we are mostly worried about the ratio.
When using APM counter metrics for this, how would we report these numbers as telemetry? Is this something Elasticsearch is doing internally or would it still be about exposing the values via an endpoint / part of an existing endpoint and Kibana leverages it?
Yes, the counters are reset on restart and every ES node has their own counter. But that shouldn't matter too much. We should be able to do rate aggregations on these counters and also calculate percentages over time. Depending on how fine-grained we want to track the data, we could add dimensions for the index/data stream name. However, there's a risk that the index name is high cardinality, which would impact the memory overhead in ES, and the storage requirements, and the query latency when aggregating lots of time series. We talked about only tracking metrics per data stream for managed data streams, which would include our integration data streams but not custom ones.
> When using APM counter metrics for this, how would we report these numbers as telemetry? Is this something Elasticsearch is doing internally or would it still be about exposing the values via an endpoint / part of an existing endpoint and Kibana leverages it?
I believe we report these by default for serverless but don't do it for ESS or on-prem. Enabling for ESS seems feasible, though. If we want something that works for on-prem, I don't think there's an alternative to doing it in Kibana telemetry. We can also combine the two approaches.
> We talked about only tracking metrics per data stream for managed data streams, which would include our integration data streams but not custom ones.
Tracking this for non-managed data streams would be important I think.
A detailed time series with per-minute resolution or so would be less important than a coarser per-day view, if that helps (this is also what we would do on the Kibana side).
The reason why I mentioned the storage aspect of this is that, for _ignored, as for other metadata fields, understanding disk usage might also be important to reason about the costs that we propagate to users. This is because, at least with the Serverless disk usage pricing model, the amount of data we index in metadata fields matters.
Anyway, for the _ignored field specifically, we probably need to count the documents having at least one ignored field and the total number of documents (which we already have).
Anyway, in general for metadata fields, including _ignored, we might want to track the number of bytes we store. This might be useful for other fields we use under the hood, like the stored fields used for ignored values or to support synthetic source. I would go as far as tracking the number of bytes stored per field, including regular and metadata fields, but that would result in a lot of time series to store because of the cardinality of the field set (one counter per field). For this reason, to avoid an explosion in the number of time series we need to store, it might make sense to only track bytes stored for metadata fields. With that we could at least compare the storage required for metadata fields against overall storage. What do you think?
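For context, Elasticsearch already exposes a one-shot way to inspect per-field storage, including metadata fields like _ignored and _source: the analyze index disk usage API. A sketch (the index name is a placeholder; the API is expensive, hence the required opt-in flag, so it is a diagnostic rather than something to poll for telemetry):

```
POST /my-index-000001/_disk_usage?run_expensive_tasks=true
```

Its response breaks down bytes used per field by data structure (doc values, inverted index, stored fields, etc.), which is close to the per-field byte counts discussed above, just not tracked continuously as a metric.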
Also knowing how many bytes we store in stored fields used for synthetic source fallback solution might be useful for us and for our customers to have a kind of measure of "synthetic source effectiveness".
I see measuring storage for metadata fields as a way to measure storage overhead...which might be useful to decide on pricing (for us) but also for our customers to decide on things like adoption of synthetic source, or adoption of a different index mode.
Thanks for the context @salvatore-campagna
To summarize:
I was wondering if this can be done at the Lucene segment level just by storing an (additional) number, or a counter to be more precise. Thinking about this, we have the following:

- If the _ignored field does not exist in the segment, it means there are 0 documents with ignored fields.
- If the _ignored field exists, we just use a per-segment counter (we don't need to do anything specific, just count how many documents have the _ignored field when the flush happens).

The stats API would just go there and fetch a number, eventually summing up values if an index has multiple segments.
Drawbacks:
@jpountz any idea about this?
I think we should not do this unless there is actually a better and more efficient way (with respect to running the aggregation).
@salvatore-campagna If I read the source code correctly, the _ignored field has an inverted index, so it already provides us with index statistics (see the org.apache.lucene.index.Terms class):

- getDocCount(): number of documents with one or more ignored fields
- getSumDocFreq(): number of ignored fields

As you probably guessed, these statistics ignore deletes and documents that are only in the IndexWriter buffer (not flushed yet).
This will be enough to have a counter for documents, so that we can efficiently compute the ratio of documents having the _ignored field, but not to count the bytes used. I believe that is enough anyway.
Thanks Adrien and Salvatore - I agree that this would be enough, updated the description.
@flash1293 What about a response like this?
"_all" : {
"primaries" : {
"docs" : {
"count" : 4,
"deleted" : 0,
"total_size_in_bytes" : 18868,
"docs_with_ignored_fields" : 2
}
},
"total" : {
"docs" : {
"count" : 4,
"deleted" : 0,
"total_size_in_bytes" : 18868,
"docs_with_ignored_fields" : 2
}
}
},
This way you have the values needed to calculate the percentage close to each other.
This looks great @salvatore-campagna !
Note that this won't work for indices restored from source-only snapshots, since they are missing doc values and the inverted index (the inverted index is required to count documents containing the _ignored field). It would work after reindexing, though. See https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-source-only-repository.html#snapshots-source-only-repository.
Pinging @elastic/es-storage-engine (Team:StorageEngine)
Description

The recently introduced _ignored meta field (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ignored-field.html) is helpful to detect issues with data ingestion. However, it requires read permissions and potentially expensive search queries to get basic statistics. Adding information about _ignored field usage to the index stats would allow monitoring it cheaply.

The API could look like this:

Implementation notes: https://github.com/elastic/elasticsearch/issues/108092#issuecomment-2116045095