tetianakravchenko opened this issue 1 year ago
Note: for the same timestamp Aug 24, 2023 @ 16:58:06.491 there are 5 documents with the same set of labels (`prometheus.labels_id` is a fingerprint of `prometheus.labels`); for some of them the ingestion time is different, but mostly even `event.ingested` is the same:
What's the reason for the seemingly duplicated metrics? Are they actual duplicates? They seem to come from the same pod id and the same time, how does that happen? Or is that missing a dimension? But which dimension is missing, if any? What would be the impact of dropping duplicate metrics of that sort?
Are the metrics coming from the same prometheus instance? If so, may the remote write configuration be faulty so that it writes multiple times, for example, because the same remote write endpoint is configured multiple times?
> Note for the same timestamp Aug 24, 2023 @ 16:58:06.491 there are 5 documents with the same set of labels (`prometheus.labels_id` is a fingerprint of the `prometheus.labels`), for some of them ingestion time is different, but mainly even the `event.ingested` is the same:
>
> What's the reason for the seemingly duplicated metrics?
I think it depends on `max_samples_per_send` (default: 500) and the fact that Prometheus does not send all metrics in one batch; Metricbeat, in turn, performs grouping per batch. I've run this test:
Prometheus configuration, scraping only the Prometheus server metrics:

```yaml
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - localhost:9090
```
Total amount of metrics: 630.

```shell
root@test-worker2:/usr/share/elastic-agent# curl -s prometheus-server-server.default:80/metrics | grep -v ^# | wc -l
630
```
When running the TSDB-migration-test-kit, I see that some documents overlap when using `prometheus.labels_id` as the main dimension that is supposed to distinguish documents (similar to the collector datastream). If I set:

```yaml
remoteWrite:
  - url: http://elastic-agent.kube-system:9201/write
    queue_config:
      max_samples_per_send: 1000  # higher than the 630 metrics
```

then `labels_id` can be used as a dimension and there are no overlaps.
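The batching effect described above can be sketched in Python. This is an illustrative model, not the actual Prometheus or Metricbeat code: if the sender splits one scrape's samples across several remote-write requests, and the receiver builds one document per label set *per request*, the same label set yields as many documents as there are requests.

```python
# Illustrative sketch of why max_samples_per_send produces documents
# with a duplicated label set. Function names are hypothetical.

def split_into_batches(samples, max_samples_per_send):
    """Mimic Prometheus remote write splitting samples into requests."""
    return [samples[i:i + max_samples_per_send]
            for i in range(0, len(samples), max_samples_per_send)]

def group_per_batch(batches):
    """Mimic a receiver that groups samples into one document
    per label set, but only within each batch."""
    docs = []
    for batch in batches:
        grouped = {}
        for name, labels, value in batch:
            grouped.setdefault(labels, {})[name] = value
        docs.extend({"labels": labels, "metrics": metrics}
                    for labels, metrics in grouped.items())
    return docs

# 630 metrics sharing one label set, as in the test above.
samples = [(f"metric_{i}", ("instance", "localhost:9090"), 1.0)
           for i in range(630)]

# Default max_samples_per_send=500: 2 batches, so 2 documents
# for the same label set.
print(len(group_per_batch(split_into_batches(samples, 500))))   # 2
# max_samples_per_send=1000 (> 630): 1 batch, 1 document.
print(len(group_per_batch(split_into_batches(samples, 1000))))  # 1
```

With the larger batch size all samples of the label set land in one request, so the label set appears in exactly one document and can safely act as a dimension.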
> Are they actual duplicates?
No. That was not a correct name for this issue; I've renamed the issue to "Metrics are not grouped by labels".
> They seem to come from the same pod id and the same time, how does that happen? Or is that missing a dimension? But which dimension is missing, if any?
The first question, I believe, is covered by the test above. Yes, we are missing a dimension in this case.
I am trying to investigate this approach: add `prometheus.labels.metric_names` to `prometheus.labels`:

```json
prometheus: {
  "name1": {
    "counter": <val>
  },
  "name2": {
    "value": <val>
  },
  "labels": {
    "metric_names": [
      "name1",
      "name2"
    ]
  },
  "labels_id": <fingerprint>
}
```
It might not be a perfect solution, as the fingerprint might change when the number of scraped endpoints/metrics changes. Another option would be to update the already published document with the specific timestamp that has the same set of labels/`labels_id` fingerprint.
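The idea of folding the metric names into the fingerprint can be sketched as follows. The `fingerprint` helper here is hypothetical (the real mechanism is an ingest pipeline processor); it only illustrates that sorting makes the result order-independent, while changing the metric set changes the identity, which is exactly the fragility mentioned above:

```python
import hashlib
import json

def fingerprint(labels, metric_names):
    """Hypothetical fingerprint over labels plus the sorted metric names."""
    payload = {"labels": labels, "metric_names": sorted(metric_names)}
    # sort_keys makes the result independent of dict insertion order
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()

labels = {"instance": "localhost:9090", "job": "prometheus"}

fp1 = fingerprint(labels, ["name1", "name2"])
fp2 = fingerprint(labels, ["name2", "name1"])  # same set, different order
fp3 = fingerprint(labels, ["name1"])           # metric set changed

print(fp1 == fp2)  # True: order does not matter after sorting
print(fp1 == fp3)  # False: a changed metric set changes the fingerprint
```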
> What would be the impact of dropping duplicate metrics of that sort?
It would imply data loss.
> Are the metrics coming from the same prometheus instance? If so, may the remote write configuration be faulty so that it writes multiple times, for example, because the same remote write endpoint is configured multiple times?
Yes, the data is coming from the same Prometheus. Configuration:

```yaml
prometheus.yml: |
  global:
    evaluation_interval: 1m
    scrape_interval: 1m
    scrape_timeout: 10s
  remote_write:
    - url: http://elastic-agent.kube-system:9201/write
  rule_files:
    - /etc/config/recording_rules.yml
    - /etc/config/alerting_rules.yml
    - /etc/config/rules
    - /etc/config/alerts
  scrape_configs:
    - job_name: prometheus
      static_configs:
        - targets:
            - localhost:9090
```
I got it now, thanks for the explanations and the detailed analysis!
The underlying issue here is that metric names are not part of the _tsid. I thought this was mostly a non-issue as all metrics for the same _tsid are usually in the same document. You've found a good example where this isn't the case.
> use fingerprint to calculate `prometheus.labels_id` that includes `prometheus.labels.metric_names`
I think this is a good short-term workaround. But maybe I'd slightly change the approach. Instead of adding metric_names to labels, this could be top-level and marked as a dimension instead of being part of the labels fingerprint. Ultimately, it doesn't matter too much.
The more interesting discussion is that I think TSDB should add metric names to `_tsid`. That's because the name of a metric is part of the identity of a metric. See also the definition of a time series, according to the OpenTelemetry metrics data model: https://opentelemetry.io/docs/specs/otel/metrics/data-model/#timeseries-model
cc @martijnvg
> Instead of adding metric_names to labels, this could be top-level and marked as a dimension instead of being part of the labels fingerprint.
In the end both fingerprints are needed: the labels fingerprint and the metric-name fingerprint must be defined as dimensions. My motivation was:

`metric_name{label_name=X}` == `{__name__=metric_name, label_name=X}`

Could you please explain why it should be added at the top level, instead of being part of the labels fingerprint?
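The equivalence above can be illustrated with a small sketch: treating `__name__` as just another label makes the two notations produce the same series identity. The `series_id` helper is hypothetical, used only to make the point concrete:

```python
import hashlib
import json

def series_id(labels):
    """Fingerprint of a label set, with keys sorted for stability."""
    return hashlib.sha256(
        json.dumps(labels, sort_keys=True).encode()).hexdigest()

# metric_name{label_name="X"} is the same series as
# {__name__="metric_name", label_name="X"}:
a = series_id({"__name__": "metric_name", "label_name": "X"})
b = series_id({"label_name": "X", "__name__": "metric_name"})
print(a == b)  # True: key order is irrelevant

# A different metric name over the same labels is a different series.
c = series_id({"__name__": "other_metric", "label_name": "X"})
print(a == c)  # False
```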
I am also not planning to store the `metric_names`, as it is redundant information and can impact the document size. I am planning to use something like https://github.com/elastic/integrations/pull/7565/files#diff-03b3cb0809132fbdf6119d02478854a135b678fb0e2db1d689cf6b44804daba1R2-R19 (not sure yet about 2 vs 1 fingerprints).
+1 on not storing `metric_names`, just the fingerprint.
> Could you please explain why it should be added on the top-level, instead of being part of the labels fingerprint?
I don't have a strong opinion here and I don't think it matters too much. It was just my first instinct not to store the fingerprint under `labels.*`, to avoid field suggestions for that fingerprint field when someone types `labels.`. Again, it is not something to worry about too much. We should consider these fields an implementation detail that we can always change without it being a breaking change.
Question: if we have the same `metric_names` object, `[<a>, <b>, <c>]` vs `[<c>, <b>, <a>]`, will the fingerprint be the same?
No, it won't. In order to ensure that, you'll need to sort the array first. The same applies to the labels fingerprint, by the way.
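The order sensitivity can be demonstrated directly with plain JSON hashing (an illustration, not the actual fingerprint processor):

```python
import hashlib
import json

def naive_fp(value):
    """Hash the JSON encoding of a value, exactly as given."""
    return hashlib.sha256(json.dumps(value).encode()).hexdigest()

# Arrays hash in the order given, so the same set of names can
# produce different fingerprints...
print(naive_fp(["a", "b", "c"]) == naive_fp(["c", "b", "a"]))  # False

# ...unless the array is sorted before fingerprinting.
print(naive_fp(sorted(["a", "b", "c"]))
      == naive_fp(sorted(["c", "b", "a"])))  # True
```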
> No, it won't. In order to assure that, you'll need to sort the array first. The same applies to the labels fingerprint btw.
Thank you for the reply! I am not sure that it applies to labels. If I understand the fingerprint implementation correctly (https://github.com/elastic/elasticsearch/blob/main/modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/FingerprintProcessor.java#L110-L122), the map is sorted and processed in a consistent order.
The labels object looks like this:

```json
prometheus: {
  "name1": {
    "counter": <val>
  },
  "name2": {
    "value": <val>
  },
  "labels": {
    "metric_names": [
      "name1",
      "name2"
    ],
    "key1": "value1"
  },
  "labels_fingerprint": <fingerprint>
}
```
Ah, I didn't know that the fingerprint processor is already sorting the values! Looks good then.
For remote_write it is not possible to define a list of dimensions that would prevent document duplication and, in the end, dropped documents (when enabling TSDB). Metrics are not grouped by the unique list of labels, as they are for the collector datastream.

Example: for the same timestamp Aug 24, 2023 @ 16:58:06.491 there are 5 documents with the same set of labels (`prometheus.labels_id` is a fingerprint of the `prometheus.labels` object; this approach is used for the collector datastream); for some of them the ingestion time is different, but mostly even `event.ingested` is the same.

Document sample 1:

Document sample 2: