cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Are huge amounts of 'sample timestamp out of order' logs normal? #3411

Closed jakubgs closed 3 years ago

jakubgs commented 3 years ago

Description

Every time I modify the Cortex configuration and restart the nodes, they generate ungodly amounts of logs like this:

msg="push error" err="rpc error: code = Code(400) desc = user=fake: sample timestamp out of order; last timestamp: ...

I assume this is because the upstream Prometheus instance is retrying pushes of the metrics that failed while the node was down.

It generates quite a lot of them...

 > sudo journalctl -a -u cortex | grep 'sample timestamp out of order' | wc -l
173476

Questions

  • Is Cortex incapable of ingesting old metrics re-pushed by Prometheus? Or am I doing something wrong?
  • If Cortex is incapable of ingesting old metrics, why is this an error rather than a warning or even a debug message?
  • Can I stop this specific message from spamming my logs somehow?

pstibrany commented 3 years ago
msg="push error" err="rpc error: code = Code(400) desc = user=fake: sample timestamp out of order; last timestamp: ...

This happens when Prometheus sends old samples for series for which Cortex already has newer samples. Cortex can only ingest samples for a given series in order.

I assume this is because the upstream Prometheus instance is retrying pushes of the metrics that failed while the node was down.

On restart, Prometheus tries to push all data from its WAL to remote write. It doesn't remember how much it has pushed before the restart.

Another source of "sample timestamp out of order" is when running multiple Prometheus servers pushing the same metrics, without proper HA deduplication configuration in Cortex (https://cortexmetrics.io/docs/guides/ha-pair-handling/).
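Roughly, that setup means giving each Prometheus replica cluster and __replica__ external labels and enabling the HA tracker in Cortex. A minimal sketch along the lines of the linked guide (illustrative values; exact keys depend on your Cortex version):

# Prometheus (per replica)
global:
  external_labels:
    cluster: prom-team1      # same for both replicas of a pair
    __replica__: replica-1   # unique per replica; removed by Cortex from accepted samples

# Cortex
limits:
  accept_ha_samples: true
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      store: etcd            # the HA tracker needs a KV store (etcd or consul)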

It generates quite a lot of them...

 > sudo journalctl -a -u cortex | grep 'sample timestamp out of order' | wc -l
173476

Questions

  • Is Cortex incapable of ingesting old metrics re-pushed by Prometheus? Or am I doing something wrong?

See above.

  • If Cortex is incapable of ingesting old metrics, why is this an error rather than a warning or even a debug message?

It can indicate frequent Prometheus restarts or a misconfiguration on the Cortex side.

  • Can I stop this specific message from spamming my logs somehow?

Not that I know of. But you're right that it's logging too much, and we can improve it.

jakubgs commented 3 years ago

Thanks for explaining, but this lasts for a LONG time. I restarted my Cortex 11 minutes ago and the logs are still full of these messages. How long should it take for them to stop?

pstibrany commented 3 years ago

That depends on the cause of this problem. If you have multiple Prometheus servers sending the same series (common in HA setup), then it will only stop once Cortex is properly configured. See the linked article above.

jakubgs commented 3 years ago

No, I do not have a setup like that. I have a single Prometheus instance per DC, and there should be no duplicates.

For example: I restarted 2 out of 3 Cortex instances. The two restarted ones are spamming their logs with copious amounts of sample timestamp out of order, while the one that was not restarted is not.

pstibrany commented 3 years ago

I have a single Prometheus instance per DC

Do you have multiple DCs? Is it possible that they are sending the same series to Cortex? Those sample timestamp out of order messages include full series information, which can be used to track this down. Can you share some of them?

pstibrany commented 3 years ago

You may want to use external_labels in your Prometheus config to fix this (see https://prometheus.io/docs/prometheus/latest/configuration/configuration/)
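For example, each DC's Prometheus could carry its own distinguishing external label (hypothetical label name and value):

global:
  external_labels:
    prometheus_dc: gc-us-central1-a   # unique per Prometheus / DC, added to every pushed series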

jakubgs commented 3 years ago

No, it is not possible. Examples:

{__name__=\"netdata_services_cpu_percentage_average\", chart=\"services.cpu\", datacenter=\"gc-us-central1-a\", dimension=\"systemd-journald\", family=\"cpu\", fleet=\"thehive.prod\", group=\",monitor,metrics-source,\", instance=\"slave-02.gc-us-central1-a.thehive.prod\", job=\"netdata\", source=\"slave-01.gc-us-central1-a.metrics.hq\"}
{__name__=\"netdata_cgroup_writeback_MiB_average\", chart=\"cgroup_elasticsearch_hq.writeback\", datacenter=\"gc-us-central1-a\", dimension=\"dirty\", family=\"mem\", fleet=\"thehive.prod\", group=\",monitor,metrics-source,\", instance=\"slave-02.gc-us-central1-a.thehive.prod\", job=\"netdata\", source=\"slave-01.gc-us-central1-a.metrics.hq\"}
{__name__=\"netdata_ip_tcpreorders_packets_persec_average\", chart=\"ip.tcpreorders\", datacenter=\"aws-eu-central-1a\", dimension=\"fack\", family=\"tcp\", fleet=\"nimbus.test\", group=\",monitor,metrics-source,\", instance=\"node-04.aws-eu-central-1a.nimbus.test\", job=\"netdata\", source=\"slave-01.aws-eu-central-1a.metrics.hq\"}

Labels include fleets and hostnames. Each DC has one Prometheus instance and each scrapes only its own DC.

jakubgs commented 3 years ago

Looks like it calmed down for the most part on one of the hosts 30 minutes after the restart. The other one is still spamming.

I might add that all Prometheus instances push to the same cluster, but the one Cortex instance that has not been restarted does not log any warnings like that. Zero.

jakubgs commented 3 years ago

You may want to use external_labels in your Prometheus config to fix this

# The labels to add to any time series or alerts when communicating with
# external systems (federation, remote storage, Alertmanager).
external_labels:
[ <labelname>: <labelvalue> ... ]

What would be the point of that if they are already labeled with fleet, datacenter, and instance?

The issue is not about uniqueness. If it were, I would be getting these warnings ALL THE TIME. But I do not. I get them only when I restart a Cortex instance, and only on that instance. Why would only that instance get duplicates if my Prometheus instances are all configured to push to all 3?

jakubgs commented 3 years ago

Like right now, the instance I restarted yesterday is fine:

 > sudo journalctl -f -u cortex
-- Logs begin at Sun 2020-11-01 01:26:06 UTC. --
level=info ts=2020-11-05T09:35:20.375961889Z caller=table_manager.go:333 msg="synching tables" expected_tables=30
level=info ts=2020-11-05T09:37:20.376293268Z caller=table_manager.go:333 msg="synching tables" expected_tables=30
level=info ts=2020-11-05T09:39:20.376143806Z caller=table_manager.go:333 msg="synching tables" expected_tables=30
level=info ts=2020-11-05T09:41:20.376198511Z caller=table_manager.go:333 msg="synching tables" expected_tables=30
level=info ts=2020-11-05T09:43:20.376095958Z caller=table_manager.go:333 msg="synching tables" expected_tables=30
level=info ts=2020-11-05T09:45:20.376199033Z caller=table_manager.go:333 msg="synching tables" expected_tables=30
level=info ts=2020-11-05T09:47:20.376277278Z caller=table_manager.go:333 msg="synching tables" expected_tables=30
level=info ts=2020-11-05T09:49:20.376164123Z caller=table_manager.go:333 msg="synching tables" expected_tables=30
level=info ts=2020-11-05T09:51:20.376190972Z caller=table_manager.go:333 msg="synching tables" expected_tables=30
level=info ts=2020-11-05T09:53:20.376224666Z caller=table_manager.go:333 msg="synching tables" expected_tables=30

So if the issue was really that I get duplicates from different DCs these logs would be FULL of sample timestamp out of order.

Akhalaka commented 3 years ago

We are experiencing this issue as well, with the same symptoms as described above: everything ran perfectly happily for months, then after an issue with the etcd database Cortex was brought back into working order, and now there are many "sample timestamp out of order for series" messages from all our ingesters (lasting for at least 6 hours now). According to the Prometheus instance suspected of sending this traffic, around 0.1% of its series are being rejected, based on the prometheus_remote_storage_failed_samples_total metric (but enough to trigger our alerting).

Unsure if this is relevant, but an interesting note: Cortex is set up for multi-tenancy and, as far as we can tell, all the series that get flagged as out of order are coming from our largest tenant and the Prometheus mentioned above. None of the smaller tenants (smaller by at least an order of magnitude) are seeing any series rejected with this out-of-order message.

bboreham commented 3 years ago

Today it was my turn to debug "sample timestamp out of order" errors. My timestamps were about half a second apart, e.g. last timestamp: 1609766459.269, incoming timestamp: 1609766458.639. There was only one Prometheus pod configured to send.

However, I discovered there were actually two Prometheus servers sending! Looking at prometheus_build_info, there was an extra one, in a pod that is not reported by Kubernetes. I found that the older Prometheus was on a node that had "gone bad" somehow, showing "NodeStatusUnknown; Kubelet stopped posting node status.", so it seems Kubernetes gave up on the pod and started another Prometheus. Restarting kubelet on the failed node brought it back, and the unwanted container was shut down.

portalsacc commented 3 years ago

I see the same issue; I am getting errors related to out of bounds and duplicate sample for timestamp. My Cortex environment is still in the POC stage, so I shut it down for an hour and redeployed. Even then I can still see these errors.

I would like to request answers to the questions below:

  1. What are the necessary arguments when deploying the ingester? (A rough config sketch is included at the end of this comment.)
  2. I can see datapoints in Grafana every 15s, but I am expecting them every 60 seconds. May I know which argument helps me get a datapoint every minute?
  3. Like Metrictank, do we also have something similar to storage-schemas?

Current architecture: consul, distributor, ingester, querier, query-frontend, memcached, store-gateway, and nginx.
Backend: long-term storage in S3 (blocks).
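For context on question 1, a rough sketch of the ingester-relevant part of a config for this kind of setup (illustrative keys and values only, not the actual deployment; check the configuration reference for your Cortex version):

ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul          # ring stored in consul, matching the architecture above
      replication_factor: 3
blocks_storage:
  backend: s3                  # long-term storage in S3 (blocks)
  s3:
    bucket_name: cortex-blocks # placeholder bucket name
  tsdb:
    dir: /data/tsdb            # local TSDB directory for the ingester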

pracucci commented 3 years ago

duplicate sample for timestamp

This typically happens when you have clashing series exported by the same exact exporter. We've seen cAdvisor exporting clashing series in some conditions.

portalsacc commented 3 years ago

If it is clashing series from the same exporter, how do I address that? A separate telegraf microservice (the only writer to Cortex) is exporting metrics to Cortex as of now; I don't understand why I see the errors below:

  • Out of order sample
  • Duplicate sample for timestamp
  • Out of bounds

Also, can someone help me ingest metrics at a one-minute interval? (Rather than a datapoint every 15s, I am interested in one every 60 seconds.)

pracucci commented 3 years ago

Some general guidance.

Out of order sample

It means: given a single time-series, Cortex has received a sample with a timestamp older than the latest timestamp received. Cortex requires samples to be pushed in timestamp order.

Duplicate sample for timestamp

It means: given a single time-series, Cortex has received two samples with the same exact timestamp but different value. It's not allowed in Cortex.

Out of bounds

It means: Cortex has received a sample with a timestamp "too far in the past" (eg. in the blocks storage you can ingest samples up to 1h old compared to the most recent timestamp received for the same tenant).

nehaev commented 3 years ago

Cortex requires samples to be pushed in timestamp order

Are there any plans to completely eliminate or at least relax this limitation? The existing behavior is especially problematic for Loki, as it leads to loss of log records that cannot be entirely fixed on the client side.

bboreham commented 3 years ago

No, timeseries are compressed by computing the delta from the previous sample, so re-ordering requires exponentially more work. I don't see why this would constrain Loki though.

DandyDeveloper commented 3 years ago

@bboreham Just a side note to this: what's the general fix / expectation on the Prometheus side for these errors?

I have a tenancy proxy in between my Prometheus and Cortex, and when I have a bad series, this effectively creates a backlog and completely stops ingestion on the node until I have manually removed the series using the admin API on Prometheus.

Is there an expectation or option in Prometheus to drop the remote_write for a series if it gets too many failures?

In my case, I have a series that basically keeps sending and has flooded my proxy.
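(As far as I know Prometheus has no option to drop a series automatically after repeated failures, but a write_relabel_configs rule on the remote_write block can drop a known-bad series before it is sent; a minimal sketch with a placeholder metric name and endpoint:)

remote_write:
  - url: http://cortex.example.local/api/v1/push   # placeholder endpoint
    write_relabel_configs:
      - source_labels: [__name__]
        regex: bad_metric_name                     # placeholder; match the offending series
        action: drop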

bboreham commented 3 years ago

@DandyDeveloper the message in the original description says Code(400) - samples rejected with such errors are dropped immediately by Prometheus and not retried. There is no reason for that to cause a backlog.

If your symptoms are different please open a different issue.

bboreham commented 3 years ago

Closing this as I don't think there is anything to fix; it can serve as documentation if people need to dig into what "sample out of order" does or doesn't mean.

disfluxly commented 3 years ago

Some general guidance.

Out of order sample

It means: given a single time-series, Cortex has received a sample with a timestamp older than the latest timestamp received. Cortex requires samples to be pushed in timestamp order.

Duplicate sample for timestamp

It means: given a single time-series, Cortex has received two samples with the same exact timestamp but different value. It's not allowed in Cortex.

Out of bounds

It means: Cortex has received a sample with a timestamp "too far in the past" (eg. in the blocks storage you can ingest samples up to 1h old compared to the most recent timestamp received for the same tenant).

@pracucci - Sorry to put this on a closed thread, but figured it fits for documentation purposes.

What is the "order" (pun intended) that these errors are raised in? For example, suppose I have a sample that is out of bounds, out of order, AND is a duplicate sample?

bboreham commented 3 years ago

A single timestamp can't be both duplicate and out of order.

Out of bounds is checked before out of order, here: https://github.com/cortexproject/cortex/blob/12d1cb1b1611cc6a480b72b318fe7407ef2d7305/vendor/github.com/prometheus/prometheus/tsdb/head.go#L1319-L1357

disfluxly commented 3 years ago

A single timestamp can't be both duplicate and out of order.

@bboreham - What about something like this sequence? Data integrity issues aside, obviously, I'm curious what Cortex error would come up with something like this:

cpu_usage{"label"="A"} 0.5 1623939771029
cpu_usage{"label"="A"} 0.7 1623939771090
cpu_usage{"label"="A"} 0.6 1623939771029

Would that third sample be: "out of order" or "duplicate timestamp with new value" ?

Likewise, if for some reason the same metric(s) got processed twice:

cpu_usage{"label"="A"} 0.5 1623939771029
cpu_usage{"label"="A"} 0.7 1623939771090
cpu_usage{"label"="A"} 0.5 1623939771029
cpu_usage{"label"="A"} 0.7 1623939771090

What would the third & fourth samples be?

bboreham commented 3 years ago

Would that third sample be: "out of order" or "duplicate timestamp with new value" ?

Out of order. It only keeps the last sample around for comparison; everything else is highly compressed and unavailable unless you run a query.

danfromtitan commented 3 years ago

Came across "err: out of order sample" in the ingester following a Cortex maintenance window. During this time Prometheus was denied access to Cortex, with the expectation that after maintenance, and within 2 hours, Prometheus would restore the missing values from its WAL.

It looks like for the 5 minutes after Prometheus stops sending, Cortex replicates the last known value in the time series, effectively filling in bogus data.

After Prometheus connectivity is restored, Cortex fills in the later values that were missing, but the 5 minutes of auto-generated values remain and collide with the real values sent by Prometheus, hence the "err: out of order sample".

This is all illustrated in the screenshot attached to the original comment.

I wonder if someone can shed some light on why Cortex maintains an interval of bogus values. If that is to approximate possible missing values in the time series, why aren't the values coming from Prometheus allowed to overwrite them after the fact?

bboreham commented 3 years ago

Filling in missing values for up to 5 minutes is standard PromQL behaviour (https://www.robustperception.io/staleness-and-promql). No "bogus data" is written to the store; this happens at query time.

It is interesting to ponder why Prometheus did not supply values during that time; perhaps there are clues in your Prometheus logs.

(Please open a new issue if you want to continue the conversation)

danfromtitan commented 3 years ago

Thanks for the insights; you are right, the Cortex query now shows the increasing pattern instead of a flat value. I guess there were cached query results lingering right after recovery that led me to believe otherwise.