Closed amckinley closed 4 years ago
Thanks for the report. The kernel is not part of the distributed Docker images, and is completely under your control.
Based on https://github.com/golang/go/wiki/LinuxKernelSignalVectorBug, there is a possibility that increasing the amount of memory that the program can lock (ulimit -l) will resolve the problem. Another possibility is to use GODEBUG=asyncpreemptoff=1, which however doesn't fix the problem, only makes it less likely to occur.
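For illustration, here is a minimal Python sketch of both workarounds as they might look in a wrapper script. The helper names (`with_async_preempt_off`, `memlock_limit`) are made up for this example and are not part of Cortex or Go. It relies on GODEBUG being a comma-separated list of name=value settings.

```python
import resource


def with_async_preempt_off(env):
    """Return a copy of env with asyncpreemptoff=1 set in GODEBUG.

    Other GODEBUG settings are preserved; an existing asyncpreemptoff
    entry is replaced.
    """
    new_env = dict(env)
    settings = [s for s in new_env.get("GODEBUG", "").split(",") if s]
    settings = [s for s in settings if not s.startswith("asyncpreemptoff=")]
    settings.append("asyncpreemptoff=1")
    new_env["GODEBUG"] = ",".join(settings)
    return new_env


def memlock_limit():
    """Return the (soft, hard) locked-memory limit of this process in bytes.

    This is the limit that `ulimit -l` shows (the shell reports it in KiB).
    resource.RLIM_INFINITY means "unlimited".
    """
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)
```

The returned environment could then be passed to something like `subprocess.run(..., env=with_async_preempt_off(os.environ))` when starting the Go binary. In Kubernetes you would more likely set GODEBUG as an environment variable in the pod spec.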
If possible, updating the Linux kernel may also be a solution.
The bug was fixed in Linux kernel versions 5.3.15, 5.4.2, and 5.5 and later.
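As a rough sketch, the version list above can be turned into a small check. The function names are invented for this example. It only tells you whether a version string is at or past one of the fixed releases listed here, so a `False` result for a much older kernel does not by itself mean that kernel is affected.

```python
import re


def parse_kernel_version(release):
    """Extract (major, minor, patch) from a string such as '5.3.15-1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel version: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def has_signal_vector_fix(release):
    """True if the version is 5.3.15+, 5.4.2+, or 5.5 and later."""
    version = parse_kernel_version(release)
    if version >= (5, 5, 0):
        return True
    if version[:2] == (5, 4):
        return version >= (5, 4, 2)
    if version[:2] == (5, 3):
        return version >= (5, 3, 15)
    return False
```

Be careful which string you pass in. On Debian, for example, `uname -r` prints the ABI name (such as 5.3.0-3-amd64), while the upstream version (such as 5.3.15-1) appears in the version field of `uname -a`.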
Oh duh; my mistake. We'll look into upgrading the kernel on the k8s nodes hosting our Cortex pods, or try your other suggestions if we can't accomplish that. Assuming there's nothing else to look at on your end, feel free to close this out.
Actually, I'm now convinced that the message at the bottom of the stack trace is unrelated. We're running a kernel version that shouldn't have this issue: 5.3.0-3-amd64 #1 SMP Debian 5.3.15-1 (2019-12-07) x86_64 GNU/Linux
I think this is actually a bug in the Prometheus version that Cortex is using, which was fixed by this commit: https://github.com/prometheus/prometheus/commit/d30492cbb0ec781811e9cbc1c7fb4603b3e33606#diff-6fb13507bfadf9819fea5bda61f599e6
Is it difficult to update Cortex's dependency on Prometheus? I think you at least need to cherry-pick that^^ one fix.
That commit fixes the case when the WAL is corrupted, and some WAL segments have been deleted as a result. You should see a log message about that.
Cortex tracks Prometheus master quite closely, ~so this bugfix will be part of Cortex very soon. (Our last Prometheus update to latest master was 5 days ago, bugfix got merged the day after)~
Actually, this bugfix is already in Cortex master.
I guess we can consider this issue closed by https://github.com/cortexproject/cortex/pull/2902, but please feel free to re-open it if that's not the case. Thanks!
Does Cortex usually cherry-pick important fixes like this into the releases, or do we just have to wait for another release to be cut? (We'd prefer not to deploy off master here.)
The fix for this issue (which definitely seems to be the Prometheus commit linked above) is not present in 1.2.0 and is only available on master.
Cortex doesn’t cherry-pick bugfixes for experimental features into existing release branches.
Release process for Cortex 1.3.0 should start sometime next week, with final release likely the week after.
Note that there is a way to get rid of this problem: remove the chunks_head directory inside the ingester, in the per-tenant TSDB directory. It is safe to delete this directory while the ingester is NOT running. No data is lost in the process.
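A minimal sketch of that cleanup in Python, assuming a layout of one subdirectory per tenant under a TSDB root, each possibly containing chunks_head. The function name and the `tsdb_root` argument are hypothetical, so check the actual path configured for your ingesters. It defaults to a dry run that only lists what it would delete.

```python
import shutil
from pathlib import Path


def remove_chunks_head(tsdb_root, dry_run=True):
    """Find chunks_head in each per-tenant directory under tsdb_root.

    With dry_run=True nothing is deleted; the matching paths are only
    returned. Run with dry_run=False only while the ingester is stopped.
    """
    matches = []
    for tenant_dir in sorted(Path(tsdb_root).iterdir()):
        target = tenant_dir / "chunks_head"
        if target.is_dir():
            if not dry_run:
                shutil.rmtree(target)
            matches.append(target)
    return matches
```

Calling `remove_chunks_head("/path/to/tsdb")` first and reviewing the returned list before repeating with `dry_run=False` keeps the operation easy to verify.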
Just observed this in our Cortex development cluster. I'm guessing this is an issue with the kernel version that's being distributed with the cortexproject/cortex:v1.2.0 docker image. Running Cortex v1.2.0 on Kubernetes, as created by Grafana jsonnet libs. Full cortex config: