Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

Tailing WAL failed #160

Closed castlemilk closed 4 years ago

castlemilk commented 5 years ago

This is specifically broken on the 0.5.1 release (as far as I have tested so far); 0.4.1 works.

I believe there's an issue relating to how the path for the checkpoint file is being resolved. The culprit is the following:

https://github.com/Stackdriver/stackdriver-prometheus-sidecar/blob/3e6c59f370d26797abbcbb36d9684270c72a0e3b/tail/tail.go#L65

Why are we joining dir and cpdir?

We set something like --prometheus.wal-directory=/prometheus/wal, which becomes dir, and cpdir is returned from tsdb.LastCheckpoint(dir), which returns the full path of the checkpoint it finds. So we end up concatenating the following:

/prometheus/wal + /prometheus/wal/checkpoint.002016
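To illustrate the concatenation described above, here is a minimal Go sketch. It assumes the join is done with filepath.Join, which does not discard earlier elements when a later one is absolute. The function names buggyCheckpointPath and resolveCheckpointPath are hypothetical and are not taken from the sidecar code; the second one is only one possible way to guard against the doubling, not the project's actual fix.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// buggyCheckpointPath mirrors the suspected behaviour (an assumption):
// it joins the WAL directory with a checkpoint path that is already absolute.
func buggyCheckpointPath(dir, cpdir string) string {
	return filepath.Join(dir, cpdir)
}

// resolveCheckpointPath is a hypothetical helper, not from the sidecar:
// it joins only when cpdir is a bare name relative to dir.
func resolveCheckpointPath(dir, cpdir string) string {
	if filepath.IsAbs(cpdir) {
		return filepath.Clean(cpdir)
	}
	return filepath.Join(dir, cpdir)
}

func main() {
	dir := "/prometheus/wal"
	cpdir := "/prometheus/wal/checkpoint.002016"

	// Doubled prefix, matching the path reported in the error.
	fmt.Println(buggyCheckpointPath(dir, cpdir))
	// Absolute checkpoint path is used as-is.
	fmt.Println(resolveCheckpointPath(dir, cpdir))
	// A bare checkpoint name is still joined onto dir.
	fmt.Println(resolveCheckpointPath(dir, "checkpoint.002016"))
}
```

On a Unix-like system the first call yields /prometheus/wal/prometheus/wal/checkpoint.002016, while both calls to the guarded helper yield /prometheus/wal/checkpoint.002016.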

I'm getting the following errors in this scenario:

level=info ts=2019-09-02T21:52:23.197Z caller=main.go:303 msg="Starting Stackdriver Prometheus sidecar" version="(version=, branch=, revision=)"
level=info ts=2019-09-02T21:52:23.197Z caller=main.go:304 build_context="(go=go1.12, user=, date=)"
level=info ts=2019-09-02T21:52:23.197Z caller=main.go:305 host_details="(Linux 4.14.127+ #1 SMP Tue Jun 18 23:08:40 PDT 2019 x86_64 prometheus-777fd6c946-hrchs (none))"
level=info ts=2019-09-02T21:52:23.197Z caller=main.go:306 fd_limits="(soft=1048576, hard=1048576)"
level=error ts=2019-09-02T21:52:23.211Z caller=main.go:394 msg="Tailing WAL failed" err="open checkpoint: list segment in dir:/prometheus/wal/prometheus/wal/checkpoint.002016: open /prometheus/wal/prometheus/wal/checkpoint.002016: no such file or directory"

Can someone advise whether I'm just configuring this incorrectly?

The configuration I'm using for the sidecar is as follows:

- name: sidecar
  image: gcr.io/xxx-xxx-xxx/stackdriver-prometheus/stackdriver-prometheus-sidecar:0.5.1
  imagePullPolicy: Always
  args:
  - "--stackdriver.project-id=xxx-xxx-xxx"
  - "--prometheus.wal-directory=/prometheus/wal"
  - "--stackdriver.kubernetes.location=x-xxx1"
  - "--stackdriver.kubernetes.cluster-name=xxx-xxx-xxx"
  - "--stackdriver.generic.location=xxx-xxx-xxx"
  ports:
  - name: sidecar
    containerPort: 9091
  volumeMounts:
  - name: storage-volume
    mountPath: /prometheus

mans0954 commented 5 years ago

I've just encountered the same issue:

kubectl logs -n stackdriver gke-01-prometheus-server-6f87676b88-2mkbp -c sidecar
level=info ts=2019-09-03T10:22:10.885Z caller=main.go:303 msg="Starting Stackdriver Prometheus sidecar" version="(version=, branch=, revision=)"
level=info ts=2019-09-03T10:22:10.885Z caller=main.go:304 build_context="(go=go1.12.9, user=, date=)"
level=info ts=2019-09-03T10:22:10.885Z caller=main.go:305 host_details="(Linux 4.14.127+ #1 SMP Tue Jun 18 18:32:10 PDT 2019 x86_64 gke-01-prometheus-server-6f87676b88-2mkbp (none))"
level=info ts=2019-09-03T10:22:10.885Z caller=main.go:306 fd_limits="(soft=1048576, hard=1048576)"
level=error ts=2019-09-03T10:22:10.890Z caller=main.go:394 msg="Tailing WAL failed" err="open checkpoint: list segment in dir:/data/wal/data/wal/checkpoint.000041: open /data/wal/data/wal/checkpoint.000041: no such file or directory"

Config:

  - args:
    - --stackdriver.project-id=xxx-xxx-xxx
    - --prometheus.wal-directory=/data/wal
    - --include={__name__=~".+",job="kubernetes-pods"}
    - --include={__name__=~".+",job="gcp-clsi"}
    imagePullPolicy: Always
    name: sidecar
    ports:
    - containerPort: 9091
      name: sidecar
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /data
      name: storage-volume
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: gke-01-prometheus-server-token-hdt4z
      readOnly: true

mans0954 commented 5 years ago

Looks like it was changed in this commit: https://github.com/Stackdriver/stackdriver-prometheus-sidecar/commit/d8e80a47d54eb2fdd95e51c0ac65066cf1471350#diff-ebda32def4aa9fd929c691ab34ebcc47R62