Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus

some metrics not being updated #114

Closed: oliver-neubauer-ecobee closed this issue 4 years ago

oliver-neubauer-ecobee commented 5 years ago

Hi, I have configured stackdriver-prometheus sidecar with the following args:

args:
  - --stackdriver.project-id=<redacted>
  - --prometheus.wal-directory=/prometheus/wal
  - --stackdriver.kubernetes.location=us-central1
  - --stackdriver.kubernetes.cluster-name=<redacted>
  - --filter=stackdriver_export="true"

For the purposes of testing, I've labelled two metrics with stackdriver_export="true":

external.googleapis.com/prometheus/kube_pod_container_status_waiting_reason
external.googleapis.com/prometheus/node_load1
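For reference, the stackdriver_export label is attached on the Prometheus side. A minimal sketch of how that can be done (illustrative only; my actual rules are generated by the operator and may differ) is a metric_relabel_configs entry on the scrape job:

metric_relabel_configs:
  - source_labels: [__name__]
    regex: kube_pod_container_status_waiting_reason|node_load1
    target_label: stackdriver_export
    replacement: "true"

The sidecar's --filter flag should then only forward series that carry this label.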

If I query Prometheus, I can see time series data like:

kube_pod_container_status_waiting_reason{container="example-app",endpoint="http",instance="10.20.3.12:8080",job="kube-state-metrics",namespace="default",pod="example-app-b7fbf9fd9-8fnm8",reason="ImagePullBackOff",service="prom-kube-state-metrics",stackdriver_export="true"}   1

^ There are metrics like these for every pod.

But when I query Stackdriver directly, even though the metric type itself has been created correctly, I only see metrics for kube-state-metrics "pods", and the labels seem to indicate a "pod_name" that doesn't even exist. There are also only a few of them, and none of them indicate pods in a waiting state, even though I have forced some into such a state and can see the metrics in Prometheus. Example:

Metric: external.googleapis.com/prometheus/kube_pod_container_status_waiting_reason
Label: container=kube-state-metrics
Label: reason=ImagePullBackOff
Label: stackdriver_export=true
Resource: k8s_container
Label: container_name=kube-state-metrics
Label: namespace_name=monitor
Label: location=us-central1
Label: project_id=<redacted>
Label: pod_name=prom-op-kube-state-metrics-76786cc9b4-dgph9
Label: cluster_name=<redacted>
Point: [1554410721-1554410721] = 0
Point: [1554410661-1554410661] = 0
Point: [1554410601-1554410601] = 0
... (57 more points at 60-second intervals, down to [1554407181-1554407181], all with value 0)

I feel like I'm running into some kind of label-rewrite issue. Can anyone shed some light on why the Prometheus metrics are either not being exported properly or not being ingested properly? There are no log messages from the sidecar itself to indicate a problem.

jkohen commented 5 years ago

Oliver, thanks for the report. I created an internal ticket to track this issue, and we'll follow up with you. Please follow our troubleshooting guide: https://cloud.google.com/monitoring/kubernetes-engine/prometheus#prometheus_integration_issues

Can you share the pod spec for Prometheus and the stackdriver-prometheus-sidecar, as well as the Prometheus Server configuration, so we can better assist you?

oliver-neubauer-ecobee commented 5 years ago

Hi, and thanks for the assist. The Prometheus deployment is done via a fairly vanilla Prometheus Operator. Config: https://pastebin.com/xjZDswjC. As you might notice, the operator does a lot of relabeling; it's possible it is clobbering something it shouldn't, though it does seem to be emitting job and instance labels.

The pod description is here: https://pastebin.com/SZqjPipa

The config mentioned is currently:

 cat /etc/stackdriver/sd_sidecar-cfg.yaml 
static_metadata:
  - metric: kube_pod_container_status_waiting_reason
    type: gauge

I only added that to troubleshoot after noticing the problems. I've turned on debug logging, but there's nothing interesting there; lots of messages like this, but that's it:

level=debug ts=2019-04-05T15:20:52.079709947Z caller=queue_manager.go:318 component=queue_manager msg=QueueManager.caclulateDesiredShards samplesIn=0.14180953951198255 samplesOut=0.14543252968052636 samplesOutDuration=1.6101762665962696e+06 timePerSample=1.1071637618718218e+07 sizeRate=7806.8179571467535 offsetRate=7838.416706178706 desiredShards=0.0023693367774836665
level=debug ts=2019-04-05T15:20:52.079881565Z caller=queue_manager.go:329 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.0023693367774836665 upperBound=1.1

oliver-neubauer-ecobee commented 5 years ago

As a possible additional data point, I have a metric that is just flat-out not showing up in Stackdriver.

kube_pod_container_resource_limits{container="fluentd-gcp",endpoint="http",instance="10.20.3.12:8080",job="kube-state-metrics",namespace="kube-system",node="gke-chronos-k8s-prom-dev-default-pool-9924e3fe-69gt",pod="fluentd-gcp-v3.1.1-wtzg2",resource="memory",service="prom-kube-state-metrics",stackdriver_export="true",unit="byte"} 524288000
jkohen commented 5 years ago

If that metric is being sent to Stackdriver with all those labels, it should be getting rejected, because you have more than 10 labels (11 here), and the sidecar should be logging about it. Fortunately, most of the labels are redundant with the ones we use to build the MonitoredResource (container, pod, etc.), so once we resolve the main issue you are having, it should be easy to make the metrics conform.

Someone else will be handling this issue on our side. Thanks for the additional information.

oliver-neubauer-ecobee commented 5 years ago

Ah, I was unaware of the 10-label limit! Thank you. Being able to trim labels with the stackdriver-prometheus-sidecar would be a great feature. In the meantime, I'll need to see how to get the autogenerated configs to drop the extra/unneeded labels.
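If I understand the Prometheus Operator setup correctly, one way to trim labels before they ever reach the sidecar would be a labeldrop rule in the ServiceMonitor's metricRelabelings (a sketch; the label names to drop are just examples):

metricRelabelings:
  - action: labeldrop
    regex: endpoint|service

The labels used to build the MonitoredResource (job, instance, namespace, pod, container) would of course need to be kept.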

StevenYCChou commented 5 years ago

> Hi, and thanks for the assist. The Prometheus deployment is done via a fairly vanilla Prometheus Operator. Config: https://pastebin.com/xjZDswjC. As you might notice, the operator does a lot of relabeling; it's possible it is clobbering something it shouldn't, though it does seem to be emitting job and instance labels.
>
> The pod description is here: https://pastebin.com/SZqjPipa

Hi Oliver,

Can you paste the config and pod description above again? Those two links are no longer available, and I'm sorry I didn't capture them when you posted them.

I also wonder whether you will be able to limit the extra/unneeded labels on your side?

oliver-neubauer-ecobee commented 5 years ago

Hi Steven, I've been unable to limit the labels, unfortunately, though I also haven't put too much time into it yet. To limit metrics ingestion I actually need to add a label for the sidecar to filter on, so the ability to trim extra labels at the sidecar level would be a handy feature.

The new links are:
Prometheus config: https://pastebin.com/5abMA2uH
Pod description (from describe): https://pastebin.com/Rgx43KdD
Pod YAML: https://pastebin.com/8ACcHr0q

StevenYCChou commented 5 years ago

Thanks @oliver-neubauer-ecobee for the information. I just added your 3 files into our internal tracking ticket, and we will follow up with you.

knyar commented 5 years ago

Sorry for the drive-by comment, but I think an easy way to decrease the number of labels might be to use recording rules. For example, you could do something like this:

record: kube_pod_container_memory_limit_bytes
expr: sum(kube_pod_container_resource_limits{resource="memory",unit="byte"}) without (resource, unit)

Note that you will need to add static_metadata configuration for recording rules to be exported to Stackdriver.
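To make that concrete, a minimal sketch (group name and file layout are illustrative) is a Prometheus rule group:

groups:
  - name: stackdriver-export
    rules:
      - record: kube_pod_container_memory_limit_bytes
        expr: sum(kube_pod_container_resource_limits{resource="memory",unit="byte"}) without (resource, unit)

plus a matching entry in the sidecar's config file so it knows the type of the recorded series:

static_metadata:
  - metric: kube_pod_container_memory_limit_bytes
    type: gauge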

StevenYCChou commented 4 years ago

Hi @oliver-neubauer-ecobee - just want to follow up on this.

Will what knyar@ suggested in https://github.com/Stackdriver/stackdriver-prometheus-sidecar/issues/114#issuecomment-486161550 mitigate the issue you faced? I would also recommend aggregating labels away.

StevenYCChou commented 4 years ago

Hi @oliver-neubauer-ecobee,

Since I haven't heard back from you in one week, I'm going to close this issue.

Thanks for the initial report and the follow-ups, and please reopen the issue if you need further assistance on this topic.