Stackdriver / stackdriver-prometheus-sidecar

A sidecar for the Prometheus server that can send metrics to Stackdriver.
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

Sidecar stopped submitting stats to StackDriver abruptly within minutes of start even as Prometheus has all metrics #272

Closed. varun-krishna closed this issue 3 years ago.

varun-krishna commented 3 years ago

We have a stackdriver-prometheus-sidecar container running alongside Prometheus on our RKE cluster. Prometheus keeps collecting data without issues, but the sidecar abruptly stopped submitting data to Stackdriver within minutes of starting. We can manually query the data from Prometheus without problems; there is no missing data.
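
For reference, the sidecar runs as an extra container in the Prometheus pod, following the documented pattern. The snippet below is only a sketch with placeholder values, not our exact manifest:

    # Sketch of the sidecar container in the Prometheus pod spec (placeholder values).
    - name: stackdriver-prometheus-sidecar
      image: gcr.io/stackdriver-prometheus/stackdriver-prometheus-sidecar:${SIDECAR_VERSION}
      args:
        - --stackdriver.project-id=${GCP_PROJECT}
        - --prometheus.wal-directory=/prometheus/wal
        - --stackdriver.kubernetes.location=${LOCATION}
        - --stackdriver.kubernetes.cluster-name=${CLUSTER_NAME}
      volumeMounts:
        - name: prometheus-storage   # shares the Prometheus data volume so the WAL is readable
          mountPath: /prometheus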

Prometheus Logs

level=info ts=2021-02-05T06:30:28.372Z caller=main.go:302 msg="No time or size retention was set so using the default time retention" duration=15d
level=info ts=2021-02-05T06:30:28.372Z caller=main.go:337 msg="Starting Prometheus" version="(version=2.19.0, branch=HEAD, revision=5d7e3e970602c755855340cb190a972cebdd2ebf)"
level=info ts=2021-02-05T06:30:28.373Z caller=main.go:338 build_context="(go=go1.14.4, user=root@d4cf5c7e268d, date=20200609-10:29:59)"
level=info ts=2021-02-05T06:30:28.373Z caller=main.go:339 host_details="(Linux 5.4.0-54-generic #60-Ubuntu SMP Fri Nov 6 10:37:59 UTC 2020 x86_64 prometheus-deployment-7bf4865cc7-6gqj8 (none))"
level=info ts=2021-02-05T06:30:28.373Z caller=main.go:340 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2021-02-05T06:30:28.373Z caller=main.go:341 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2021-02-05T06:30:28.376Z caller=main.go:678 msg="Starting TSDB ..."
level=info ts=2021-02-05T06:30:28.376Z caller=web.go:524 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2021-02-05T06:30:28.383Z caller=head.go:645 component=tsdb msg="Replaying WAL and on-disk memory mappable chunks if any, this may take a while"
level=info ts=2021-02-05T06:30:28.384Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
level=info ts=2021-02-05T06:30:28.384Z caller=head.go:709 component=tsdb msg="WAL replay completed" duration=1.003615ms
level=info ts=2021-02-05T06:30:28.387Z caller=main.go:694 fs_type=EXT4_SUPER_MAGIC
level=info ts=2021-02-05T06:30:28.387Z caller=main.go:695 msg="TSDB started"
level=info ts=2021-02-05T06:30:28.388Z caller=main.go:799 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2021-02-05T06:30:28.394Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2021-02-05T06:30:28.406Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2021-02-05T06:30:28.423Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2021-02-05T06:30:28.426Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2021-02-05T06:30:28.428Z caller=kubernetes.go:253 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2021-02-05T06:30:28.430Z caller=main.go:827 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2021-02-05T06:30:28.430Z caller=main.go:646 msg="Server is ready to receive web requests."

Sidecar Logs

level=debug ts=2021-02-05T06:34:36.044Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=debug ts=2021-02-05T06:34:38.599Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=debug ts=2021-02-05T06:34:38.912Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=debug ts=2021-02-05T06:34:43.857Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=606.8021371305438 samplesOut=609.5620942240699 samplesOutDuration=2.971254286683822e+08 timePerSample=487440.7898453761 sizeRate=24130.800841719887 offsetRate=30705.16034695124 desiredShards=0.5861631819054262 
level=debug ts=2021-02-05T06:34:43.857Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.5861631819054262 upperBound=1.1 
level=debug ts=2021-02-05T06:34:44.026Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=debug ts=2021-02-05T06:34:44.027Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=debug ts=2021-02-05T06:34:44.158Z caller=client.go:202 component=storage msg="Partial failure calling CreateTimeSeries" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[4-7]" 
level=debug ts=2021-02-05T06:34:44.159Z caller=client.go:213 component=storage summary="total_point_count:200 success_point_count:196 errors:<status:<code:9 > point_count:4 > " 
level=warn ts=2021-02-05T06:34:44.159Z caller=queue_manager.go:534 component=queue_manager msg="Unrecoverable error sending samples to remote storage" err="rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: timeSeries[4-7]" 
level=debug ts=2021-02-05T06:34:44.687Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=debug ts=2021-02-05T06:34:44.687Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=debug ts=2021-02-05T06:34:45.764Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=debug ts=2021-02-05T06:34:45.765Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=debug ts=2021-02-05T06:34:45.836Z caller=series_cache.go:408 component="Prometheus reader" msg="metadata not found" metric_name=scrape_series_added 
level=warn ts=2021-02-05T06:34:45.849Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:34:45.870Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:34:45.911Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:34:45.992Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:34:46.153Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:34:46.474Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:34:47.115Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:34:48.396Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:34:50.957Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:34:54.958Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=debug ts=2021-02-05T06:34:58.857Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=535.7883763711017 samplesOut=537.9963420459226 samplesOutDuration=2.6106632576137245e+08 timePerSample=485256.6929518049 sizeRate=21518.587340042577 offsetRate=25192.16827756099 desiredShards=0.47529673784104187 
level=debug ts=2021-02-05T06:34:58.857Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.47529673784104187 upperBound=1.1 
level=warn ts=2021-02-05T06:34:58.958Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=debug ts=2021-02-05T06:35:13.857Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=428.71070109688134 samplesOut=430.46374030340473 samplesOutDuration=2.104666592224313e+08 timePerSample=488930.0526778112 sizeRate=20049.21653870073 offsetRate=20153.734622048793 desiredShards=0.3168601665758903 
level=debug ts=2021-02-05T06:35:13.857Z caller=queue_manager.go:317 component=queue_manager msg=QueueManager.updateShardsLoop lowerBound=0.7 desiredShards=0.3168601665758903 upperBound=1.1 
level=warn ts=2021-02-05T06:35:14.961Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:35:18.964Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:35:22.965Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=warn ts=2021-02-05T06:35:26.966Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\"" 
level=debug ts=2021-02-05T06:35:28.857Z caller=queue_manager.go:306 component=queue_manager msg=QueueManager.calculateDesiredShards samplesIn=342.96856087750507 samplesOut=344.3843255760571 samplesOutDuration=1.6922181129794502e+08 timePerSample=491374.893485309 sizeRate=16976.546564293916 offsetRate=16122.987697639033 desiredShards=0.26617199471825403 

Any insights here would be really helpful.

varun-krishna commented 3 years ago

Hey friends, @jkohen @fabxc @StevenYCChou @qingling128 @nmamadeo Any pointers here would be really helpful. This issue is blocking one of our releases, as no metrics are being shipped to Stackdriver.

qingling128 commented 3 years ago

Hi @varun-krishna, thanks for providing the feedback. The Stackdriver Prometheus sidecar is designed for GKE use cases; unfortunately, RKE is not a platform we support. If the issue is reproducible on GKE, feel free to reopen this bug with the reproduction steps (e.g. GKE version, sidecar version, yaml config, etc.), so we can take a closer look.

ethai commented 2 years ago

@qingling128 I'm running this sidecar in GKE and hitting this exact issue:

level=warn ts=2021-08-28T04:43:19.878Z caller=manager.go:247 component="Prometheus reader" msg="Failed to build sample" err="get series information: unexpected metric name suffix \"_bucket\""

Let me know what other info you may need in order to assist.

Running sidecar version 0.8.2.

I attempted to start the sidecar with these filter args, as a shot in the dark:

    - --include={job=~"kubernetes-pods|kubernetes-service-endpoints|kubernetes-service-endpoints-slow|kubernetes-pods-slow"}
    - --log.level=debug
    - --include=metric_name{label!~".*bucket.*"}
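
If I'm reading the sidecar README correctly, another thing I may try is passing a config file via --config-file with static metadata for any non-histogram metrics whose names happen to end in _bucket, so the reader stops rejecting them. This is only a sketch; the metric name below is made up for illustration:

    # Hypothetical config file mounted into the sidecar, e.g. /etc/sidecar/config.yaml.
    # "my_app_latency_bucket" is an illustrative name, not one of our real metrics.
    static_metadata:
      - metric: my_app_latency_bucket
        type: gauge          # declare the actual type so the "_bucket" suffix isn't treated as a histogram series
        value_type: double
        help: Placeholder description for an illustrative metric

Not sure whether that addresses the root cause, so any guidance on the right fix is appreciated.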