Stackdriver / stackdriver-prometheus

Prometheus support for Stackdriver
https://cloud.google.com/monitoring/kubernetes-engine/prometheus
Apache License 2.0

release-0.4.1 crashed #5

Closed jiang-wei closed 6 years ago

jiang-wei commented 6 years ago

What did you do? Ran Prometheus in GKE 1.10. What did you expect to see?

What did you see instead? Under which circumstances? Sometimes it works fine, but after a while it crashes and GKE restarts it.

The same thing happens with release-0.4.2 as well.

I suspect there are some malformed metric pages that Prometheus does not parse well.

Here is the log:

level=info ts=2018-05-24T16:21:11.015022187Z caller=main.go:164 msg="Starting Stackdriver Prometheus" version="(version=0.4.1, branch=release-0.4.1, revision=4c0cbc9246f4c761b05d82baf6222091315fa3bb)"
level=info ts=2018-05-24T16:21:11.015154747Z caller=main.go:165 build_context="(go=go1.9, user=bmoyles@bmoyles-macbookpro.roam.corp.google.com, date=20180424-19:45:39)"
level=info ts=2018-05-24T16:21:11.015184187Z caller=main.go:166 host_details="(Linux 4.4.111+ #1 SMP Thu Feb 1 22:06:37 PST 2018 x86_64 prometheus-558c885b49-ws4wk (none))"
level=info ts=2018-05-24T16:21:11.015213247Z caller=main.go:167 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-05-24T16:21:11.313119833Z caller=web.go:332 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-05-24T16:21:11.314709866Z caller=main.go:402 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-05-24T16:21:11.412186637Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-24T16:21:11.413992788Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-24T16:21:11.512267738Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-24T16:21:11.513078293Z caller=main.go:348 msg="Server is ready to receive requests."
level=info ts=2018-05-24T16:21:11.513157405Z caller=manager.go:58 component="scrape manager" msg="Starting scrape manager..."
level=info ts=2018-05-24T16:21:11.513832398Z caller=main.go:367 msg="Stackdriver client started"

level=debug ts=2018-05-24T16:04:43.708021696Z caller=client.go:148 component=remote msg="sending request to Stackdriver"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x15f8007]

goroutine 214 [running]:
github.com/Stackdriver/stackdriver-prometheus/retrieval.subtractResetValue(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc1, 0x0, 0xc4213cff80)
    /Users/bmoyles/go/src/github.com/Stackdriver/stackdriver-prometheus/retrieval/scrape.go:965 +0x177
github.com/Stackdriver/stackdriver-prometheus/retrieval.(*pointExtractor).UpdateValue(0xc420b571a0, 0xc4209bc550, 0xc4213cff80, 0xc420d3a600, 0xc42142a378, 0x1, 0x1)
    /Users/bmoyles/go/src/github.com/Stackdriver/stackdriver-prometheus/retrieval/scrape.go:893 +0x5a9
github.com/Stackdriver/stackdriver-prometheus/retrieval.(*scrapeLoop).append(0xc420bd59a0, 0xc421512000, 0x6a4c, 0x9ab9, 0xbeb9d687525f322a, 0x1506577f6d, 0x2a57400, 0x0, 0x0, 0x0, ...)
    /Users/bmoyles/go/src/github.com/Stackdriver/stackdriver-prometheus/retrieval/scrape.go:675 +0x56d
github.com/Stackdriver/stackdriver-prometheus/retrieval.(*scrapeLoop).run(0xc420bd59a0, 0xdf8475800, 0x2540be400, 0x0)
    /Users/bmoyles/go/src/github.com/Stackdriver/stackdriver-prometheus/retrieval/scrape.go:599 +0x6a3
created by github.com/Stackdriver/stackdriver-prometheus/retrieval.(*scrapePool).sync
    /Users/bmoyles/go/src/github.com/Stackdriver/stackdriver-prometheus/retrieval/scrape.go:309 +0x2fc
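The leading zero arguments to subtractResetValue in the trace suggest it was called with a nil or zero-valued sample. A minimal Go sketch of the failure mode and a defensive guard, using hypothetical simplified types (Counter, Metric, subtractReset are illustrative, not the real stackdriver-prometheus API):

```go
package main

import "fmt"

// Simplified stand-ins for the Prometheus client_model structs.
type Counter struct{ Value float64 }
type Metric struct{ Counter *Counter }

// subtractReset subtracts a tracked reset (baseline) value from a
// cumulative counter. Without the nil guard, a sample that arrives
// without a Counter field (e.g. an untyped metric) would dereference
// a nil pointer, exactly like the panic in the report.
func subtractReset(m *Metric, reset float64) (float64, error) {
	if m == nil || m.Counter == nil {
		return 0, fmt.Errorf("metric has no counter value")
	}
	return m.Counter.Value - reset, nil
}

func main() {
	v, err := subtractReset(&Metric{Counter: &Counter{Value: 10}}, 3)
	fmt.Println(v, err) // 7 <nil>

	_, err = subtractReset(&Metric{}, 3) // malformed/untyped sample
	fmt.Println(err != nil)              // true: error instead of a panic
}
```

With a guard like this, a bad scrape page would surface as a logged error for one sample rather than crashing the whole process.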

Environment GKE 1.10.2 image: gcr.io/stackdriver-prometheus/stackdriver-prometheus:release-0.4.1

jiang-wei commented 6 years ago

BTW, I'm using:

            <id>prometheus</id>                                                            
            <dependencies>
                <dependency>
                    <groupId>io.prometheus</groupId>
                    <artifactId>simpleclient</artifactId>
                    <version>${prometheus-simpleclient.version}</version>                  
                </dependency>                                                              
                <dependency>
                    <groupId>io.prometheus</groupId>
                    <artifactId>simpleclient_servlet</artifactId>
                    <version>${prometheus-simpleclient.version}</version>                  
                </dependency>                                                              
                <dependency>
                    <groupId>io.prometheus</groupId>
                    <artifactId>simpleclient_dropwizard</artifactId>
                    <version>${prometheus-simpleclient.version}</version>                  
                </dependency>                                                              
            </dependencies>    

with prometheus-simpleclient.version = 0.4.0 to expose metrics.

jiang-wei commented 6 years ago

More info:

It seems some specific metrics may trigger the crash.

promtool check metrics
jvm_buffers_direct_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_buffers_mapped_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_garbage_PS_MarkSweep_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_garbage_PS_Scavenge_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_threads_blocked_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_threads_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_threads_daemon_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_threads_deadlock_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_threads_new_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_threads_runnable_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_threads_terminated_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_threads_timed_waiting_count non-histogram and non-summary metrics should not have "_count" suffix
jvm_threads_waiting_count non-histogram and non-summary metrics should not have "_count" suffix
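The lint rule behind these warnings is that the `_count` suffix is reserved for the implicit series of histogram and summary metrics, so an untyped or gauge metric named e.g. jvm_threads_count gets flagged. A small sketch of that rule (function name and types are illustrative; promtool's real checker lives in the Prometheus repo):

```go
package main

import (
	"fmt"
	"strings"
)

// checkCountSuffix flags metrics that use the reserved "_count" suffix
// without being a histogram or summary, mirroring the promtool output
// shown above. Returns an empty string when the name is acceptable.
func checkCountSuffix(name, metricType string) string {
	if strings.HasSuffix(name, "_count") &&
		metricType != "histogram" && metricType != "summary" {
		return name + ` non-histogram and non-summary metrics should not have "_count" suffix`
	}
	return ""
}

func main() {
	// A Dropwizard-derived gauge like the ones in the report is flagged.
	fmt.Println(checkCountSuffix("jvm_threads_count", "gauge"))
	// The implicit _count series of a histogram passes (prints empty).
	fmt.Println(checkCountSuffix("http_request_duration_seconds_count", "histogram"))
}
```

These warnings alone should not crash a well-behaved ingester, but they do show the scrape pages contain counter-style names on non-counter metrics, which fits the nil-value theory above.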
jkohen commented 6 years ago

I just released version 0.4.3 with PR #8. Thanks for the help!