AcalephStorage / consul-alerts

A simple daemon to send notifications based on Consul health checks
GNU General Public License v2.0

panic: runtime error: invalid memory address or nil pointer dereference (check-handler.go:152) #135

Open rhuddleston opened 8 years ago

rhuddleston commented 8 years ago

Getting this panic on several different instances of consul-alerts:

time="2016-07-20T22:47:36Z" level=info msg="10.0.2.212::Serf Health Status is pending status change from passing to critical for 996.940431ms."
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x502520]

goroutine 24 [running]:
panic(0x99b5a0, 0xc820010080)
        /usr/local/go/src/runtime/panic.go:481 +0x3e6
github.com/AcalephStorage/consul-alerts/consul.(*ConsulAlertClient).updateHealthCheck(0xc82019e090, 0xc820077cc0, 0x9f, 0xc8204c5200)
        /vagrant/src/github.com/AcalephStorage/consul-alerts/consul/client.go:462 +0x13c0
github.com/AcalephStorage/consul-alerts/consul.(*ConsulAlertClient).UpdateCheckData(0xc82019e090)
        /vagrant/src/github.com/AcalephStorage/consul-alerts/consul/client.go:277 +0x718
main.(*CheckProcessor).handleChecks(0xc820199d40, 0xc82038c000, 0x1b, 0x1c)
        /vagrant/consul-alerts/check-handler.go:96 +0x3b7
main.(*CheckProcessor).start(0xc820199d40)
        /vagrant/consul-alerts/check-handler.go:28 +0xf9
created by main.startCheckProcessor
        /vagrant/consul-alerts/check-handler.go:152 +0x111

I checked check-handler.go and it has had no updates since the build I'm currently on.
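
I haven't dug into client.go:462 itself, but with the Consul Go API this kind of trace usually means the KVPair returned by KV().Get was dereferenced without a nil check. A hypothetical sketch of the pattern (names invented, not the actual consul-alerts code):

package kvexample

import (
	"fmt"

	"github.com/hashicorp/consul/api"
)

// readCheckStatus is an invented example, not consul-alerts code. KV().Get
// returns a nil KVPair both when the key does not exist and when the request
// fails (e.g. "rpc error: No cluster leader" during an election); skipping
// either guard below and calling string(pair.Value) would panic exactly like
// the trace above.
func readCheckStatus(client *api.Client, key string) (string, error) {
	pair, _, err := client.KV().Get(key, nil)
	if err != nil {
		return "", err
	}
	if pair == nil {
		return "", fmt.Errorf("key %q not found", key)
	}
	return string(pair.Value), nil
}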

fusiondog commented 8 years ago

Do you get this on every state change or just occasionally? Does it matter what you set the threshold to? Are you using a custom notifier? Would you mind posting your config?

rhuddleston commented 8 years ago

Just occasionally. Here is the config:

[u'consul-alerts/config/', 0, u'null']
[u'consul-alerts/config/checks/change-threshold', 0, u'20']
[u'consul-alerts/config/checks/enabled', 0, u'true']
[u'consul-alerts/config/notif-profiles/', 0, u'null']
[u'consul-alerts/config/notif-profiles/default', 0, u'{\n  "Interval": 60,\n}']
[u'consul-alerts/config/notifiers/log/enabled', 0, u'false']
[u'consul-alerts/config/notifiers/log/path', 0, u'/tmp/consul-notifications.log']
[u'consul-alerts/config/notifiers/slack/channel', 0, u'stage-cluster']
[u'consul-alerts/config/notifiers/slack/cluster-name', 0, u'stage']
[u'consul-alerts/config/notifiers/slack/detailed', 0, u'true']
[u'consul-alerts/config/notifiers/slack/enabled', 0, u'true']
[u'consul-alerts/config/notifiers/slack/icon-emoji', 0, u':rage:']
[u'consul-alerts/config/notifiers/slack/url', 0, u'https://hooks.slack.com/services/ABC123/ABC123/ABC123']
[u'consul-alerts/leader', 3304740253564472344, None]
[u'consul-alerts/notif-profiles/', 0, u'null']
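
(That's a raw dump of the consul-alerts KV subtree; if anyone wants to reproduce it, the same data can be read with the official Consul Go client, e.g. this minimal sketch, which assumes a local agent reachable via api.DefaultConfig():)

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// List every consul-alerts key, mirroring the dump above.
func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	pairs, _, err := client.KV().List("consul-alerts/", nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pairs {
		fmt.Printf("%s flags=%d value=%q\n", p.Key, p.Flags, string(p.Value))
	}
}
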
ianic commented 8 years ago

I'm also getting the same panic ~once a day.

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x502520]

goroutine 53 [running]:
panic(0x99b5c0, 0xc820010090)
    /usr/lib/go/src/runtime/panic.go:481 +0x3e6
github.com/AcalephStorage/consul-alerts/consul.(*ConsulAlertClient).updateHealthCheck(0xc8201ca020, 0xc820482dc0, 0x3f, 0xc820410b80)
    /go/src/github.com/AcalephStorage/consul-alerts/consul/client.go:462 +0x13c0
github.com/AcalephStorage/consul-alerts/consul.(*ConsulAlertClient).UpdateCheckData(0xc8201ca020)
    /go/src/github.com/AcalephStorage/consul-alerts/consul/client.go:277 +0x718
main.(*CheckProcessor).handleChecks(0xc820070ed0, 0xc82030f500, 0x24, 0x2a)
    /go/src/github.com/AcalephStorage/consul-alerts/check-handler.go:96 +0x3b7
main.(*CheckProcessor).start(0xc820070ed0)
    /go/src/github.com/AcalephStorage/consul-alerts/check-handler.go:28 +0xf9
created by main.startCheckProcessor
    /go/src/github.com/AcalephStorage/consul-alerts/check-handler.go:152 +0x111
time="2016-09-15T14:14:41Z" level=info msg="Checking consul agent connection..."
time="2016-09-15T14:14:41Z" level=info msg="Unable to load custom config, using default instead: Unexpected response code: 500"
time="2016-09-15T14:14:41Z" level=info msg="Consul ACL Token: \"\""
time="2016-09-15T14:14:41Z" level=info msg="Consul Alerts daemon started"
time="2016-09-15T14:14:41Z" level=info msg="Consul Alerts Host: ops2"
time="2016-09-15T14:14:41Z" level=info msg="Consul Agent: 10.50.2.30:8500"
time="2016-09-15T14:14:41Z" level=info msg="Consul Datacenter: f1"
time="2016-09-15T14:14:41Z" level=info msg="Started Consul-Alerts API"
time="2016-09-15T14:14:41Z" level=info msg="Running for leader election..."
2016/09/15 14:14:41 consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (rpc error: No cluster leader), retry in 5s
time="2016-09-15T14:14:41Z" level=info msg="Unable to load custom config, using default instead: Unexpected response code: 500"
time="2016-09-15T14:14:41Z" level=info msg="Now watching for events."
time="2016-09-15T14:14:46Z" level=info msg="Now watching for health changes."
time="2016-09-15T14:14:51Z" level=info msg="Running for leader election..."

This is connected with changes in Consul cluster leadership. Notice 'Unexpected response code: 500 (rpc error: No cluster leader)' in the log above, when consul-alerts restarts after the panic. I'm running a cluster of 3 Consul servers which occasionally (because of Consul's sensitivity to network conditions) changes leader. While the Consul cluster is changing leaders, the application is unable to get the lock on the key in the Consul KV store.
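
A possible fix on the consul-alerts side would be to check the error and nil result from the KV read and retry while there is no leader, instead of dereferencing it. A rough sketch (invented helper, not actual consul-alerts code):

package retryexample

import (
	"fmt"
	"time"

	"github.com/hashicorp/consul/api"
)

// getWithRetry retries the KV read while the cluster has no leader instead
// of letting the caller dereference a nil result.
func getWithRetry(kv *api.KV, key string, attempts int) (*api.KVPair, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		pair, _, err := kv.Get(key, nil)
		if err == nil && pair != nil {
			return pair, nil
		}
		if err == nil {
			// Key genuinely absent; nothing to retry.
			return nil, fmt.Errorf("key %q not found", key)
		}
		// e.g. "Unexpected response code: 500 (rpc error: No cluster leader)"
		lastErr = err
		time.Sleep(5 * time.Second)
	}
	return nil, lastErr
}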

rhuddleston commented 8 years ago

I'm seeing the same as the above:

2016/09/23 12:47:58 [ERR] http: Request GET /v1/kv/consul-alerts/config/checks/blacklist/single/ecs-53301235/ecs-53301235.aor1.centricient.prod:ecs-dashboard-2-dashboard-90b6c3a5d4dcad991f00:41100/service:ecs-53301235.aor1.centricient.prod:ecs-dashboard-2-dashboard-90b6c3a5d4dcad991f00:41100?dc=us-west-2&token=%22%22, error: rpc error: No cluster leader from=127.0.0.1:46952
2016/09/23 12:47:58 [ERR] http: Request GET /v1/kv/consul-alerts/checks/ecs-53301235/ecs-53301235.aor1.centricient.prod:ecs-dashboard-2-dashboard-90b6c3a5d4dcad991f00:41100/service:ecs-53301235.aor1.centricient.prod:ecs-dashboard-2-dashboard-90b6c3a5d4dcad991f00:41100?dc=us-west-2&token=%22%22, error: rpc error: No cluster leader from=127.0.0.1:48054

When it gets the "No cluster leader" error, consul-alerts crashes.

rhuddleston commented 7 years ago

Any update here? We're getting into this state very frequently on our servers.