deis / monitor

Monitoring for Deis Workflow
https://deis.com
MIT License

chore(grafana): Update dashboards for new telegraf #156

Closed jchauncey closed 7 years ago

jchauncey commented 7 years ago

Tags for data collected by the kubernetes plugin changed after the plugin was merged, so we needed to update the dashboards.

Test Steps:

It is also useful to verify that we are not leaking connections with this PR, as it contains the new telegraf binary that includes the fix.

To check for that, ssh onto one of your worker nodes that is running the telegraf daemonset and run the following command: netstat -tan | grep 10255. You should only see 1 or 2 connections open, and that number should never grow.
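The manual check above can be scripted so that repeated samples are easy to compare. This is a minimal sketch, assuming netstat prints the usual column layout (protocol, Recv-Q, Send-Q, local address, foreign address, state); the function name count_established is hypothetical and not part of this repo.

```shell
#!/bin/sh
# Count ESTABLISHED connections whose local address ends in the given port.
# Reads `netstat -tan` output on stdin; the port defaults to 10255.
# Only the local address column is matched, so each connection is counted once.
count_established() {
  port="${1:-10255}"
  awk -v port="$port" '
    # Field 4 is the local address, field 6 is the connection state.
    $6 == "ESTABLISHED" && $4 ~ (":" port "$") { n++ }
    END { print n + 0 }
  '
}

# Example (not run here): sample a few times and check the number stays flat.
# for i in 1 2 3; do netstat -tan | count_established 10255; sleep 60; done
```

If the printed count keeps climbing between samples, that would suggest connections are still being leaked.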

jchauncey commented 7 years ago

will fix #153

mboersma commented 7 years ago

I'm testing this as described above but running into a telegraf error:

$ kd logs -f deis-monitor-telegraf-kxpcn
Node Name set (minikube)
Node IP set (192.168.99.100)
Creating topic with URL: http://10.0.0.36:4151/topic/create?topic=metrics
Setting KUBERNETES_URL: http://192.168.99.100:10255
Building config.toml!
Finished building toml...
###########################################
...
# Set Service Input Configuration
[[inputs.nsq_consumer]]
  server = "10.0.0.36:4150"
  topic = "metrics"
  channel = "consumer"
  max_in_flight = 100
  data_format = "influx"
###########################################
###########################################
/usr/bin/telegraf: line 1: syntax error near unexpected token `<'
/usr/bin/telegraf: line 1: `<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message><Details>Anonymous users does not have storage.objects.get access to object telegraf/telegraf.</Details></Error>'

mboersma commented 7 years ago

It appears telegraf failed to be downloaded into the container:

root@7f3ec01de15b:/# which telegraf
/usr/bin/telegraf
root@7f3ec01de15b:/# cat /usr/bin/telegraf 
<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message><Details>Anonymous users does not have storage.objects.get access to object telegraf/telegraf.</Details></Error>
root@7f3ec01de15b:/#

felixbuenemann commented 7 years ago

The curl calls in the telegraf Dockerfile should use curl -fsSL instead of curl -sSL so that they return a non-zero exit code on HTTP errors:

curl -sSL https://storage.googleapis.com/telegraf/telegraf ; echo -e "\nExit code: $?"
<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message><Details>Anonymous users does not have storage.objects.get access to object telegraf/telegraf.</Details></Error>
Exit code: 0

curl -fsSL https://storage.googleapis.com/telegraf/telegraf ; echo -e "\nExit code: $?"
curl: (22) The requested URL returned error: 403 Forbidden

Exit code: 22
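
As a second line of defense, the downloaded file itself could be checked before it is used, since the failure above left an XML error body at /usr/bin/telegraf. The following is only a sketch, not what the Dockerfile in this repo does; the function name is_elf_binary is hypothetical, and it assumes the expected download is a Linux ELF executable.

```shell
#!/bin/sh
# Return 0 if the file starts with the ELF magic bytes (0x7f 'E' 'L' 'F'),
# non-zero otherwise (including when the file is missing or too short).
is_elf_binary() {
  # Read the first four bytes and render them as a hex string.
  magic=$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "7f454c46" ]
}

# Hypothetical use in a build step ($URL is a placeholder):
# curl -fsSL "$URL" -o /usr/bin/telegraf && is_elf_binary /usr/bin/telegraf || exit 1
```

With a check like this, an error page saved in place of the binary would fail the build rather than surfacing later as a shell syntax error at container start.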
mboersma commented 7 years ago

Tested as recommended: all dashboards show updating CPU/memory stats, and open connections look stable:

$ netstat -tan | grep 10255
tcp        0      0 :::10255                :::*                    LISTEN      
tcp        0      0 ::ffff:192.168.99.100:10255 ::ffff:172.17.0.9:55846 ESTABLISHED