deis / monitor

Monitoring for Deis Workflow
https://deis.com
MIT License

feat(influxdb): Use stable/influxdb chart #185

Closed jchauncey closed 7 years ago

jchauncey commented 7 years ago

closes #180

Manual Testing Steps:

Notes

Prereqs

Test InfluxDB persistence Migration

Install master of deis/monitor

Install this PR of deis/monitor

Test InfluxDB deployment resource is deleted after upgrade/install

We can no longer simply turn off parts of the chart; instead, we must delete the Deployment resource after install/upgrade. The biggest problem here is that the influxdb chart builds its resource names programmatically, so we rely on the fact that most deployments should be named {{ .Release.Name }}-influxdb.
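One way to express "delete the Deployment after install/upgrade" is a Helm hook Job. The sketch below is an assumption about shape, not necessarily what this PR implements: the Job name, kubectl image, and hook annotations are illustrative, and `--ignore-not-found` keeps the hook from failing when the deployment has already been removed.

```yaml
# Hypothetical post-install/post-upgrade hook that removes the chart's
# bundled InfluxDB deployment when running off-cluster. Names and image
# are illustrative, not taken from this PR.
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-remove-influxdb
  annotations:
    "helm.sh/hook": post-install,post-upgrade
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: remove-influxdb
        image: lachlanevenson/k8s-kubectl  # illustrative kubectl image
        args:
        - delete
        - deployment
        - {{ .Release.Name }}-influxdb
        - --ignore-not-found  # don't error if the deployment is absent
```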

Install

This is to validate that a clean install of this chart does the right thing with off-cluster InfluxDB.
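For reference, the off-cluster configuration can be supplied via `--set` or a values file; the key below is taken from the `helm upgrade` command used during testing later in this thread.

```yaml
# Values override to run InfluxDB off-cluster (key as used with --set
# in the testing commands in this thread).
global:
  influxdb_location: off-cluster
```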

Upgrade

This is to validate that upgrading from an existing install (whatever its configuration) does the correct thing. Note that we kill off the telegraf pods and let them restart so they can pick up the new off-cluster InfluxDB configuration, in case the user was not using it previously.

Test No Persistence -> No Persistence upgrade

Install master of deis/monitor

Install this PR of deis/monitor

Test older tiller vs. new tiller

Install a version older than 2.3.0 of tiller

deis-bot commented 7 years ago

@rimusz is a potential reviewer of this pull request based on my analysis of git blame information. Thanks @jchauncey!

vdice commented 7 years ago

With respect to Test InfluxDB deployment resource is deleted after upgrade/install, all appears well (expectations met) with the following caveat: the helm upgrade command technically errors out and exits non-zero:

 $ helm upgrade deis-workflow workflow-pr/workflow --version v2.12.1-20170330220532-sha.77e675e --set global.influxdb_location=off-cluster
Release "deis-workflow" has been upgraded. Happy Helming!
Error: deployments.extensions "deis-workflow-influxdb" not found

 $ echo $?
1

Possible to avoid erroring out in this scenario?

vdice commented 7 years ago

W/r/t Test InfluxDB persistence Migration:

I tested at the level of a full Workflow install, and although I am meeting the expectations listed in the description (new volume locations for storing influxdb data), after the Workflow upgrade my Grafana instance cannot locate the migrated data (dashboards are empty, and the grafana pod logs show 2017/03/30 22:38:07 http: proxy error: dial tcp 10.131.246.29:80: i/o timeout).

Here were my steps: https://gist.github.com/vdice/d0325647ad4136feb76fa8e9e0a0725e

vdice commented 7 years ago

W/r/t Test existing installation with no persistence and upgrade to influxdb with persistence:

When attempting to upgrade, the expectations are not met; namely, the migrate-data job/pod does not finish (see below) and the upgrade fails:

 $ helm upgrade --wait deis-workflow workflow-pr/workflow --version v2.12.1-20170330220532-sha.77e675e -f values-only-influxdb-persistent.yaml
Error: UPGRADE FAILED: timed out waiting for the condition

 $ kd describe po migrate-data-5zzmh-msld4
...
Events:
  FirstSeen LastSeen    Count   From            SubObjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  2m        1s      11  {default-scheduler }            Warning     FailedScheduling    [SchedulerPredicates failed due to persistentvolumeclaims "deis-monitor-influxdb" not found, which is unexpected., SchedulerPredicates failed due to persistentvolumeclaims "deis-monitor-influxdb" not found, which is unexpected., SchedulerPredicates failed due to persistentvolumeclaims "deis-monitor-influxdb" not found, which is unexpected.]
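The scheduler events above indicate the migrate-data pod references a PersistentVolumeClaim named deis-monitor-influxdb that does not exist at scheduling time, so a claim with that name would need to be created before (or alongside) the migration job. A minimal sketch, where the access mode and size are assumptions and not taken from the chart:

```yaml
# Illustrative PVC matching the name the scheduler reports as missing;
# storage size and access mode are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deis-monitor-influxdb
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
```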
jchauncey commented 7 years ago

This PR is waiting on an upstream change before it can be tested.

jchauncey commented 7 years ago

With the k8s bug around binding persistent volumes on pods that have moved nodes I think I will just close this PR for now.