deis / monitor

Monitoring for Deis Workflow
https://deis.com
MIT License

feat(influxdb): Use stable/influxdb chart #185

Closed jchauncey closed 7 years ago

jchauncey commented 7 years ago

closes #180

Manual Testing Steps:

Notes

Prereqs

Test InfluxDB persistence Migration

Install master of deis/monitor

Install this PR of deis/monitor

Test InfluxDB deployment resource is deleted after upgrade/install

We can no longer simply turn off parts of the chart; instead, we must delete the Deployment resource after install/upgrade. The biggest problem here is that the influxdb chart builds its resource names programmatically, so we rely on the fact that most deployments should be named {{ .Release.Name }}-influxdb.
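One way to express "delete the Deployment after install/upgrade" is a Helm hook Job. The sketch below is an assumption about shape, not necessarily what this PR implements: the Job name, kubectl image, and hook annotations are illustrative, and `--ignore-not-found` keeps the hook from failing when the deployment has already been removed.

```yaml
# Hypothetical post-install/post-upgrade hook that removes the chart's
# bundled InfluxDB deployment when running off-cluster. Names and image
# are illustrative, not taken from this PR.
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-remove-influxdb
  annotations:
    "helm.sh/hook": post-install,post-upgrade
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: remove-influxdb
        image: lachlanevenson/k8s-kubectl  # illustrative kubectl image
        args:
        - delete
        - deployment
        - {{ .Release.Name }}-influxdb
        - --ignore-not-found  # don't error if the deployment is absent
```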

Install

This is to validate that a clean install of this chart does the right thing with off-cluster InfluxDB.
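For reference, the off-cluster configuration can be supplied via `--set` or a values file; the key below is taken from the `helm upgrade` command used during testing later in this thread.

```yaml
# Values override to run InfluxDB off-cluster (key as used with --set
# in the testing commands in this thread).
global:
  influxdb_location: off-cluster
```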

Upgrade

This is to validate that upgrading from an existing install (whatever its configuration) does the correct thing. Note that we kill off the telegraf pods and let them restart so they can pick up the new off-cluster InfluxDB configuration, in case the user was not using it previously.

Test No Persistence -> No Persistence upgrade

Install master of deis/monitor

Install this PR of deis/monitor

Test older tiller vs. new tiller

Install a version older than 2.3.0 of tiller

deis-bot commented 7 years ago

@rimusz is a potential reviewer of this pull request based on my analysis of git blame information. Thanks @jchauncey!

vdice commented 7 years ago

With respect to Test InfluxDB deployment resource is deleted after upgrade/install, all appears well (expectations met) with the following caveat: the helm upgrade command technically errors out and exits non-zero:

 $ helm upgrade deis-workflow workflow-pr/workflow --version v2.12.1-20170330220532-sha.77e675e --set global.influxdb_location=off-cluster
Release "deis-workflow" has been upgraded. Happy Helming!
Error: deployments.extensions "deis-workflow-influxdb" not found

 $ echo $?
1

Possible to avoid erroring out in this scenario?

vdice commented 7 years ago

W/r/t Test InfluxDB persistence Migration:

I tested at the level of a full Workflow install, and although I am meeting the expectations listed in the description (new volume locations for storing influxdb data), after the Workflow upgrade my Grafana instance cannot locate the migrated data (dashboards are empty, and the grafana pod logs show 2017/03/30 22:38:07 http: proxy error: dial tcp 10.131.246.29:80: i/o timeout).

Here were my steps: https://gist.github.com/vdice/d0325647ad4136feb76fa8e9e0a0725e

vdice commented 7 years ago

W/r/t Test existing installation with no persistence and upgrade to influxdb with persistence:

When attempting to upgrade, the expectations are not met; namely, the migrate-data job/pod does not finish (see below) and the upgrade fails:

 $ helm upgrade --wait deis-workflow workflow-pr/workflow --version v2.12.1-20170330220532-sha.77e675e -f values-only-influxdb-persistent.yaml
Error: UPGRADE FAILED: timed out waiting for the condition

 $ kd describe po migrate-data-5zzmh-msld4
...
Events:
  FirstSeen LastSeen    Count   From            SubObjectPath   Type        Reason          Message
  --------- --------    -----   ----            -------------   --------    ------          -------
  2m        1s      11  {default-scheduler }            Warning     FailedScheduling    [SchedulerPredicates failed due to persistentvolumeclaims "deis-monitor-influxdb" not found, which is unexpected., SchedulerPredicates failed due to persistentvolumeclaims "deis-monitor-influxdb" not found, which is unexpected., SchedulerPredicates failed due to persistentvolumeclaims "deis-monitor-influxdb" not found, which is unexpected.]
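The scheduler events above indicate the migrate-data pod references a PersistentVolumeClaim named deis-monitor-influxdb that does not exist at scheduling time, so a claim with that name would need to be created before (or alongside) the migration job. A minimal sketch, where the access mode and size are assumptions and not taken from the chart:

```yaml
# Illustrative PVC matching the name the scheduler reports as missing;
# storage size and access mode are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deis-monitor-influxdb
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 8Gi
```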
jchauncey commented 7 years ago

This PR is waiting on an upstream change before it can be tested.

jchauncey commented 7 years ago

With the k8s bug around binding persistent volumes on pods that have moved nodes I think I will just close this PR for now.