CodeForPhilly / chime

COVID-19 Hospital Impact Model for Epidemics
https://codeforphilly.github.io/chime/
MIT License

Create Observability Stack for Monitoring and Logging #3

Open MooseQuest opened 4 years ago

MooseQuest commented 4 years ago

Generating the observability stack serves the following purposes:

Components to generate:

Technologies and software to consider:

lottspot commented 4 years ago

Going to try out this pre-rolled stack as a starting point: https://github.com/coreos/kube-prometheus

Even if all goes well, this doesn't get us a logging stack; just metrics and monitoring.
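(For anyone following along, here's a rough sketch of how kube-prometheus gets applied per its README; exact paths and flags can vary by release:)

```bash
# Clone the pre-rolled stack
git clone https://github.com/coreos/kube-prometheus.git
cd kube-prometheus

# CRDs and the monitoring namespace go in first...
kubectl create -f manifests/setup

# ...then Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, etc.
kubectl create -f manifests/
```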

lottspot commented 4 years ago

Roll out went well and we have metrics dashboards running at https://metrics.chime-live-cluster.phl.io/

The manifests used for the rollout are currently sitting in the issues/3 branch, where they will remain until the freeze on PRs to master is lifted.

rcknplyr commented 4 years ago

@lottspot would we consider this completed?

lottspot commented 4 years ago

We don't have anything capturing logs yet, so this is technically not completed.

MooseQuest commented 4 years ago

I'll be pushing up what we have so far onto a branch and will reference here.

mariekers commented 4 years ago

Would someone be interested in telling a non-devops person how this differs from #32 ?

fxdgear commented 4 years ago

Just going to leave a few comments here for posterity:

I had a conversation with @MooseQuest and he told me that Elasticsearch was installed on the dev k8s cluster.

Elasticsearch was installed following the instructions here: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-quickstart.html

For reference, the instructions there are for installing Elastic Cloud, which is a service for managing multiple Elasticsearch deployments. Think of it as https://cloud.elastic.co on prem, meaning you get a web interface for managing multiple ES clusters: you can upgrade, manage backups, etc. It's a great service, but it might be overkill to have an Elastic Cloud service for each CHIME deployment.

My recommendation is that each deployment of CHIME have a single deployment of Elasticsearch.

To deploy Elasticsearch (and the Elastic Stack at large), I would recommend using the Elasticsearch Helm charts.
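A minimal sketch of what that looks like with Helm 3 (the release names here are just examples; the chart names come from the official elastic Helm repo):

```bash
# Add the official Elastic chart repository
helm repo add elastic https://helm.elastic.co
helm repo update

# Deploy an Elasticsearch cluster with the chart defaults (3 nodes)
helm install elasticsearch elastic/elasticsearch
```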

Elasticsearch Helm chart requirements are:

Elasticsearch, being a distributed system, operates on a high-availability model, meaning the minimum number of Elasticsearch nodes should be 3. This is why the Kubernetes cluster must have at least 3 nodes: it allows the Elasticsearch cluster to survive a Kubernetes node failure.
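For illustration, the node count is driven by the chart's values; a hypothetical values file might look like the following (the storage size is a placeholder, and `replicas`/`minimumMasterNodes` are values exposed by the elastic/elasticsearch chart):

```bash
# Hypothetical values file for the elastic/elasticsearch chart
cat > es-values.yaml <<'EOF'
replicas: 3               # one Elasticsearch pod per k8s node
minimumMasterNodes: 2     # quorum, so the cluster survives losing one node
volumeClaimTemplate:
  resources:
    requests:
      storage: 30Gi       # placeholder; size this to expected log/metric volume
EOF

helm upgrade --install elasticsearch elastic/elasticsearch -f es-values.yaml
```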

Using the helm charts also gives us the added benefit of being able to deploy:

The benefit of having all this data going into Elasticsearch is that you can use Kibana to visualize all these different data sources in one place.

Kibana also has a "Logs" app which lets you tail logs as they come into Elasticsearch. You can even filter on k8s labels, pod names, namespaces, etc.
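As a rough sketch of the logging side (same chart repo as above; I believe the Filebeat chart's default config already attaches the k8s metadata that makes that filtering possible):

```bash
# Kibana for the UI (including the Logs app), pointed at the Elasticsearch release above
helm install kibana elastic/kibana

# Filebeat as a DaemonSet to ship container logs into Elasticsearch
helm install filebeat elastic/filebeat
```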

The Elastic APM service currently has support for the following languages:

themightychris commented 4 years ago

@fxdgear long term, we're not looking to give each deployment of CHIME its own cluster. That was a stop-gap measure to proceed quickly. Eventually, we want to have a single prod cluster hosting many civic applications including chime, alternate versions of chime, follow-up projects related to chime, and other local civic projects. We are thinking that each project would be within its own namespace.

We need an infrastructure that gets us as close as possible to each project/namespace being free-when-idle. Any cluster services that we need to deploy per-project/namespace instances of will create poor economics for us. We have very modest funding within which we need to be able to host a large number of low-traffic projects sustainably for many years. At any given time, only a small number of projects, if any, will have high traffic. It's kind of the inverse of most enterprise use cases.

Given that, would you adjust your recommendations at all?

fxdgear commented 4 years ago

@themightychris Thanks for the quick response.

Given the long-term goal of a single K8s cluster with multiple namespaces, what I would recommend in this case is the following:

The end goal here (wrt the elastic stack) is a single deployment of the elastic tooling, configured in a way that lets you add and remove namespaces (i.e. various CHIME-related projects and deployments).

But you end up with a singular entity to monitor ALL your deployments.

This was not explicit in my previous comment, but the goal here is that whether you end up with multiple k8s clusters or a single k8s cluster, you still only need a single elastic stack deployment per k8s cluster.

This strategy will scale regardless.


On another note, depending on the volume of logs/metrics, you may or may not run out of disk space for storing data in Elasticsearch. There are a couple of ways to handle this.

If you have a policy on the length of time you are required (or want) to store logs, you can do any of the following:

  1. Increase the disk size of your PVC to account for the amount of data you need to store.
  2. Schedule snapshots of the data to store outside the cluster.
  3. Roll-ups (basically storing older data with lower fidelity).
  4. And finally, use ILM (Index Lifecycle Management) to automate a lot of this and ensure your disks don't fill up with stale data (a sketch follows below).
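As a sketch of option 4, an ILM policy is just a JSON document you PUT to the cluster; the policy name and retention numbers below are hypothetical placeholders:

```bash
# Hypothetical policy: roll over daily or at 10GB, drop indices after 30 days
curl -X PUT "http://localhost:9200/_ilm/policy/chime-logs" \
  -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "10gb", "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'
```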