CodeForPhilly / chime

COVID-19 Hospital Impact Model for Epidemics
https://codeforphilly.github.io/chime/
MIT License

Create Observability Stack for Monitoring and Logging #3

Open MooseQuest opened 4 years ago

MooseQuest commented 4 years ago

Generating the observability stack serves the following purposes:

Components to generate:

Technologies and software to consider:

lottspot commented 4 years ago

Going to try out this pre-rolled stack as a starting point: https://github.com/coreos/kube-prometheus

Even if all goes well, this doesn't get us a logging stack; just metrics and monitoring.
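(For anyone following along, here's a rough sketch of how kube-prometheus gets applied per its README; exact paths and flags can vary by release:)

```bash
# Clone the pre-rolled stack
git clone https://github.com/coreos/kube-prometheus.git
cd kube-prometheus

# CRDs and the monitoring namespace go in first...
kubectl create -f manifests/setup

# ...then Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, etc.
kubectl create -f manifests/
```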

lottspot commented 4 years ago

Roll out went well and we have metrics dashboards running at https://metrics.chime-live-cluster.phl.io/

The manifests used for the rollout are currently sitting in the issues/3 branch, where they will remain until the freeze on PRs to master is lifted.

rcknplyr commented 4 years ago

@lottspot would we consider this completed?

lottspot commented 4 years ago

We don't have anything capturing logs yet, so this is technically not completed.

MooseQuest commented 4 years ago

I'll be pushing up what we have so far onto a branch and will reference here.

mariekers commented 4 years ago

Would someone be interested in telling a non-devops person how this differs from #32 ?

fxdgear commented 4 years ago

Just going to leave a few comments here for posterity:

I had a conversation with @MooseQuest and he told me that Elasticsearch was installed on the dev k8s cluster.

Elasticsearch was installed following the instructions here: https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-quickstart.html

For reference, the instructions there are for installing Elastic Cloud, which is a service for managing multiple Elasticsearch deployments. Think of it as https://cloud.elastic.co on prem, meaning you get a web interface for managing multiple ES clusters: you can upgrade, manage backups, etc. It's a great service, but it might be overkill to have an Elastic Cloud service for each CHIME deployment.

My recommendation is that each deployment of CHIME have a single deployment of Elasticsearch.

To deploy Elasticsearch (and the Elastic Stack at large), I would recommend using the Elasticsearch Helm charts.
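A minimal sketch of what that looks like with Helm 3 (the release names here are just examples; the chart names come from the official elastic Helm repo):

```bash
# Add the official Elastic chart repository
helm repo add elastic https://helm.elastic.co
helm repo update

# Deploy an Elasticsearch cluster with the chart defaults (3 nodes)
helm install elasticsearch elastic/elasticsearch
```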

Elasticsearch Helm chart requirements are:

Elasticsearch, being a distributed system, operates on a high-availability model, meaning the minimum number of Elasticsearch nodes should be 3. This is why the Kubernetes cluster must have at least 3 nodes: it allows the Elasticsearch cluster to survive a Kubernetes node failure.
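For illustration, the node count is driven by the chart's values; a hypothetical values file might look like the following (the storage size is a placeholder, and `replicas`/`minimumMasterNodes` are values exposed by the elastic/elasticsearch chart):

```bash
# Hypothetical values file for the elastic/elasticsearch chart
cat > es-values.yaml <<'EOF'
replicas: 3               # one Elasticsearch pod per k8s node
minimumMasterNodes: 2     # quorum, so the cluster survives losing one node
volumeClaimTemplate:
  resources:
    requests:
      storage: 30Gi       # placeholder; size this to expected log/metric volume
EOF

helm upgrade --install elasticsearch elastic/elasticsearch -f es-values.yaml
```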

Using the helm charts also gives us the added benefit of being able to deploy:

The benefit of having all this data going into Elasticsearch is that you can use Kibana to visualize all these different data sources in one place.

Kibana also has a "Logs" app which lets you tail logs as they come into Elasticsearch. You can even filter on k8s labels, pod names, namespaces, etc.
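As a rough sketch of the logging side (same chart repo as above; I believe the Filebeat chart's default config already attaches the k8s metadata that makes that filtering possible):

```bash
# Kibana for the UI (including the Logs app), pointed at the Elasticsearch release above
helm install kibana elastic/kibana

# Filebeat as a DaemonSet to ship container logs into Elasticsearch
helm install filebeat elastic/filebeat
```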

The Elastic APM service currently has support for the following languages:

themightychris commented 4 years ago

@fxdgear long term, we're not looking to give each deployment of CHIME its own cluster. That was a stop-gap measure to proceed quickly. Eventually, we want to have a single prod cluster hosting many civic applications including chime, alternate versions of chime, follow-up projects related to chime, and other local civic projects. We are thinking that each project would be within its own namespace.

We need an infrastructure that gets us as close as possible to each project/namespace being free-when-idle. Any cluster services that we need to deploy per-project/namespace instances of will create poor economics for us. We have very modest funding within which we need to be able to host a large number of low-traffic projects sustainably for many years. At any given time, only a small number of projects, if any, will have high traffic. It's kind of the inverse of most enterprise use cases.

Given that, would you adjust your recommendations at all?

fxdgear commented 4 years ago

@themightychris Thanks for the quick response.

Given the long-term goal of a single K8s cluster with multiple namespaces, what I would recommend in this case is the following:

The end goal here (wrt the elastic stack) is a single deployment of the elastic tooling, configured in a way that lets you add and remove namespaces (i.e. various CHIME-related projects and deployments).

But you end up with a singular entity to monitor ALL your deployments.

This was not explicit in my previous comment, but the goal here is that whether you end up with multiple k8s clusters or a single k8s cluster, you still only need a single elastic stack deployment per k8s cluster.

This strategy will scale regardless.


On another note, depending on the volume of logs/metrics, you may or may not run out of disk space for storing data in Elasticsearch. There are a couple of ways to handle this.

If you have a policy on the length of time you are required (or want) to store logs, you can do any of the following:

  1. Increase the disk size of your PVC to account for the amount of data you need to store.
  2. Schedule snapshots of the data to store outside the cluster.
  3. Roll-ups (basically storing older data with lower fidelity).
  4. And finally, use ILM (Index Lifecycle Management) to automate a lot of this and ensure your disks don't fill up with stale data (a sketch follows below).
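As a sketch of option 4, an ILM policy is just a JSON document you PUT to the cluster; the policy name and retention numbers below are hypothetical placeholders:

```bash
# Hypothetical policy: roll over daily or at 10GB, drop indices after 30 days
curl -X PUT "http://localhost:9200/_ilm/policy/chime-logs" \
  -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "10gb", "max_age": "1d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'
```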