3box / keramik

A k8s operator for simulating Ceramic networks

Observability should be part of the network #83

Open nathanielc opened 12 months ago

nathanielc commented 12 months ago

Currently, observability logic only exists as part of the simulation. We should instead enable networks to be observed directly, without requiring a simulation.

We want the design of observability to have good defaults but not to require users to deploy specific observability solutions.

Observability

There are three aspects to observability:

- Logs
- Metrics
- Traces

We will need a solution for each of these where it's easy for users of Keramik to collect logs, metrics, and traces while still being able to use their preferred collection agent.

For example, a default installation may use plain log streams for logs, Prometheus for metrics, and Jaeger for traces. However, these specific technologies should not be required.

Pull vs Push

Keramik should default to pulling metrics data, because there are multiple consumers of the metrics. For example, a common usage of Keramik will be to have one long-lived metrics stack pulling metrics for the network, in addition to a short-lived metrics stack that pulls metrics for a single simulation run. If pods were expected to push their metrics, they would have to be restarted to accommodate multiple metrics endpoints, which is not viable. Therefore metrics collection must be done by pulling (i.e. Prometheus scraping).

However, this is unique to metrics; logs and traces need only support a single endpoint, so push is acceptable in those cases.

Defaults vs Custom

Keramik should have two modes of configuring observability:

- Default: Keramik deploys a simple observability stack with sensible defaults.
- Custom: Keramik deploys nothing and users bring their own observability tooling.

The default stack should be very simple; I propose we use Prometheus for metrics, Jaeger for tracing, and rely on vanilla k8s logging for logs. This is likely best achieved using OpenTelemetry as the collection agent.

Requirements:

- Metrics
- Logs
- Traces

Implementation

The above is a rough specification of how we want observability to be managed by Keramik. What follows is one way it could be implemented. I am open to other solutions.

Default Mode

Keramik will deploy an OpenTelemetry collector to the network and will use various k8s discovery mechanisms to find all Ceramic pods and scrape their metrics. The OpenTelemetry collector will then publish those metrics to a Prometheus instance with a persistent volume to track metrics for the network.
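As a rough sketch, the collector's metrics pipeline could look something like the following. The `app=ceramic` pod label and the Prometheus address are assumptions, and Prometheus would need remote write ingestion enabled:

```yaml
# Sketch of an OpenTelemetry collector metrics pipeline. Assumes Ceramic
# pods carry a hypothetical `app=ceramic` label and that Prometheus runs
# with --web.enable-remote-write-receiver.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: ceramic
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            # Keep only pods with the (assumed) app=ceramic label.
            - source_labels: [__meta_kubernetes_pod_label_app]
              regex: ceramic
              action: keep

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```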

Each node will be configured with the OpenTelemetry OTLP endpoint for sending traces. The OpenTelemetry collector will then forward those traces (in batches) to Jaeger running in all-in-one mode with in-memory storage only. Traces are generally large, so keeping only what fits in memory provides a simple limit on the amount of trace data that is retained.
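A minimal sketch of that traces pipeline, assuming Jaeger all-in-one is reachable in-cluster at `jaeger:4317` with OTLP ingestion enabled:

```yaml
# Sketch of the collector's traces pipeline: receive OTLP from the
# nodes, batch, and forward to Jaeger all-in-one over OTLP gRPC.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317  # assumed in-cluster Jaeger service name
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```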

For logs nothing specific will be done; we will rely on the vanilla k8s logging infrastructure for each of the pods.

Custom Mode

In this mode Keramik will not deploy OpenTelemetry, Prometheus, or Jaeger. Instead it will publish CRDs or use labels so that other operators can discover and observe the pods. The specifics here will need some research to determine which strategy is best. Open to ideas.

3benbox commented 12 months ago

Sounds great. I don't know that a lot needs to change, but here's my $0.02:

For logs, change nothing. There are lots of cloud provider or third party k8s log management tools that should work fine.

For traces, expose a collector setting on a network. If it's set, don't collect simulation traces; otherwise, deploy and collect traces with the current design during simulations.
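For example, that setting could be a field on the Network resource. This is purely illustrative; the `traceCollectorEndpoint` field name is an assumption, not part of keramik's current API:

```yaml
# Hypothetical: a network pointing traces at an externally managed
# collector instead of a keramik-deployed one.
apiVersion: keramik.3box.io/v1alpha1
kind: Network
metadata:
  name: small
spec:
  replicas: 2
  traceCollectorEndpoint: http://my-collector.monitoring:4317  # assumed field
```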

For metrics: Option 1 - use the ServiceMonitor and PodMonitor CRDs provided by the prometheus-operator project. These resources would be created by keramik in the namespace of the network. In the default case, a namespace-scoped operator would discover resources in the network's namespace, while a cluster-scoped operator could monitor multiple networks across namespaces.
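A sketch of the kind of PodMonitor keramik could create, assuming the same hypothetical `app=ceramic` label and a container port named `metrics`:

```yaml
# Sketch of a keramik-created PodMonitor for a network's namespace.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ceramic
  namespace: keramik-small  # the network's namespace
spec:
  selector:
    matchLabels:
      app: ceramic  # assumed pod label
  podMetricsEndpoints:
    - port: metrics  # assumed container port name
      interval: 15s
```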

Option 2 - use the Prometheus HTTP service discovery feature to query the keramik controller for metric endpoints. In the default case, a namespace-scoped Prometheus would query the keramik operator with a namespace option (http://keramik/metrics/?namespace=keramik-small) to discover metric endpoints. The custom mode would allow a cluster-scoped setup to query for metric endpoints in all namespaces.
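A sketch of the Prometheus side of that, using the hypothetical discovery URL above. HTTP service discovery expects the endpoint to return a JSON list of target groups:

```yaml
# Sketch of a Prometheus scrape config using HTTP service discovery
# against the keramik operator. The operator would need to respond with
# JSON like:
#   [{"targets": ["ceramic-0.keramik-small:9464"],
#     "labels": {"network": "small"}}]
# (hostnames and port are illustrative).
scrape_configs:
  - job_name: keramik-network
    http_sd_configs:
      - url: http://keramik/metrics/?namespace=keramik-small
        refresh_interval: 30s
```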

The first option is nice and simple for keramik, but OpenTelemetry requires an additional component (the otel target allocator) to consume PodMonitors and ServiceMonitors.

Either option allows metric collection by keramik-deployed observability tools as well as any custom tooling.