Altinn / architecture-decision-log

Log for architecture decisions across Altinn teams and products

Monitoring, observability and instrumentation with OpenTelemetry for Altinn 3 #3

Open martinothamar opened 2 weeks ago

martinothamar commented 2 weeks ago

Status

Proposed

Context

Instrumentation and telemetry are typically thought of as an "Ops" thing, used mostly when debugging and investigating bugs and outages in systems. For a long time developers relied mainly on application logs to debug issues. As various vendors brought their monitoring products to market, such as Azure Application Insights, distributed tracing and metrics have become a part of the toolkit for debugging/monitoring. Still, instrumentation of traces and metrics is largely automatic and therefore carries minimal context (some app/domain-specific context is typically provided in custom middlewares, enrichers or similar). Application log messages are often added only after issues are discovered, which leads to high Mean-Time-To-Recovery (MTTR) and prolonged debugging sessions. Over time application architectures have grown more distributed (microservices) and therefore a lot more complex, and recovery takes longer because many sparse log lines need to be understood to root-cause an issue.

There has in recent years been a shift in perspective and approach to monitoring/instrumentation/telemetry, brought on at least partially by

When trying to develop observable systems, we consider the observability of the components in our system. Components should emit telemetry that helps us understand how they execute in production, both in terms of performance and failure modes. Optimizing for observability leads to custom instrumentation, as that improves our ability to understand the context of the telemetry. An example of this would be to create a custom OTel span for Instance.Create - including custom domain-specific tags/labels, in addition to the Platform Storage API spans provided by automatic HTTP client instrumentation.

A well-customized instrumentation setup that brings app/domain-specific context can improve debugging-ability a lot, but can also be utilized across several other Software Development LifeCycle (SDLC) phases:

With this in mind, we can consider observability another tool in the toolbox for improving developer experience in Altinn 3, in addition to improving operational quality of apps and services.

Current state

For service owner apps and services in the Altinn platform we are using Microsoft.ApplicationInsights for automatic instrumentation and Azure Monitor for monitoring. There are a couple of issues with Azure Monitor and this approach in general

Material

Decision

We should

Why OpenTelemetry

OpenTelemetry is quickly becoming the de facto standard for telemetry instrumentation and shipping (the protocol). Most (if not all) vendors, databases, programming languages and frameworks/libraries are moving in the direction of using the relevant aspects of OpenTelemetry (semantic conventions, collector, wire-protocol etc).

For us as platform builders, it enables us to

Why Grafana products

Instrumentation principles

Plan

  1. First delivery - Q2 2024
    • Instrument app-lib
    • Export telemetry directly to Azure Monitor in deployed environments
    • Export to local LGTM stack when running locally with app-localtest
    • Add Azure Monitor datasource to Grafana in app clusters
    • Altinn Studio docs
    • What comes out of the box - exploration/debugging
    • Custom instrumentation
    • Custom exporter/sink
  2. Agents, architectural flexibility
    • Implement agent in app clusters
    • Export telemetry through agents
    • Consider enrichment, better routing/centralization per team
  3. Grafana databases
    • Implement telemetry databases
    • Built in dashboard
    • Custom dashboards
    • Alerting


Consequences

Considerations

Previous work

martinothamar commented 2 weeks ago

cc @sduranc @bengtfredh - this proposal reflects our current thoughts on the topic. Would appreciate any reviews/edit from you guys. In particular the infra stuff is just guessing on my part still

HauklandJ commented 2 weeks ago

Because comment diff is not so good, here are the changes I made: fixed some typos, and wrote out the abbreviations that may not be so obvious (SDLC, SLO, CRD, TCO, WAL and PV)

EDIT: never mind, there is a good way to see diff, TIL

martinothamar commented 1 week ago

Seems like MS's plans for moving to the OTel protocol have been public for some time: https://github.com/MicrosoftDocs/azure-docs/commit/25d58a0c1e5a1d5740d99fd68d89a9372042838e

Additionally, ingestion based on instrumentation key is going to go out of support on 31.03.2025. Considering the Azure Monitor OTel exporter only supports connection string anyway, I'll add another point about migrating from ApplicationInsights__InstrumentationKey to APPLICATIONINSIGHTS_CONNECTION_STRING for the env variables added to the pod in app clusters. app-lib has been updated to support APPLICATIONINSIGHTS_CONNECTION_STRING in addition to the older variants (prefers connection string in case there are both) in https://github.com/Altinn/app-lib-dotnet/pull/589/commits/0ba8f5abc4552fde56aa952afcfddf7dcf56f09a

https://learn.microsoft.com/en-us/azure/azure-monitor/app/asp-net-core?tabs=netcorenew
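To make the precedence described above concrete, here is a minimal sketch of the lookup order: prefer APPLICATIONINSIGHTS_CONNECTION_STRING and fall back to ApplicationInsights__InstrumentationKey only when no connection string is set. This is an illustration in Python, not the app-lib code from the linked commit. The function name `resolve_app_insights_setting` and the returned tuple shape are made up for the example, and only the two variables named above are handled (the other "older variants" are left out).

```python
import os
from typing import Mapping, Optional, Tuple

# Variable names taken from the comment above.
CONNECTION_STRING_VAR = "APPLICATIONINSIGHTS_CONNECTION_STRING"
INSTRUMENTATION_KEY_VAR = "ApplicationInsights__InstrumentationKey"


def resolve_app_insights_setting(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (variable name, value) for the setting to use.

    The connection string wins when both variables are present. Empty or
    whitespace-only values are treated as unset. Returns None when neither
    variable carries a value.
    """
    for name in (CONNECTION_STRING_VAR, INSTRUMENTATION_KEY_VAR):
        value = env.get(name, "").strip()
        if value:
            return name, value
    return None


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    both_set = {
        CONNECTION_STRING_VAR: "InstrumentationKey=placeholder",
        INSTRUMENTATION_KEY_VAR: "placeholder",
    }
    print(resolve_app_insights_setting(both_set))
```

Returning the variable name together with the value lets the caller log which setting was picked up, which may help spot pods that still rely on the instrumentation key before it goes out of support.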

martinothamar commented 4 days ago

NAIS just posted a blogpost on their work around OTel: https://nais.io/blog/posts/otel-from-0-to-100/