Closed by martinothamar 1 month ago
cc @sduranc @bengtfredh - this proposal reflects our current thoughts on the topic. Would appreciate any reviews/edits from you guys. In particular the infra stuff is still just guessing on my part
Because the comment diff is not so good, here are the changes I made: fixed some typos, and wrote out the abbreviations that may not be so obvious (SDLC, SLO, CRD, TCO, WAL and PV)
EDIT: never mind, there is a good way to see diff, TIL
Seems like MS's plans for moving to the OTel protocol have been public for some time: https://github.com/MicrosoftDocs/azure-docs/commit/25d58a0c1e5a1d5740d99fd68d89a9372042838e
Additionally, ingestion based on instrumentation key is going to go out of support on 31.03.2025. Considering the Azure Monitor OTel exporter only supports connection strings anyway, I'll add another point about migrating from ApplicationInsights__InstrumentationKey to APPLICATIONINSIGHTS_CONNECTION_STRING for the env variables added to the pod in app clusters. app-lib has been updated to support APPLICATIONINSIGHTS_CONNECTION_STRING in addition to the older variants (it prefers the connection string in case both are present) in https://github.com/Altinn/app-lib-dotnet/pull/589/commits/0ba8f5abc4552fde56aa952afcfddf7dcf56f09a
https://learn.microsoft.com/en-us/azure/azure-monitor/app/asp-net-core?tabs=netcorenew
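The preference order described in the comment above (connection string first, instrumentation key as a fallback) can be sketched roughly as follows. This is a language-neutral illustration in Python, not the actual app-lib code, which is .NET; the function name `resolve_telemetry_setting` and the returned tuple shape are hypothetical, and only the two env variable names come from the thread.

```python
import os
from typing import Mapping, Optional, Tuple

# Env variable names taken from the discussion above.
CONNECTION_STRING_VAR = "APPLICATIONINSIGHTS_CONNECTION_STRING"
LEGACY_KEY_VAR = "ApplicationInsights__InstrumentationKey"


def resolve_telemetry_setting(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (kind, value) for the setting to use, or None if neither is set.

    Hypothetical helper: the connection string wins when both are present.
    """
    connection_string = env.get(CONNECTION_STRING_VAR, "").strip()
    if connection_string:
        return ("connection_string", connection_string)

    legacy_key = env.get(LEGACY_KEY_VAR, "").strip()
    if legacy_key:
        # Older deployments that only carry the instrumentation key.
        return ("instrumentation_key", legacy_key)

    return None
```

Passing the environment in as a mapping keeps the lookup easy to exercise without touching the real process environment.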
NAIS just posted a blogpost on their work around OTel: https://nais.io/blog/posts/otel-from-0-to-100/
Accepted 30.05 by the architecture team.
@martinothamar Azure Monitor connection in Grafana is ready to be tested in ttd-tt02-aks
Status
Accepted
Context
Instrumentation and telemetry are typically thought of as an "Ops" thing, used mostly when debugging and investigating bugs and outages in systems. For a long time developers relied mainly on application logs to debug issues. As vendors brought monitoring products such as Azure Application Insights to market, distributed tracing and metrics became part of the toolkit for debugging and monitoring. Still, instrumentation of traces and metrics is largely automatic and therefore carries minimal context (some app/domain-specific context is typically provided in custom middleware, enrichers or similar). Application log messages are often added only after issues are discovered, which leads to high Mean-Time-To-Recovery (MTTR) and prolonged debugging sessions. Over time application architectures have grown more distributed (microservices) and therefore a lot more complex, and recovery takes longer because many sparse log lines need to be understood to root-cause an issue.
There has in recent years been a shift in perspective and approach to monitoring/instrumentation/telemetry, brought on at least partially by
When trying to develop observable systems, we consider the observability of the components in our system. Components should emit telemetry that helps us understand how they execute in production, both in terms of performance and failure modes. Optimizing for observability leads to custom instrumentation, as that improves our ability to understand the context of the telemetry. An example of this would be to create a custom OTel span for Instance.Create, including custom domain-specific tags/labels, in addition to the Platform Storage API spans provided by automatic HTTP client instrumentation.
A well-customized instrumentation setup that brings app/domain-specific context can improve debuggability a lot, but can also be utilized across several other Software Development LifeCycle (SDLC) phases:
With this in mind, we can consider observability another tool in the toolbox for improving developer experience in Altinn 3, in addition to improving operational quality of apps and services.
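To make the Instance.Create example from the Context section more concrete, here is a minimal toy model of how a custom span with domain tags wraps the span that automatic HTTP client instrumentation would emit. It deliberately does not use the OpenTelemetry API: `ToyTracer`, `Span`, `create_instance`, the tag names and the sample values are all hypothetical and only illustrate the parent/child relationship and where the domain context lives.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Span:
    name: str
    parent: Optional["Span"] = None
    tags: Dict[str, str] = field(default_factory=dict)
    duration_ms: float = 0.0


class ToyTracer:
    """Toy stand-in for a tracer; NOT the OpenTelemetry API."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._current: Optional[Span] = None

    @contextmanager
    def span(self, name: str, tags: Optional[Dict[str, str]] = None) -> Iterator[Span]:
        # A new span becomes a child of whatever span is currently active.
        span = Span(name=name, parent=self._current, tags=dict(tags or {}))
        previous, self._current = self._current, span
        started = time.perf_counter()
        try:
            yield span
        finally:
            span.duration_ms = (time.perf_counter() - started) * 1000
            self._current = previous
            self.finished.append(span)


tracer = ToyTracer()


def create_instance(app_id: str, party_id: str) -> None:
    # Custom span carrying domain context (tag names are hypothetical).
    with tracer.span("Instance.Create", {"app.id": app_id, "party.id": party_id}):
        # Stand-in for the span automatic HTTP client instrumentation would emit.
        with tracer.span("HTTP POST", {"http.request.method": "POST"}):
            pass  # the call to the Platform Storage API would happen here


create_instance("ttd/example-app", "12345")
for s in tracer.finished:
    parent = s.parent.name if s.parent else None
    print(s.name, "parent:", parent, "tags:", s.tags)
```

The automatic span only knows about the HTTP call, while the surrounding custom span is where the app/domain-specific context is attached.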
Current state
For service owner apps and services in the Altinn platform we are using Microsoft.ApplicationInsights for automatic instrumentation and Azure Monitor for monitoring.
There are a couple of issues with Azure Monitor and this approach in general
Material
Decision
We should
IOpenTelemetryBuilder for custom config for apps
UseOpenTelemetry (opt-in, default is still Application Insights SDK for v8)
APPLICATIONINSIGHTS_CONNECTION_STRING env variable containing App Insights connection string to deployments (we can remove ApplicationInsights__InstrumentationKey later in the case of application deployments)
Why OpenTelemetry
OpenTelemetry is quickly becoming the de facto standard for telemetry instrumentation and shipping (the protocol). Most (if not all) vendors, databases, programming languages and frameworks/libraries are moving in the direction of using the relevant aspects of OpenTelemetry (semantic conventions, collector, wire protocol etc).
For us as platform builders, it enables us to
Why Grafana products
Instrumentation principles
Plan
Consequences
Microsoft.Extensions.Logging), such as telemetry processors/middleware/filters/enrichers
Considerations
Microsoft.Extensions.Logging, so application logs should look the same after a migration (no code changes)
Previous work