Altinn / app-lib-dotnet

Libraries used in Altinn Apps

Instrumentation and monitoring for apps #580

Open martinothamar opened 3 months ago

martinothamar commented 3 months ago

Description

This is a big draft/braindump issue for problems surrounding monitoring and telemetry across SDLC (software development lifecycle) for Altinn 3.

Prompted by TEs and users, we need to clarify responsibilities regarding monitoring apps running in environments. A line can be drawn between what should be the responsibility of the platform owners and the platform users/application developers.

In addition, we need to advise TEs on how to monitor their applications. We need to provide an instrumentation and diagnostics setup that is usable and flexible both for us as library developers and for app developers, where correlation and contextualization are mostly automatic but still customizable. Flexibility must also be present in how the telemetry is used: we use one vendor today, but might want to switch vendors completely or use two different vendors at the same time.

In scope

To deliver good DevEx related to monitoring and operational aspects we need

Out of scope

?

Additional Information

Who should monitor what

Platform team should monitor infrastructure components such as AKS/k8s and related infrastructure

The platform team is primarily a team from Digdir, but it works in close cooperation with TEs, especially regarding scaling and security issues.

Service owners should monitor their application, for example

Digdir Team Apps should monitor based on library code, for example

Issues discovered in applications may originate from either app code or library code, and it takes investigation to know which. Library code in particular therefore needs to be well instrumented, and there should be a process in place for efficient collaboration between the teams/incident responders during incidents. Both Team Apps and the app development teams need to be able to access telemetry through a monitoring and analysis tool such as Azure App Insights or Grafana.

Instrumentation guidance for app developers

It's tempting to just ask TEs what they want to monitor, but usually they don't know, as the primary competence app teams bring is usually less about operational aspects such as monitoring and more about building good products. A good culture and process for instrumentation and monitoring can significantly improve the operational performance of systems, lead to fewer bugs and deliver better value.

We should put Dev & Ops together and build some deliverables that make it simple for app teams to be good at operating their applications.

Technical design

To ensure flexibility in vendor selection and instrumentation, we should use OpenTelemetry as the standard and library abstraction, as it has great .NET support (all APIs stable). Most of the library infrastructure and abstractions we need are built into the BCL, so we can drop some dependencies and stick to the BCL and the core OpenTelemetry libraries.

Proposed design

We currently have two mechanisms for telemetry

Some users may rely on TelemetryClient for custom instrumentation today (and app-lib does some), but Prometheus metrics are not being shipped yet (PR for helm chart).
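To illustrate the point under Technical design that most of the needed abstraction already lives in the BCL, here is a minimal sketch using only System.Diagnostics (ActivitySource for traces, Meter for metrics). It is not the design proposed in this issue. The names Altinn.App.Sketch, instances.created, CreateInstance and the org tag are hypothetical and chosen only for illustration. The in-process listeners stand in for whatever exporter or vendor SDK would normally subscribe; the instrumented code does not know which one is attached.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical instrumentation surface; all names here are assumptions.
public static class AppTelemetry
{
    public const string SourceName = "Altinn.App.Sketch";

    public static readonly ActivitySource Source = new(SourceName);
    public static readonly Meter Meter = new(SourceName);
    public static readonly Counter<long> InstancesCreated =
        Meter.CreateCounter<long>("instances.created");

    public static void CreateInstance(string org)
    {
        // StartActivity returns null when no listener samples this source,
        // so un-monitored code pays almost nothing.
        using var activity = Source.StartActivity("CreateInstance");
        activity?.SetTag("org", org);
        InstancesCreated.Add(1, new KeyValuePair<string, object?>("org", org));
    }
}

public static class Demo
{
    // Subscribes in-process listeners (stand-ins for a real exporter),
    // runs one instrumented operation and reports what was observed.
    public static (int Spans, long Total) Run()
    {
        var finishedSpans = new List<string>();
        using var activityListener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == AppTelemetry.SourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = activity => finishedSpans.Add(activity.OperationName),
        };
        ActivitySource.AddActivityListener(activityListener);

        long total = 0;
        using var meterListener = new MeterListener();
        meterListener.InstrumentPublished = (instrument, listener) =>
        {
            if (instrument.Meter.Name == AppTelemetry.SourceName)
                listener.EnableMeasurementEvents(instrument);
        };
        meterListener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => total += value);
        meterListener.Start();

        AppTelemetry.CreateInstance("ttd");
        return (finishedSpans.Count, total);
    }

    public static void Main()
    {
        var (spans, total) = Run();
        Console.WriteLine($"{spans} span(s) finished, counter total {total}");
    }
}
```

Because the instrumented code depends only on BCL types, swapping or combining vendors would be a matter of which listeners or exporters are registered at startup, not of changing the call sites.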

Deployment plan:

Design considerations

As developers of app libraries we are responsible for developing/configuring/exposing abstractions that are well suited and flexible in use for instrumentation of code to gain observability. We are also responsible for shipping and processing the telemetry in such a way that developers can make use of this capability in multiple phases of software development

In addition, telemetry and monitoring can be useful in the planning and delivery phases of the software development lifecycle

Why otel?

Kinds of telemetry

When standard telemetry fails

Sometimes you can verify that there are issues, and what the nature of those issues is, but need more information to fix them. Examples:

Tasks

martinothamar commented 2 months ago

ADR proposal: https://github.com/Altinn/architecture-decision-log/issues/3

martinothamar commented 1 month ago

Relevant issue for client side analytics: https://github.com/Altinn/app-frontend-react/issues/853