HHS / simpler-grants-gov

https://simpler.grants.gov

[10k]: Logging and Monitoring #1377

Open coilysiren opened 6 months ago

coilysiren commented 6 months ago

Definition

The goal for this 10k is to define the steps needed to give simpler.grants.gov a highly effective logging and monitoring solution. We want our logging and monitoring to have high coverage, be easy to access, and be well understood by our team.

High Level Goals

Project Phases

Phase 1 - Choose Monitoring Platform

Phase 2 - Provision Monitoring Platform

Phase 3 - Configure Monitoring Platform

Phase 4 - Training and Documentation

Term Definitions

Logs - Single lines of data, usually a string, emitted by an application at a single point in time. An example log line is "search complete, returned 24 results".
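
As a rough illustration of the log definition above, here is a minimal sketch using Python's standard `logging` module. The logger name `search`, the `log_search_complete` helper, and the output format are assumptions made for this example and are not taken from the simpler.grants.gov codebase.

```python
import io
import logging

# Capture log output in memory so the emitted line can be inspected.
# A real service would typically write to stdout or a log shipper instead.
log_stream = io.StringIO()
handler = logging.StreamHandler(log_stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))

# "search" is a hypothetical logger name chosen for this sketch.
logger = logging.getLogger("search")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in our handler only


def log_search_complete(result_count: int) -> None:
    """Emit one log line describing a single point in time."""
    logger.info("search complete, returned %d results", result_count)


log_search_complete(24)
log_output = log_stream.getvalue()
print(log_output, end="")
```

Each call produces exactly one line, which is what distinguishes a log from the aggregated metrics described next.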

Metrics - Aggregations of numerical data points, emitted by an application over a period of time. Metrics are usually attached to engineering signals or high-level product signals. An example metric is (translated into a sentence) "the search endpoint returned 200 status in 99.5% of requests over the last 24 hours".
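
To make the metric definition concrete, the sketch below aggregates individual status codes into a single success-rate number. The `success_rate` function and the sample window of 1,000 requests are hypothetical; a monitoring platform would normally compute this aggregation for us.

```python
from collections import Counter


def success_rate(status_codes: list[int], ok_status: int = 200) -> float:
    """Return the percentage of requests in the window that had ok_status."""
    if not status_codes:
        raise ValueError("no requests observed in this window")
    counts = Counter(status_codes)
    return 100.0 * counts[ok_status] / len(status_codes)


# Hypothetical window of requests: 995 successes and 5 server errors.
window = [200] * 995 + [500] * 5
rate = success_rate(window)
print(f"the search endpoint returned 200 status in {rate:.1f}% of requests")
```

The point of the example is that a metric is derived from many data points over a period of time, whereas a log line describes one event.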

Telemetry - Telemetry is generally represented by "traces" and "spans", which are high-signal key-value data sources that focus on the behavior of a specific slice of code in an application. A span for a given endpoint can combine all of the information you would find inside of simple logs or metrics, in addition to any surrounding metadata about the execution, most notably the execution times of your functions. Telemetry is often captured inside of a product called "APM" or "Application Performance Monitoring". A well-instrumented telemetry span might look like (translated into a sentence) "the search function getSearchResults ran for 200ms, spent 150ms waiting for the database, returned a 200 status, and returned 24 results".

More information about Telemetry, as viewed from the POV of OpenTelemetry specifically, can be found here: 1, 2, 3
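
The span described above can be sketched with a small hand-rolled context manager. This is not the OpenTelemetry or New Relic API; the `span` helper, the `finished_spans` list, the attribute names, and the simulated database wait are all assumptions made so the example runs with only the standard library.

```python
import time
from contextlib import contextmanager


@contextmanager
def span(name, sink):
    """Time a block of code and record key-value attributes about it."""
    record = {"name": name, "attributes": {}}
    start = time.perf_counter()
    try:
        # The caller receives the attribute dict and can add metadata to it.
        yield record["attributes"]
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        sink.append(record)


# Hypothetical in-memory sink; an APM agent would export spans instead.
finished_spans = []


def get_search_results(query):
    with span("getSearchResults", finished_spans) as attrs:
        # Child span covering only the time spent waiting on the database.
        with span("database.query", finished_spans):
            time.sleep(0.01)  # stand-in for a real database call
            results = [f"{query}-{i}" for i in range(24)]
        attrs["http.status_code"] = 200
        attrs["result_count"] = len(results)
        return results


results = get_search_results("grants")
for finished in finished_spans:
    print(finished["name"], finished["attributes"])
```

The outer span carries the status and result count that a log or metric would hold, while the nested span isolates the database wait, which is the kind of timing detail an APM product surfaces.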

In scope

Out of scope

Tasks

coilysiren commented 6 months ago

TODO: create architecture diagram that displays how New Relic connects to our system

coilysiren commented 6 months ago

TODO: SIA (security impact assessment)