department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

[epic] DR | Code Assessment, Logging, and Monitoring #57473

Open saderagsdale opened 1 year ago

saderagsdale commented 1 year ago

Value Statement

As a software developer working on an online application,
I want to be able to assess the health of our codebase,
so that I can ensure that the application is maintainable, scalable, and performs optimally.


Background Context

When we start working on a new feature or bug fix, we need to be confident that the code we're working with is healthy and won't cause any unforeseen issues or technical debt. To do this, we need to be able to assess the health of the codebase regularly, identifying any areas that require attention or improvement. This means we need access to tools that can help us identify code smells, performance bottlenecks, and other potential issues.

We want to be able to use industry-standard tools and techniques to assess the health of our codebase, such as static code analysis, unit testing, and code reviews. We want to be able to easily track the progress of our code health efforts over time and collaborate with others efficiently to address any issues that arise.

Ultimately, by having a healthy codebase, we can be more productive and efficient in our work, delivering high-quality features and bug fixes that meet and exceed our users' expectations.

Acceptance Criteria

Tasks

Definition of Ready

Definition of Done

saderagsdale commented 1 year ago

Phases:

Assessment

- Filter queries to reduce noise + identify big issues
- Review existing logging and monitoring to check for things that were once important
- Review the form to understand what actions the form can take (list them out)
- Identify gaps in tracing issues (example: jobs missing from query, downtime from upstream services, add body of errors from upstream services)
- Flag error messages that aren't actionable (backend or frontend)
- If possible, recommend fixes for frontend errors to improve user experience

KPIs

- Craft KPIs for how you measure success for those things

Implementation

- Determine what you want to see in a dashboard (visuals, alerts, grouping/ungrouping)
- Add logging points
- Review with eng (squad 2)

Bugs

- Prioritization and fixes for anything you've identified
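
As a rough illustration of the "flag error messages that aren't actionable" step in the Assessment phase, here is a minimal Python sketch. Everything in it is hypothetical: the sample messages, the `GENERIC_PATTERNS` list, and the `is_actionable` heuristic are assumptions made for illustration, not the team's actual Sentry queries or error strings.

```python
import re
from collections import Counter

# Hypothetical patterns for messages that give an engineer nothing to act on.
GENERIC_PATTERNS = [
    re.compile(r"^(unknown|unexpected|internal server) error\.?$", re.IGNORECASE),
    re.compile(r"^something went wrong\.?$", re.IGNORECASE),
    re.compile(r"^error\.?$", re.IGNORECASE),
]


def is_actionable(message: str) -> bool:
    """Return False when the message is empty or matches a generic pattern."""
    text = message.strip()
    if not text:
        return False
    return not any(p.match(text) for p in GENERIC_PATTERNS)


def summarize(messages: list[str]) -> dict[str, Counter]:
    """Group messages into actionable / non-actionable buckets with counts."""
    buckets = {"actionable": Counter(), "non_actionable": Counter()}
    for message in messages:
        key = "actionable" if is_actionable(message) else "non_actionable"
        buckets[key][message.strip()] += 1
    return buckets


# Made-up sample data standing in for an export of logged error messages.
sample = [
    "Unknown error",
    "Upstream service timed out after 30s (status 504)",
    "Something went wrong.",
    "Unknown error",
    "",
]
report = summarize(sample)
```

The counts in `report["non_actionable"]` could feed a short list of messages to rewrite first; a real pass would likely need patterns tuned to each form's own errors.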

saderagsdale commented 1 year ago

Ticket #1: Identify KPIs for all five of these forms ("assessment")
Ticket #2: Set up monitoring and logging for these KPIs if not already set up ("implementation")
Ticket #3: Refine Sentry error monitoring dashboards for all five forms ("assessment" and "implementation", simultaneously)

Ticket #1 and Ticket #2 must be done in series, and will ultimately result in a series of DataDog dashboards.
Ticket #3 will ultimately result in a series of Sentry dashboards (the same ones you already saw, just refined).
Ticket #3 and Tickets #1-#2 can be done in parallel.

Mottie commented 1 year ago

Did we want to include COE in this list?

saderagsdale commented 1 year ago

Nope. That's going to another team in the future. @Mottie

va-albers commented 1 year ago

"Conduct error investigation using Sentry and Grafana logs" - If we have things we get from Grafana that we can't find in other monitoring systems, we should note them to make sure they are carried over (since Grafana is being deprecated).

va-albers commented 1 year ago

Thanks @saderagsdale, this looks great. Do you think there are parts of this work that should be built into the template for future development work, and what should that look like? Would that belong as part of this Decision Review, or in its own separate issue?

saderagsdale commented 1 year ago

@va-albers Hey Steve! Good question. Ultimately we want to make the framework repeatable, so the goal is to use this as a pilot to identify what should be in the template. What that looks like is best defined by developers (so it's intuitive for them). It could be template issues that get stored in GitHub and become part of the onboarding/collab cycle for engineers.

data-doge commented 1 year ago

High-level overview of roadmap, to discuss at refinement:

[image: high-level roadmap overview]