department-of-veterans-affairs / notification-api

Notification API

Datadog - Profiler #1120

Closed ldraney closed 1 year ago

ldraney commented 1 year ago

Value Statement

As a DevOps Engineer I want to enable code-level performance profiles and tracing So that we can observe the performance of our code base

Additional Info and Resources

_Overview - From Datadog Profiler Documentation_

A profiler shows how much “work” each function is doing by collecting data about the program as it’s running. For example, if infrastructure monitoring shows your app servers are using 80% of their CPU, you may not know why. Profiling shows a breakdown of the work, for example:

| FUNCTION | CPU USAGE |
| --- | --- |
| doSomeWork | 48% |
| renderGraph | 19% |
| Other | 13% |

When working on performance problems, this information is important because many programs spend a lot of time in a few places, which may not be obvious. Guessing at which parts of a program to optimize causes engineers to spend a lot of time with little results. By using a profiler, you can find exactly which parts of the code to optimize.

If you’ve used an APM tool, you might think of profiling like a “deeper” tracer that provides a fine grained view of your code without needing any instrumentation.

The Datadog Continuous Profiler can track various types of “work”, including CPU usage, amount and types of objects being allocated in memory, time spent waiting to acquire locks, amount of network or file I/O, and more. The profile types available depend on the language being profiled.
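For a Python service like this one, enabling the Continuous Profiler generally means starting the ddtrace profiler alongside the app. Below is a minimal sketch, assuming the ddtrace package is installed and a Datadog agent is reachable; the service/env/version values are placeholders, not this repo's actual configuration.

```python
# Minimal sketch of enabling Datadog's Continuous Profiler in a Python app.
# Assumes ddtrace is installed and a Datadog agent is running; the tag values
# below are placeholders, not this project's real settings.
from ddtrace.profiling import Profiler

profiler = Profiler(
    service="notification-api",  # placeholder service name
    env="dev",                   # placeholder environment
    version="1.0.0",             # placeholder version
)
profiler.start()

# Alternatively, profiling can be enabled without code changes by running the
# app under `ddtrace-run` with DD_PROFILING_ENABLED=true in the environment.
```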

Engineering Checklist

Acceptance Criteria

Have Profiles set up similar to what is available on vagov.ddog-gov.com (you'll have to be logged into the Okta dashboard for this link to work). (screenshot)

We should see something like (I'm not sure the exact names of functions/endpoints for our endpoints, but in general):

| FUNCTION | CPU USAGE |
| --- | --- |
| flask | 48% |
| flask.sms | 19% |
| flask.email | 13% |

GIVEN we have active endpoints (SMS) WHEN those endpoints are being used THEN we can see the performance of those parts of our code

Identify how long an SMS request takes from start to finish within the post_notification method and compare that to what is reported in CloudWatch.
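To make that comparison concrete, one approach is to time the method body directly and log the elapsed value. This is only a sketch; the post_notification signature shown is a placeholder, not the real view function.

```python
import logging
import time

logger = logging.getLogger(__name__)

def post_notification(payload):  # placeholder signature, not the real method
    start = time.perf_counter()
    try:
        ...  # existing SMS handling logic
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        # Compare this value against the request duration reported in CloudWatch
        # and the span duration shown in the Datadog trace.
        logger.info("post_notification took %.1f ms", elapsed_ms)
```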

tabinda-syed commented 1 year ago

Per Lucas, there's a potential next ticket: If the acceptance criteria are not met after a correct installation, but we see the profiler showing up, then we need to implement Unified Service Tagging and make sure it works with our APM setup.
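For reference, Unified Service Tagging ties traces, profiles, and metrics together through the env, service, and version tags. These are normally supplied as DD_ENV, DD_SERVICE, and DD_VERSION environment variables in the deployment environment; the sketch below sets them in-process purely for illustration, with placeholder values.

```python
import os

# Placeholder values; in practice these come from the task definition or
# deployment environment, not from application code.
os.environ.setdefault("DD_ENV", "dev")
os.environ.setdefault("DD_SERVICE", "notification-api")
os.environ.setdefault("DD_VERSION", "1.0.0")

# ddtrace reads these variables when it initializes, so they must be in place
# before the tracer/profiler starts.
```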

tabinda-syed commented 1 year ago

@ldraney Is my understanding correct that this wouldn't have any specific QA considerations just yet? Because we're in the building-blocks stage, right? If yes, I want to go ahead and ask @k-macmillan and @mjones-oddball for a review, as we'd like to bring this ticket into our sprint.

ldraney commented 1 year ago

We are still not getting data for our deliver_sms endpoint, so we are going to follow Datadog's Python custom instrumentation, which is basically to add a decorator to our desired functions. I'll be adding it to deliver_sms in app/celery/provider_tasks.py.
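A minimal sketch of what that decorator approach could look like, using ddtrace's tracer.wrap; the task body, argument, and tag values below are placeholders, not the actual deliver_sms implementation.

```python
from ddtrace import tracer

@tracer.wrap(name="deliver_sms", service="notification-api")  # placeholder names
def deliver_sms(notification_id):
    # Existing delivery logic runs inside the span created by the decorator,
    # so its duration and any errors appear in the Datadog trace.
    ...
```

If the real function is also a Celery task, the tracing decorator would typically sit closest to the function so the span wraps the task body.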

ldraney commented 1 year ago

Okay, this will be the final update on this ticket before closing it. I will create a new, related ticket; it will be a new story to understand why the profiler is giving us information that is not what we were expecting and/or is incomplete.

Our tracing shows the deliver_sms endpoint being used: (screenshot)

And the decorator I added gives us new information whenever I send an SMS in the dev environment: (screenshot)

But the profiler has not changed: (screenshot)

ldraney commented 1 year ago

Closing now; please read the comment above. I will address this in upcoming Datadog meetings and study, and in the creation of the next ticket related to PR #1024.

The primary goal for the next ticket will be either to figure out this profiler bug, to dive deeper into instrumentation, or to address the unified service tagging schema.