cds-snc / notification-planning-core

Project planning for GC Notify Core Team
0 stars 0 forks source link

Tracing / correlate requests for Celery components (AWS X-Ray) #400

Closed P0NDER0SA closed 5 days ago

P0NDER0SA commented 3 months ago

Description

As an ops lead, I want to link all requests going through the Notify stack that was triggered by the same user / same origin IP, So that I can quickly analyze their behavior and actions on our stack.

WHY are we building?

Better tracing support through our stack.

WHAT are we building?

Implementing correlation / trace IDs through the various Notify components. Potentially using aws x-ray.

VALUE created by our solution

Quicker support and incident response. Better routing / behavior analysis of the requests made against Notify.

Acceptance Criteria** (Definition of done)

QA Steps

Additional info

Celery integration with X-Ray:

The one area I am uncertain with xray is how we will plug this with Celery. This and that seems interesting.

P0NDER0SA commented 3 months ago

The very basic middleware code has been added. The Daemon is awaiting information as well. More work is required.

jimleroyer commented 3 months ago

GDS has this previous utility class to trace their request IDs through the application starting from the admin. We can leverage it and modify it for xray (whereas this was meant initially for zipkin from Twitter): https://github.com/cds-snc/notification-utils/blob/623d109f9145996d7f32eacdada190384d8a5ffa/notifications_utils/request_helper.py

P0NDER0SA commented 3 months ago

Jimmy Royer :spiral_calendar_pad: 1 hour ago https://flask.palletsprojects.com/en/2.3.x/patterns/celery/ You can use Celery without any integration with Flask, but it’s convenient to configure it through Flask’s config, and to let tasks access the Flask application. Celery uses similar ideas to Flask, with a Celery app object that has configuration and registers tasks. While creating a Flask app, use the following code to create and configure a Celery app as well.

NEW

1 reply

Jimmy Royer :spiral_calendar_pad: 1 hour ago For a distributed application, X-Ray combines nodes from all services that process requests with the same trace ID into a single service graph. The first service that the request hits adds a tracing header that is propagated between the front end and services that it calls. For example, Scorekeep runs a web API that calls a microservice (an AWS Lambda function) to generate a random name by using a Node.js library. The X-Ray SDK for Java generates the trace ID and includes it in calls to Lambda. Lambda sends tracing data and passes the trace ID to the function. The X-Ray SDK for Node.js also uses the trace ID to send data. As a result, nodes for the API, the Lambda service, and the Lambda function all appear as separate, but connected, nodes on the trace map. Service graph data is retained for 30 days.

P0NDER0SA commented 3 months ago

https://github.com/cds-snc/notification-api/pull/2248

jimleroyer commented 3 months ago

Next steps:

  1. Implement a switch to disable x-ray on demand.
  2. Merge the PR that adds more logging for the Celery custom SDK.
jimleroyer commented 3 months ago

Potential fix for the Celery errors, based on an analysis of the enabled log statements: https://github.com/cds-snc/notification-api/pull/2259

The rationale of the fix is contained in the PR description.

P0NDER0SA commented 3 months ago

Looking to investigate and work on solutions for easy switches to turn Xray On/Off by component, or as a whole

P0NDER0SA commented 3 months ago

https://github.com/cds-snc/notification-manifests/pull/2869

P0NDER0SA commented 3 months ago

https://github.com/cds-snc/notification-api/pull/2261

jimleroyer commented 3 months ago

I investigated further the logs around xray. There is a missing piece I don't get at the moment. The new registered signal for beat workers is wired to the xray_task_prerun handler function and it works. But it is not called for all periodic tasks and for these it is not called, the xray_before_task_publish is called, which cause the errors. What I would normally expect is xray_task_prerun -> xray_before_task_publish but that is not happening. I will take a closer look tomorrow on how that workflow executes for non-periodic tasks, then find the differentiator between periodic task that executes one or the other handler functions, hopefully so that we can execute both.. or maybe bring new logic that supplant the previous one and tailored for periodic tasks.

jimleroyer commented 3 months ago

We got the switch merged into the mainline. We will polish a bit the feature.

P0NDER0SA commented 3 months ago

I'm going to tweak the switch code to use a boolean library in api, then add the same code to admin and doc download. Then i'm going to modify the .env files to have prod set to false and staging to true

P0NDER0SA commented 3 months ago

https://github.com/cds-snc/notification-api/pull/2262

P0NDER0SA commented 3 months ago

updating everything to use the AWS_XRAY_SDK_ENABLED flag that's built into the XRay SDK rather than our custom switches! https://github.com/cds-snc/notification-manifests/pull/2876

P0NDER0SA commented 3 months ago

and moving the api code back for this same reason: https://github.com/cds-snc/notification-api/pull/2265

P0NDER0SA commented 3 months ago

This is pushed to production. We have a flag set and it looks to be working.

P0NDER0SA commented 3 months ago

turning off for production api lambdas https://github.com/cds-snc/notification-terraform/pull/1502

jimleroyer commented 3 months ago

I am currently blocked on setting default values for twilio variables, that we would like to keep in there for a potential integration soon. I removed these in current PR if there are no ways that TG/TF cannot handle it. We will meet today to talk of how we can keep them and if not, move forward with getting rid of these for the time being.

P0NDER0SA commented 2 months ago

More documentation on Python tracing gives us some potential options to easily trace traffic through the admin, api and celery.
https://docs.aws.amazon.com/xray/latest/devguide/xray-sdk-python.html

trying out the same names and patching in the following PRs: https://github.com/cds-snc/notification-api/pull/2274 https://github.com/cds-snc/notification-admin/pull/1934

jimleroyer commented 2 months ago

The logging errors are now transformed into warnings, hence these should not trigger our alarms. The env var that we explicitly set today for the errors handling to do log warning should reinforce this strategy.

ben851 commented 2 months ago

We're running into a problem with the lambda sdk environment variable in prod, even though the environment variable is set. Will investigate.

P0NDER0SA commented 2 months ago

Adding xray patching: https://github.com/cds-snc/notification-admin/pull/1935 https://github.com/cds-snc/notification-api/pull/2276

P0NDER0SA commented 2 months ago

Adding patching to the code looks to have things working! traces are passing through from Admin to the APIs and Celery is tracing traffic through its code path

sastels commented 2 months ago

Steve will look at today for sure this time!

sastels commented 2 months ago

I swear I'll do it today

P0NDER0SA commented 2 months ago

STEVE SWEARS HE'LL LOOK TODAY