The basic middleware code has been added. The X-Ray daemon piece is still awaiting information. More work is required.
GDS has an existing utility class for tracing request IDs through the application, starting from the admin. We can leverage it and adapt it for X-Ray (it was originally written for Twitter's Zipkin): https://github.com/cds-snc/notification-utils/blob/623d109f9145996d7f32eacdada190384d8a5ffa/notifications_utils/request_helper.py
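A minimal sketch of the pattern that utility implements, assuming Flask hooks and a Zipkin-style B3 header (the header name and function name here are illustrative, not the actual notifications-utils code):

```python
from flask import Flask, g, request

TRACE_ID_HEADER = "X-B3-TraceId"  # Zipkin-style header; illustrative name


def init_request_tracing(app: Flask) -> None:
    """Capture an incoming trace ID and echo it on the response so the
    same ID can be followed from the admin through to downstream services."""

    @app.before_request
    def read_trace_id() -> None:
        # Stash the caller's trace ID (if any) for handlers and log lines.
        g.trace_id = request.headers.get(TRACE_ID_HEADER)

    @app.after_request
    def write_trace_id(response):
        if g.get("trace_id"):
            response.headers[TRACE_ID_HEADER] = g.trace_id
        return response
```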
Jimmy Royer :spiral_calendar_pad: 1 hour ago https://flask.palletsprojects.com/en/2.3.x/patterns/celery/ You can use Celery without any integration with Flask, but it’s convenient to configure it through Flask’s config, and to let tasks access the Flask application. Celery uses similar ideas to Flask, with a Celery app object that has configuration and registers tasks. While creating a Flask app, use the following code to create and configure a Celery app as well.
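The code the docs refer to is this factory pattern (from the linked Flask 2.3 Celery page): every task is wrapped so it runs inside the Flask application context and can read Flask config and extensions.

```python
from celery import Celery, Task
from flask import Flask


def celery_init_app(app: Flask) -> Celery:
    # Wrap every task so it executes inside the Flask application context,
    # giving task code access to app config, extensions, etc.
    class FlaskTask(Task):
        def __call__(self, *args: object, **kwargs: object) -> object:
            with app.app_context():
                return self.run(*args, **kwargs)

    celery_app = Celery(app.name, task_cls=FlaskTask)
    celery_app.config_from_object(app.config["CELERY"])
    celery_app.set_default()
    app.extensions["celery"] = celery_app
    return celery_app
```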
Jimmy Royer :spiral_calendar_pad: 1 hour ago For a distributed application, X-Ray combines nodes from all services that process requests with the same trace ID into a single service graph. The first service that the request hits adds a tracing header that is propagated between the front end and services that it calls. For example, Scorekeep runs a web API that calls a microservice (an AWS Lambda function) to generate a random name by using a Node.js library. The X-Ray SDK for Java generates the trace ID and includes it in calls to Lambda. Lambda sends tracing data and passes the trace ID to the function. The X-Ray SDK for Node.js also uses the trace ID to send data. As a result, nodes for the API, the Lambda service, and the Lambda function all appear as separate, but connected, nodes on the trace map. Service graph data is retained for 30 days.
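Concretely, the header being propagated between services is X-Amzn-Trace-Id; a sketch of its shape (the ID values below are invented examples):

```python
# Root is the trace ID shared by every node in the service graph; Parent is
# the calling segment's ID; Sampled records the sampling decision.
TRACE_HEADER = "X-Amzn-Trace-Id"
EXAMPLE = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
```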
Next steps:
Potential fix for the Celery errors, based on an analysis of the enabled log statements: https://github.com/cds-snc/notification-api/pull/2259
The rationale for the fix is in the PR description.
Looking to investigate and work on solutions for easy switches to turn X-Ray on/off per component, or as a whole.
I investigated the logs around X-Ray further. There is a missing piece I don't get at the moment. The newly registered signal for beat workers is wired to the xray_task_prerun handler function, and it works. But it is not called for all periodic tasks, and for those where it is not called, xray_before_task_publish is called instead, which causes the errors. What I would normally expect is xray_task_prerun -> xray_before_task_publish, but that is not happening. I will take a closer look tomorrow at how that workflow executes for non-periodic tasks, then find the differentiator between the periodic tasks that execute one handler function and those that execute the other, hopefully so that we can execute both, or maybe bring in new logic that supplants the previous one and is tailored for periodic tasks. (A sketch of how these handlers hang off Celery's signals follows below.)
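For reference, a rough sketch of how the two handlers hang off Celery's signals. The handler bodies are guesses, not the actual notification-api code, and the "x_trace_id" header key is made up; the relevant point is that the two signals fire in different processes:

```python
from aws_xray_sdk.core import xray_recorder
from celery.signals import before_task_publish, task_prerun


@before_task_publish.connect
def xray_before_task_publish(headers=None, **kwargs):
    # Fires in the *publishing* process (the beat scheduler, for periodic
    # tasks) when the task message is enqueued. If beat has no open X-Ray
    # segment, asking the recorder for one is what raises the errors.
    segment = xray_recorder.current_segment()
    if segment is not None and headers is not None:
        headers["x_trace_id"] = segment.trace_id  # illustrative header key


@task_prerun.connect
def xray_task_prerun(task_id=None, task=None, **kwargs):
    # Fires in the *worker* process just before the task body runs, so for
    # a beat-scheduled task it can only ever run after before_task_publish.
    xray_recorder.begin_segment(name=task.name)
```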
We got the switch merged into the mainline. We will polish the feature a bit.
I'm going to tweak the switch code to use a boolean library in api, then add the same code to admin and doc download. Then I'm going to modify the .env files to set prod to false and staging to true.
Updating everything to use the AWS_XRAY_SDK_ENABLED flag that's built into the X-Ray SDK rather than our custom switches! https://github.com/cds-snc/notification-manifests/pull/2876
and moving the api code back for this same reason: https://github.com/cds-snc/notification-api/pull/2265
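For reference, the built-in kill switch: the SDK honours the AWS_XRAY_SDK_ENABLED environment variable and also exposes the same toggle programmatically via global_sdk_config. A small sketch:

```python
import os

from aws_xray_sdk import global_sdk_config

# Setting the variable to "false" disables the whole SDK: no segments are
# created and nothing is sent to the daemon. In the manifests this lives in
# the pod environment; it is set here only to show the accepted value.
os.environ["AWS_XRAY_SDK_ENABLED"] = "false"

# The same switch is available in code, which is handy for tests.
global_sdk_config.set_sdk_enabled(False)
print(global_sdk_config.sdk_enabled())  # False
```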
This is pushed to production. We have a flag set and it looks to be working.
Turning X-Ray off for the production API lambdas: https://github.com/cds-snc/notification-terraform/pull/1502
I am currently blocked on setting default values for the Twilio variables, which we would like to keep in there for a potential integration soon. I removed these in the current PR in case there is no way for TG/TF to handle it. We will meet today to talk about how we can keep them and, if we can't, move forward with getting rid of them for the time being.
More documentation on Python tracing gives us some potential options for easily tracing traffic through the admin, api and Celery.
https://docs.aws.amazon.com/xray/latest/devguide/xray-sdk-python.html
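From those docs, the two pieces we'd combine are the Flask middleware for incoming requests and patch_all() for outgoing calls. A minimal sketch (the service name is illustrative):

```python
from aws_xray_sdk.core import patch_all, xray_recorder
from aws_xray_sdk.ext.flask.middleware import XRayMiddleware
from flask import Flask

app = Flask(__name__)

# Name the service so admin, api and Celery show up as distinct nodes
# on the trace map ("notification-api" here is an illustrative name).
xray_recorder.configure(service="notification-api")

# Trace incoming Flask requests...
XRayMiddleware(app, xray_recorder)

# ...and patch supported libraries (requests, sqlalchemy, boto3, etc.) so
# outgoing calls are recorded as subsegments of the current trace.
patch_all()
```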
Trying out the same names and patching in the following PRs: https://github.com/cds-snc/notification-api/pull/2274 https://github.com/cds-snc/notification-admin/pull/1934
The logging errors are now demoted to warnings, so they should no longer trigger our alarms. The env var we explicitly set today to make the error handling log a warning should reinforce this strategy.
We're running into a problem with the Lambda SDK environment variable in prod, even though the variable is set. Will investigate.
Adding patching to the code looks to have things working! Traces are passing through from Admin to the APIs, and Celery is tracing traffic through its code path.
Steve will look at it today for sure this time!
I swear I'll do it today
STEVE SWEARS HE'LL LOOK TODAY
Description
As an ops lead, I want to link all requests going through the Notify stack that were triggered by the same user / same origin IP, so that I can quickly analyze their behavior and actions on our stack.
WHY are we building?
Better tracing support through our stack.
WHAT are we building?
Implementing correlation / trace IDs through the various Notify components, potentially using AWS X-Ray.
VALUE created by our solution
Quicker support and incident response. Better routing / behavior analysis of the requests made against Notify.
Acceptance Criteria (Definition of done)
QA Steps
Additional info
Celery integration with X-Ray:
The one area I am uncertain about with X-Ray is how we will plug it into Celery. This and that seem interesting.