Observability and performance monitoring for delta-backend

ividito commented 2 years ago

I spent some time digging into observability patterns we could use here to help address #115. While working on #70, I learned that the enabled /statistics endpoint was taking over 2 minutes on simple requests, which seemed like a good place to start researching performance monitoring options.

Some goals I set (some of which are more achievable than others):

For every request, I should be able to track the endpoint and response code. I would also like to keep some information on user inputs (particularly IDs - which searches are being requested most frequently?)
For every request, I should know how much time is spent waiting on external services, including auth, S3, and SQL services.
I should be able to see an aggregate of the above, so that I can easily identify anomalies and likely pain points.
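To make the third goal concrete, here is a minimal sketch of the kind of per-endpoint aggregate I have in mind. It is stdlib-only and not tied to any tracing backend; the `RequestStats` class and its field names are hypothetical, and the sample numbers are illustrative rather than measured.

```python
from collections import defaultdict
from statistics import mean


class RequestStats:
    """Hypothetical in-memory aggregate of request timings, keyed by endpoint."""

    def __init__(self):
        self._durations = defaultdict(list)  # endpoint -> list of seconds
        self._errors = defaultdict(int)      # endpoint -> count of 5xx responses

    def record(self, endpoint, status_code, seconds):
        self._durations[endpoint].append(seconds)
        if status_code >= 500:
            self._errors[endpoint] += 1

    def summary(self):
        # One row per endpoint, so slow or failing endpoints stand out.
        return {
            endpoint: {
                "count": len(durations),
                "mean_seconds": mean(durations),
                "max_seconds": max(durations),
                "errors": self._errors[endpoint],
            }
            for endpoint, durations in self._durations.items()
        }


stats = RequestStats()
stats.record("/statistics", 200, 120.0)  # illustrative values only
stats.record("/statistics", 200, 130.0)
stats.record("/search", 500, 0.2)
summary = stats.summary()
```

In practice a tracing backend would compute this for us; the sketch only shows which fields each request would need to report.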

XRay

AWS XRay would be a decent out-of-the-box option, but I found that our FastAPI implementation is not well-supported.

The XRay SDK doesn't support the latest versions of psycopg, which is a blocker for tracing our PostgreSQL queries.
Async functions require special handling, which is hard to insert into our dependencies.
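One general reason async functions need separate handling (this is not specific to the XRay SDK): calling a coroutine function from a plain synchronous wrapper only creates the coroutine object, so the wrapper would finish before the real work is awaited. A minimal stdlib sketch of a timing decorator that handles both cases, where `timed` and `timings` are hypothetical names:

```python
import asyncio
import inspect
import time
from functools import wraps

timings = []  # (name, elapsed seconds); stands in for a real trace backend


def timed(name):
    def decorator(f):
        if inspect.iscoroutinefunction(f):
            @wraps(f)
            async def async_wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    # Awaiting inside the wrapper keeps the timer running
                    # until the coroutine has actually finished.
                    return await f(*args, **kwargs)
                finally:
                    timings.append((name, time.perf_counter() - start))

            return async_wrapper

        @wraps(f)
        def sync_wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return f(*args, **kwargs)
            finally:
                timings.append((name, time.perf_counter() - start))

        return sync_wrapper

    return decorator


@timed("slow_lookup")
async def slow_lookup():
    await asyncio.sleep(0.05)  # stands in for a call to an external service
    return "done"


result = asyncio.run(slow_lookup())
```

The difficulty in our case is that many of the async functions live in dependencies, so we can't simply add a decorator at the definition site.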

There are also some gaps that I wasn't able to identify solutions for. S3 interactions didn't seem to be captured by the boto instrumentation, even though SecretsManager interactions showed up fine (see below). URLs and endpoints also weren't included in the metadata for each trace. I think there's a manual instrumentation fix for the latter, but I'm not sure how to address the former.

OpenTelemetry

OpenTelemetry is another good option, but it would require some less-than-clean implementation work, as we would need to manually instrument most of our trace segments and calls to external services. The function decorator below wraps the decorated function in a traced segment. It is written against the XRay SDK's recorder, but the same pattern applies to OpenTelemetry spans, which can be analyzed in an OpenTelemetry backend. We could likely trace most of our application by applying a decorator like this at runtime to relevant functions we import from dependencies.

from functools import wraps

from aws_xray_sdk.core import xray_recorder

def trace_decorator(segmentname, annotation: str, annotation_data: str):
    def decorator(f):
        # @wraps copies f's docstring, name, and signature onto the wrapper,
        # so the decorated function keeps the same observable properties.
        @wraps(f)
        def make_xray_segment(*args, **kwargs):
            with xray_recorder.in_subsegment(segmentname):
                xray_recorder.put_annotation(annotation, annotation_data)
                return f(*args, **kwargs)

        return make_xray_segment

    return decorator
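Applying a decorator like the one above at runtime to a dependency's function could look roughly like the sketch below. To keep it runnable without AWS credentials, it uses a plain timing decorator instead of the XRay recorder; `dependency`, `fetch_item`, `traced`, and `patch` are all hypothetical stand-ins.

```python
import time
from functools import wraps
from types import SimpleNamespace

# Hypothetical stand-in for a third-party module we cannot edit.
dependency = SimpleNamespace(fetch_item=lambda item_id: {"id": item_id})

recorded = []  # (segment name, elapsed seconds); stands in for a trace backend


def traced(segment_name):
    def decorator(f):
        @wraps(f)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return f(*args, **kwargs)
            finally:
                recorded.append((segment_name, time.perf_counter() - start))

        return wrapper

    return decorator


def patch(target, attribute, segment_name):
    """Replace target.attribute with a traced version; return the original."""
    original = getattr(target, attribute)
    setattr(target, attribute, traced(segment_name)(original))
    return original  # keep a handle so the patch can be undone


original_fetch = patch(dependency, "fetch_item", "dependency.fetch_item")
result = dependency.fetch_item("abc")
```

One caveat with this approach: any code that ran `from dependency import fetch_item` before the patch was applied keeps its reference to the untraced function, so patches need to be applied early in startup.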

There's a natural development path to adopt OpenTelemetry after XRay, as OpenTelemetry traces can be sent to XRay until a dedicated backend can be created and managed. The other, more subjective downside of OpenTelemetry is that the docs are very abstract and can be difficult to get started with.

Cloudwatch

The last solution I looked into was a log-based implementation using Cloudwatch Insights. I find this to be a clumsier method: it requires fairly granular creation of logs and traced segments. Analysis would also require some dev time to produce meaningful data from our logs using Cloudwatch Insights. An initial implementation would track request time and endpoints, but accomplishing other observability goals would be fairly difficult IMO. I think this is a good way to quickly monitor a targeted subsection of the application, but it is not a great long-term monitoring strategy.

The following is a good template to get some initial FastAPI context attached to logs for requests. I would pair this with a middleware that sends a log at the start and end of a request's function call.

from fastapi import Request, Response
from fastapi.routing import APIRoute
from typing import Callable
from .utils import logger

class LoggerRouteHandler(APIRoute):
    def get_route_handler(self) -> Callable:
        original_route_handler = super().get_route_handler()

        async def route_handler(request: Request) -> Response:
            # Add fastapi context to logs
            ctx = {
                "path": request.url.path,
                "route": self.path,
                "method": request.method,
            }
            logger.append_keys(fastapi=ctx)
            logger.info("Received request")

            return await original_route_handler(request)

        return route_handler

...
from fastapi import FastAPI

from .router import LoggerRouteHandler

app = FastAPI()
app.router.route_class = LoggerRouteHandler
...
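The start/end middleware mentioned above could be sketched as a plain ASGI middleware, which needs only the standard library. The class name, logger name, and log field names here are hypothetical, and the demo app at the bottom stands in for the real application.

```python
import asyncio
import json
import logging
import time

logger = logging.getLogger("request_log")  # hypothetical logger name


class RequestLogMiddleware:
    """Plain ASGI middleware: logs when a request starts and when it ends."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # pass lifespan/websocket traffic through
            await self.app(scope, receive, send)
            return

        status = {"code": None}

        async def send_wrapper(message):
            # The status code only becomes visible when the response starts.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        ctx = {"method": scope["method"], "path": scope["path"]}
        logger.info(json.dumps({"event": "request_start", **ctx}))
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            duration_ms = round((time.perf_counter() - start) * 1000, 2)
            logger.info(json.dumps({"event": "request_end", **ctx,
                                    "status": status["code"],
                                    "duration_ms": duration_ms}))


class ListHandler(logging.Handler):
    """Collects parsed log events in memory so the demo can inspect them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        self.events.append(json.loads(record.getMessage()))


async def demo_app(scope, receive, send):
    # Minimal stand-in ASGI app that always answers 200.
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def run_demo():
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/statistics"}
    await RequestLogMiddleware(demo_app)(scope, receive, send)
    return sent


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
sent_messages = asyncio.run(run_demo())
```

With FastAPI, a class shaped like this should be registrable via `app.add_middleware(RequestLogMiddleware)`. Emitting JSON keeps the fields queryable in Cloudwatch Insights, though the exact field names we want are still up for discussion.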

Cost Comparison

XRay - Functionally free, unless we increase data retention time. Dev time needed to sort out gaps identified above.

OpenTelemetry - Cost of backend (free-ish with XRay, cost of EC2 + dev time with an OT backend). Risk of bugs introduced by tracing dependency functions at runtime.

Log Insights - cost of log retention, analysis comes with some cost, cost of dev time to log all relevant/needed features

Overall, I would really like to get XRay working well. A lot of the work we would need to do (especially considering our dependencies) would be required to get OpenTelemetry working regardless. I've already laid the groundwork: our delta-backend dev environment is currently being minimally traced (code for this is in #125). I think that log insights are best used as a fallback, and should be used to patch holes in our XRay monitoring.

Possible Action Items

To get better XRay tracing, we would need to develop instrumentation that can be inserted into our current (newer) version of Psycopg. We could take inspiration from the existing SDK for this, since it contains patchers for older SQL libraries. This could be something that we open-source as well, if we decide to do a really good job with it.
To get a targeted, lower-effort monitoring solution in place, we should implement basic logs on our endpoints, and discuss areas for more detailed logs to be inserted. This discussion should also consider which trace features are mission-critical, as each feature traced in this way could make it more difficult to analyze our logs effectively.
I would like to get #70 out the door without being blocked by this, so I'll be writing up a similar document focused on local profiling using yappi and snakeviz. These can be used for targeted local performance profiling.
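For the first action item, one possible shape for query instrumentation is a cursor subclass that times each `execute` call. The sketch below uses the standard library's `sqlite3` as a stand-in so that it runs anywhere; whether the same approach fits our Psycopg version (for example through a cursor factory) is an assumption that would need checking, and `TimedCursor` and `query_log` are hypothetical names.

```python
import sqlite3
import time

query_log = []  # (sql, elapsed seconds); stands in for trace subsegments


class TimedCursor(sqlite3.Cursor):
    """Cursor that records how long each execute() call takes."""

    def execute(self, sql, parameters=()):
        start = time.perf_counter()
        try:
            return super().execute(sql, parameters)
        finally:
            # Log the SQL text only, never the parameters, to avoid
            # leaking user input into traces.
            query_log.append((sql, time.perf_counter() - start))


conn = sqlite3.connect(":memory:")
cur = conn.cursor(TimedCursor)  # ask the connection for our cursor class
cur.execute("CREATE TABLE items (id TEXT)")
cur.execute("INSERT INTO items VALUES (?)", ("abc",))
rows = cur.execute("SELECT id FROM items").fetchall()
conn.close()
```

A real patcher would open a trace subsegment instead of appending to a list, and would also need to cover `executemany` and the async cursor variants.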

ividito commented 2 years ago

cc: @anayeaye @moradology Would love to get your thoughts - any glaring holes or details I missed?

leothomas commented 2 years ago

This is a really interesting write-up @ividito!

A lot of our projects have APIs built with FastAPI running on Lambda with an API Gateway integration, and we don't really have a standardized approach to logging and performance monitoring.

If you come up with a solution that is easy to implement in a repeatable pattern and carries a low risk of introducing bugs/complexity, it would be awesome if you were able to write up a dev log so that we can implement a similar solution across different projects!

NASA-IMPACT / veda-backend

Observability and performance monitoring for delta-backend #126