NASA-IMPACT / veda-backend

Backend services for VEDA

Observability and performance monitoring for delta-backend #126

Open ividito opened 2 years ago

ividito commented 2 years ago

I spent some time digging into observability patterns we could use here to help address #115. While working on #70, I learned that the /statistics endpoint, once enabled, was taking over 2 minutes on simple requests, which seemed like a good place to start researching performance monitoring options.

Some goals I set (some of which are more achievable than others)

XRay

AWS XRay would be a decent out-of-the-box option, but I found that our FastAPI implementation is not well-supported.

There are also some gaps I wasn't able to find solutions for. S3 interactions didn't seem to be captured by the boto instrumentation, even though SecretsManager interactions showed up fine (see below). URLs and endpoints also weren't included in the metadata for each trace. I think there's a manual instrumentation fix for the latter, but I'm not sure how to address the former.

image
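
For the missing URL/endpoint metadata, a manual fix could look roughly like the sketch below. It assumes the target object exposes `put_http_meta` and `put_annotation` the way `aws_xray_sdk` segments and subsegments do; the helper name `record_request_metadata` and the `route` annotation key are hypothetical, and this has not been run against our deployment.

```python
from typing import Any


def record_request_metadata(entity: Any, method: str, url: str, route: str) -> None:
    """Attach request details to an XRay segment-like object.

    `entity` is duck-typed: anything with put_http_meta(key, value) and
    put_annotation(key, value) works. This helper is hypothetical.
    """
    # "method" and "url" are stored with the trace's HTTP request metadata.
    entity.put_http_meta("method", method)
    entity.put_http_meta("url", url)
    # Annotations are indexed, so traces can be filtered by route template.
    entity.put_annotation("route", route)
```

In a FastAPI app this would be called from a middleware or route handler with something like `request.method` and `str(request.url)`. On Lambda the top-level segment is managed by the service and may not be mutable from application code, in which case a subsegment would likely need to be the target instead.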

OpenTelemetry

OpenTelemetry is another good option, but it would require some less-than-clean implementation work, as we would need to manually instrument most of our trace segments and calls to external services. The following function decorator wraps the decorated function in a traced subsegment. It is written against the XRay SDK's recorder, but the same pattern would apply with an OpenTelemetry tracer, with the resulting traces analyzed in an OpenTelemetry backend. We could likely trace most of our application by applying a decorator like this at runtime to relevant functions we import from dependencies.

from functools import wraps

from aws_xray_sdk.core import xray_recorder

def trace_decorator(segmentname: str, annotation: str, annotation_data: str):
    def decorator(f):
        # wraps() copies f's docstring and signature onto the wrapper, so the
        # decorated function keeps the same observable properties
        @wraps(f)
        def make_xray_segment(*args, **kwargs):
            # Run f inside a named subsegment and tag it with one annotation
            with xray_recorder.in_subsegment(segmentname):
                xray_recorder.put_annotation(annotation, annotation_data)
                return f(*args, **kwargs)

        return make_xray_segment

    return decorator

There's a natural development path to adopt OpenTelemetry after XRay, as OpenTelemetry traces can be sent to XRay until a dedicated backend can be created and managed. The other, more subjective downside of OpenTelemetry is that the docs are very abstract and can be difficult to get started with.
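
To make the "apply the decorator at runtime" idea concrete, here is a minimal sketch using only the standard library. `timed`, `patch_function`, and `TIMINGS` are hypothetical names; `timed` stands in for a real tracing decorator such as `trace_decorator`, and `statistics.mean` stands in for a dependency function we would want traced.

```python
import importlib
import statistics
import time
from functools import wraps
from typing import Any, Callable, Dict, List

# Collected timings; a real setup would emit trace segments instead.
TIMINGS: List[Dict[str, Any]] = []


def timed(name: str) -> Callable:
    """Stand-in tracing decorator that records how long each call takes."""
    def decorator(f: Callable) -> Callable:
        @wraps(f)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return f(*args, **kwargs)
            finally:
                TIMINGS.append({"name": name, "seconds": time.perf_counter() - start})
        return wrapper
    return decorator


def patch_function(module_name: str, attr: str, decorator: Callable) -> Callable:
    """Replace module.attr with a decorated version and return the original."""
    module = importlib.import_module(module_name)
    original = getattr(module, attr)
    setattr(module, attr, decorator(original))
    return original


# Demo: trace a function we do not own, call it, then undo the patch.
original_mean = patch_function("statistics", "mean", timed("statistics.mean"))
statistics.mean([1, 2, 3])        # goes through the wrapper
statistics.mean = original_mean   # restore the unpatched function
```

One caveat with this approach: only callers that look up the attribute on the module at call time see the patch. Code that ran `from module import func` before patching keeps a reference to the original, which is one way runtime patching can silently miss calls.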

CloudWatch

The last solution I looked into was a log-based implementation using CloudWatch Insights. I find this to be a clumsier method - it requires fairly granular creation of logs and traced segments. Analysis would also require some dev time to produce meaningful data from our logs using CloudWatch Insights. An initial implementation would track request time and endpoints, but accomplishing other observability goals would be fairly difficult IMO. I think this is a good way to quickly monitor a targeted subsection of the application, but it is not a great long-term monitoring strategy.

The following is a good template to get some initial FastAPI context attached to logs for requests. I would pair this with a middleware that sends a log at the start and end of a request's function call.

from fastapi import Request, Response
from fastapi.routing import APIRoute
from typing import Callable
from .utils import logger

class LoggerRouteHandler(APIRoute):
    def get_route_handler(self) -> Callable:
        original_route_handler = super().get_route_handler()

        async def route_handler(request: Request) -> Response:
            # Add fastapi context to logs
            ctx = {
                "path": request.url.path,
                "route": self.path,
                "method": request.method,
            }
            logger.append_keys(fastapi=ctx)
            logger.info("Received request")

            return await original_route_handler(request)

        return route_handler
...
from fastapi import FastAPI

from .router import LoggerRouteHandler

app = FastAPI()
app.router.route_class = LoggerRouteHandler
...
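
The start/end logging middleware mentioned above could be sketched as a pure ASGI middleware, as below. This is an illustration under assumptions rather than tested project code: `RequestTimingMiddleware` is a hypothetical name, and the standard-library `logging` logger stands in for the project's own `logger`.

```python
import logging
import time
from typing import Any, Awaitable, Callable, Dict

# Stand-in for the project's logger (assumption).
logger = logging.getLogger("request_timing")


class RequestTimingMiddleware:
    """Pure ASGI middleware that logs when an HTTP request starts and ends."""

    def __init__(self, app: Callable[..., Awaitable[None]]) -> None:
        self.app = app

    async def __call__(self, scope: Dict[str, Any], receive: Callable, send: Callable) -> None:
        if scope["type"] != "http":
            # Pass lifespan and websocket traffic through untouched.
            await self.app(scope, receive, send)
            return

        status: Dict[str, Any] = {"code": None}

        async def send_wrapper(message: Dict[str, Any]) -> None:
            # Capture the status code as the response starts.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        ctx = {"method": scope.get("method"), "path": scope.get("path")}
        logger.info("Request started", extra={"request": ctx})
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            # Log the end even if the handler raised.
            duration_ms = round((time.perf_counter() - start) * 1000, 2)
            logger.info(
                "Request finished",
                extra={"request": {**ctx, "status": status["code"], "duration_ms": duration_ms}},
            )
```

With FastAPI this would be registered via `app.add_middleware(RequestTimingMiddleware)`. The `duration_ms` field is what a later log query would aggregate per path.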

Cost Comparison

XRay - Functionally free, unless we increase data retention time. Dev time needed to sort out gaps identified above.

OpenTelemetry - Cost of backend (free-ish with XRay, cost of EC2 + dev time with an OT backend). Risk of bugs introduced by tracing dependency functions at runtime.

Log Insights - Cost of log retention, some cost for running analysis queries, and dev time to log all relevant/needed features.

Overall, I would really like to get XRay working well. A lot of the work we would need to do (especially considering our dependencies) would be required to get OpenTelemetry working regardless. I've already laid the groundwork: our delta-backend dev environment is currently being minimally traced (code for this is in #125). I think that log insights are best used as a fallback, and should be used to patch holes in our XRay monitoring.

Possible Action Items

ividito commented 2 years ago

cc: @anayeaye @moradology Would love to get your thoughts - any glaring holes or details I missed?

leothomas commented 2 years ago

This is a really interesting write up @ividito!

A lot of our projects have APIs built with FastAPI running on Lambda with an API Gateway integration, and we don't really have a standardized approach to logging and performance monitoring.

If you come up with a solution that is easy to implement in a repeatable pattern and carries a low risk of introducing bugs/complexity, it would be awesome if you were able to write up a dev log so that we can implement a similar solution across different projects!