Open ividito opened 2 years ago
cc: @anayeaye @moradology Would love to get your thoughts - any glaring holes or details I missed?
This is a really interesting write up @ividito!
A lot of our projects have APIs built with FastAPI running on Lambda with an APIGateway integration, and we don't really have a standardized approach to logging and performance monitoring.
If you come up with a solution that is easy to implement in a repeatable pattern and carries a low risk of introducing bugs/complexity, it would be awesome if you could write up a dev log so that we can implement a similar solution across different projects!
I spent some time digging into observability patterns we could use here to help address #115. While working on #70, I learned that the enabled `/statistics` endpoint was taking over 2 minutes on simple requests, which seemed like a good place to start researching performance monitoring options.
Some goals I set (some of which are more achievable than others)
XRay
AWS XRay would be a decent out-of-the-box option, but I found that our FastAPI implementation is not well-supported.
`psycopg`, which is a blocker for tracing our PostgreSQL queries.
There are also some gaps that I wasn't able to identify solutions for. S3 interactions didn't seem to be captured by the `boto` instrumentation, even though SecretsManager interactions showed up fine (see below). URLs and endpoints also weren't included in the metadata for each trace. I think there's a manual instrumentation fix for the latter, but I'm not sure how to address the former.
OpenTelemetry
OpenTelemetry is another good option, but would require some less-than-clean implementations, as we would need to manually instrument most of our trace segments and calls to external services. A function decorator can wrap the decorated function in a traced segment, which can then be analyzed in an OpenTelemetry backend. We could likely trace most of our application by applying such a decorator at runtime to relevant functions we import from dependencies.
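A minimal sketch of such a decorator is below. The names `traced`, `compute_statistics`, and `fake_dependency` are hypothetical and do not come from the original write-up. The sketch assumes the `opentelemetry-api` package and degrades to a no-op when it is not installed. It only handles synchronous functions; an `async` function would need an `async` wrapper so the span stays open while the coroutine runs.

```python
import functools
import types

try:
    from opentelemetry import trace  # third-party: opentelemetry-api
except ImportError:
    trace = None  # tracing becomes a no-op so the sketch still runs


def traced(span_name=None):
    """Wrap a synchronous function so that each call runs inside a span."""

    def decorator(func):
        name = span_name or func.__qualname__

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if trace is None:
                return func(*args, **kwargs)
            tracer = trace.get_tracer(func.__module__)
            with tracer.start_as_current_span(name):
                return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical function of our own, decorated at definition time.
@traced("compute_statistics")
def compute_statistics(values):
    return {"min": min(values), "max": max(values)}


# Wrapping a dependency's function at runtime. A SimpleNamespace stands in
# for an imported third-party module here (an assumption for illustration).
fake_dependency = types.SimpleNamespace(load=lambda key: f"data for {key}")
fake_dependency.load = traced("fake_dependency.load")(fake_dependency.load)
```

Without an SDK and exporter configured, the OpenTelemetry API produces non-recording spans, so a backend still has to be set up before anything shows up.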
There's a natural development path to adopt OpenTelemetry after XRay, as OpenTelemetry traces can be sent to XRay until a dedicated backend can be created and managed. The other, more subjective downside of OpenTelemetry is that the docs are very abstract and can be difficult to get started with.
Cloudwatch
The last solution I looked into was a log-based implementation using Cloudwatch Insights. I find this to be a clumsier method: it requires fairly granular creation of logs and traced segments. Analysis would also require some dev time to produce meaningful data from our logs using Cloudwatch Insights. An initial implementation would track request time and endpoints, but accomplishing other observability goals would be fairly difficult IMO. I think this is a good way to quickly monitor a targeted subsection of the application, but it is not a great long-term monitoring strategy.
A good starting template attaches some initial FastAPI context to the logs for each request. I would pair this with a middleware that sends a log at the start and end of a request's function call.
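One way this could look is sketched below, using only the standard library. The names `request_ctx`, `JsonContextFormatter`, `RequestLogMiddleware`, and the `api.requests` logger are all hypothetical. The middleware is written as plain ASGI, on the assumption that it would be registered with something like `app.add_middleware(RequestLogMiddleware)`. A stub app stands in for FastAPI so the sketch runs on its own.

```python
import asyncio
import contextvars
import io
import json
import logging
import time
import uuid

# Per-request context; contextvars keeps concurrent requests separate.
request_ctx = contextvars.ContextVar("request_ctx", default=None)


class JsonContextFormatter(logging.Formatter):
    """Render each record as one JSON line, merged with the request context."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(request_ctx.get() or {})
        if hasattr(record, "duration_ms"):
            payload["duration_ms"] = record.duration_ms
        return json.dumps(payload)


class RequestLogMiddleware:
    """Plain ASGI middleware: logs at the start and end of each HTTP request."""

    def __init__(self, app, logger=None):
        self.app = app
        self.logger = logger or logging.getLogger("api.requests")

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":  # pass lifespan/websocket events through
            await self.app(scope, receive, send)
            return
        token = request_ctx.set({
            "request_id": str(uuid.uuid4()),
            "method": scope["method"],
            "path": scope["path"],
        })
        start = time.perf_counter()
        self.logger.info("request started")
        try:
            await self.app(scope, receive, send)
        finally:
            duration_ms = round((time.perf_counter() - start) * 1000, 2)
            self.logger.info("request finished", extra={"duration_ms": duration_ms})
            request_ctx.reset(token)


# Demo: a stub ASGI app stands in for the real FastAPI application.
async def stub_app(scope, receive, send):
    logging.getLogger("api.requests").info("handling")

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonContextFormatter())
log = logging.getLogger("api.requests")
log.addHandler(handler)
log.setLevel(logging.INFO)

scope = {"type": "http", "method": "GET", "path": "/statistics"}
asyncio.run(RequestLogMiddleware(stub_app)(scope, None, None))
lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Every log line emitted during a request, including those from the handler itself, carries the same `request_id`, `method`, and `path`. Writing one JSON object per line is a choice made here so that fields such as `path` and `duration_ms` could be queried in Cloudwatch Insights; that assumes each log event reaches Cloudwatch as plain JSON.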
Cost Comparison
XRay - Functionally free, unless we increase data retention time. Dev time needed to sort out gaps identified above.
OpenTelemetry - Cost of backend (free-ish with XRay, cost of EC2 + dev time with an OT backend). Risk of bugs introduced by tracing dependency functions at runtime.
Log Insights - cost of log retention, analysis comes with some cost, cost of dev time to log all relevant/needed features
Overall, I would really like to get XRay working well. A lot of the work we would need to do (especially considering our dependencies) would be required to get OpenTelemetry working regardless. I've already laid the groundwork: our `delta-backend` dev environment is currently being minimally traced (code for this is in #125). I think that log insights are best used as a fallback, and should be used to patch holes in our XRay monitoring.
Possible Action Items
`yappi` and `snakeviz`. These can be used for targeted local performance profiling.
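As a rough illustration of targeted local profiling, the sketch below swaps in the standard library's `cProfile` in place of `yappi` so that it runs without extra dependencies. It writes a stats file in the pstats format, which `snakeviz` can open. The function `slow_sum` and the file name `statistics.prof` are hypothetical stand-ins for a slow request handler and its profile output.

```python
import cProfile
import os
import pstats
import tempfile


def slow_sum(n):
    """Hypothetical hot spot standing in for a slow request handler."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Write stats in pstats format, which snakeviz can open.
profile_path = os.path.join(tempfile.gettempdir(), "statistics.prof")
profiler.dump_stats(profile_path)

# Print the five most expensive calls by cumulative time.
pstats.Stats(profile_path).sort_stats("cumulative").print_stats(5)
```

The resulting file can then be explored in the browser by running `snakeviz` against that path. `cProfile` measures the thread it is enabled in and is not aware of coroutines, so it may be a poor fit for async handlers.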