Investigate adding logging and observability tooling

MgenGlder commented 6 months ago

Summary

When the service is up and running, we'll need a good way to monitor its behavior and determine if it ever becomes unhealthy. Some potential offerings that might support this are Splunk for logging and DataDog, Graphite, or Grafana for observability and monitoring. Open to other suggestions.

MgenGlder commented 6 months ago

I gave this a bit of thought and I considered:

Graphite, Grafana, and Prometheus are generally used together as a bundle for observability and data collection.
- Generally, the developer needs to set up and maintain the infrastructure for all of these open source offerings, so it's more work both to get up and running and to keep running.
DataDog comes batteries included, with an agent, storage, querying, and a dashboard all in one. Its main downside is that it is decent at most things but not particularly stellar at anything.
- This might be OK for our purposes because we don't have any super specific observability/visualization needs.

As I've researched this a bit, I've also discovered that there are many alternatives to Splunk, including DataDog and Grafana. Many of our data visualization tools now come with log management built in, so it is no longer necessary to have an entirely separate service to handle logs. I'll need to investigate how these work to determine viability, but we may be able to rely on just one of these offerings for all of our monitoring needs.

ctcarrier commented 6 months ago

@MgenGlder I've used Grafana Cloud before, which is their managed service, so there's less to set up compared to self-hosting. Looks like they have a free tier.

It looks like DataDog also has a free tier. I've never used DataDog before so am interested for my own learning.

Does the host we are using provide some kind of plugins for integrating with either platform? For Prometheus at least the default is a pull-based system where you expose your metrics on an endpoint that is polled. They also have operators that will push data. But either of these options requires something running on the infrastructure to expose the metrics. I assume DataDog works somewhat similarly.
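To make the pull model concrete, here is a minimal sketch using only the Python standard library. It is not the official Prometheus client; it just hand-writes the text exposition format that a scraper polls. The counter names, the `serve` helper, and port 9100 are all assumptions made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real app would update these while handling requests.
COUNTERS = {"app_requests_total": 0, "app_errors_total": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; the scraper polls this path on an interval.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9100):
    """Block forever serving /metrics. The port is an arbitrary choice for this sketch."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and then fetching `http://localhost:9100/metrics` should return one `# TYPE` line and one sample line per counter. In practice a client library or an auto-instrumentation agent would likely do this for us.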

Another aspect of this that could be interesting is instrumenting the code itself using something like OpenTelemetry. This can provide much more granular information on how the code is performing. Seems like there are integrations for Python/Django:

https://opentelemetry.io/docs/languages/python/automatic/

MgenGlder commented 6 months ago

> Another aspect of this that could be interesting is instrumenting the code itself using something like OpenTelemetry. This can provide much more granular information on how the code is performing. Seems like there are integrations for Python/Django:

@ctcarrier This is exactly what I was thinking as well. I'm in support of using OTel to help with instrumenting fine-grained metrics/observability.

> It looks like DataDog also has a free tier. I've never used DataDog before so am interested for my own learning.

I'd be happy to give DataDog a shot. We could also set up a few quick prototypes, experiment with both the Grafana and DataDog stacks, and see which we like best.

> Does the host we are using provide some kind of plugins for integrating with either platform? For Prometheus at least the default is a pull-based system where you expose your metrics on an endpoint that is polled. They also have operators that will push data. But either of these options requires something running on the infrastructure to expose the metrics. I assume DataDog works somewhat similarly.

My apologies, we haven't documented our hosting service anywhere yet. We're using fly.io. It looks like they do have integrations for DataDog and the Prometheus stack. I haven't had much time to delve into the implementation details of pull vs. push, but my hunch is that it will be push-based (i.e. we push metrics through an agent of some sort).
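For comparison, a rough sketch of what the push side could look like, assuming a StatsD-compatible agent (for example DataDog's DogStatsD) is listening on UDP port 8125 next to the app. The metric names, tags, and the `push_metric` helper are hypothetical, and whether our fly.io setup actually works this way is still unverified.

```python
import socket


def statsd_counter(name, value=1, tags=None):
    """Format a counter increment as a StatsD datagram, with optional DogStatsD-style tags."""
    line = f"{name}:{value}|c"
    if tags:
        # DogStatsD appends tags as |#key:value,key:value
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def push_metric(line, host="127.0.0.1", port=8125):
    """Send one datagram over UDP. Fire-and-forget: no reply is expected from the agent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

Something like `push_metric(statsd_counter("app.requests", tags={"env": "dev"}))` would then emit one increment per call. The agent, not the app, is responsible for aggregating and forwarding it.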

MgenGlder commented 6 months ago

It looks like fly.io comes with Grafana integration by default. This includes logging, metrics tracking, etc. I'm not sure if we'll need anything else but this at least is something.

dearborn-coding-club / website-base-backend

Investigate adding logging and observability tooling #13
