freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Investigate prometheus for docker monitoring #942

Open mlissner opened 5 years ago

mlissner commented 5 years ago

I barely know what this is, but it seems useful: https://prometheus.io/

mlissner commented 5 years ago

I looked into this a good bit yesterday since I killed the new computer and couldn't do much else:

  1. Prometheus works via "scraping" (their word, not mine) various resources, but I think what this really means is that it pulls from various data sources, whether that's CPU monitoring or running a query on the DB.

  2. Using the data from the above, prometheus creates pretty basic graphs and alerts. It has its own query language that you have to use to make either of these things (screw that, JFC). Alerts can go to Slack, PagerDuty, email, etc.

  3. Getting better graphs or more data sources ingested means using add-ons. On the plus side, these bring lots of flexibility and power. On the minus side, I don't love having so many different places I'm reliant on for functionality. Add-ons for data ingestion, managed by outside third parties, feel lame and fragile.

  4. This is installable via Docker, and easily tested that way, but I couldn't easily figure out or deduce how ① a docker container could get stats about the host system, or ② how you'd wind up dockerizing all the various add-ons. Do you make your own docker image that has them all? Who knows. Sounds like a PITA to manage.
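To make the "scraping" in point 1 concrete, here is a minimal sketch of a scrape target, using only the Python standard library. Prometheus periodically sends an HTTP GET to a target's `/metrics` path and parses the plain-text response. The metric name `demo_up`, the `collect()` helper, and port 8000 are assumptions made up for illustration; a real setup would more likely use an existing exporter or client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples):
    """Render a {name: value} dict as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(samples.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def collect():
    """Hypothetical collector; a real one would read CPU stats, run a DB query, etc."""
    return {"demo_up": 1}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the /metrics path is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (assumed) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at `host:8000` would be enough for the value to show up; the pull direction is the key point, since the target never pushes anything itself.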

So in conclusion:

Compared to munin, I like the alerts and I like the improved graphs. For example it gives you 99th percentile calcs, which munin has failed to implement for...a decade or something, I forget.
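For context on the percentile calcs: Prometheus estimates quantiles from cumulative histogram buckets using linear interpolation inside the matching bucket (this is what its `histogram_quantile` function does). The sketch below is a simplified Python rendition of that idea, not Prometheus's actual code; the function name and the sample bucket values are made up for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with a (math.inf, total_count) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Landed in the open-ended bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                # Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return upper
            # Linear interpolation between the bucket's lower and upper bounds.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical request-latency buckets, in seconds.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p99 = estimate_quantile(0.99, latency_buckets)
```

The result is only as precise as the bucket boundaries, so the choice of buckets matters more than the arithmetic.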

A good next step will be to sort out installation of add-ons and to compare the add-ons that are available against what munin gives us.

mlissner commented 5 years ago

I took another look at this today. People seem enthusiastic, but, like, um, I don't get it. Let's say we want to monitor Celery. Fine, so Celery itself exports its stats to Redis as it runs. But prometheus can't read that information, so you need a shim-thing, like this one:

https://github.com/zerok/celery-prometheus-exporter or this one (a fork?): https://github.com/Bahus/celery-prometheus-exporter

Fine. So now we have celery sending info to redis and this thing providing a little REST API that reads that data and serves it in a format that prometheus expects. But prometheus still doesn't know that celery even exists, so our last step is to wire prometheus's scrapers up to gathering this data from the "celery exporter."

So to learn about celery, we store data in redis as we process tasks. Prometheus then scrapes the celery data exporter, and the celery data exporter scrapes data from redis. That's a lot of baling wire holding things together.
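The middle link in that chain can be sketched roughly as below: an exporter reads something out of Redis and re-renders it in the text format Prometheus scrapes. This is a simplified illustration, not how either linked exporter works. It assumes Celery's Redis broker, where pending tasks sit in a Redis list named after the queue, and it uses a `FakeRedis` stand-in (hypothetical, mimicking only the real client's `llen` call) so it runs without a server. The metric name `demo_celery_queue_length` is also made up.

```python
class FakeRedis:
    """Stand-in for a Redis client; only llen() is mimicked."""

    def __init__(self, lists):
        self._lists = lists

    def llen(self, key):
        # Like Redis, a missing key counts as an empty list.
        return len(self._lists.get(key, []))


def queue_depth_metrics(client, queues):
    """Render the pending-task count per queue as Prometheus text lines."""
    lines = ["# TYPE demo_celery_queue_length gauge"]
    for queue in queues:
        depth = client.llen(queue)
        lines.append(f'demo_celery_queue_length{{queue="{queue}"}} {depth}')
    return "\n".join(lines) + "\n"


# Three pending tasks on the default "celery" queue, none on "batch".
client = FakeRedis({"celery": ["task-1", "task-2", "task-3"]})
print(queue_depth_metrics(client, ["celery", "batch"]))
```

Swapping `FakeRedis` for a real client and serving the returned text over HTTP would complete the chain; an alert could then be written against the queue length staying high.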

I just want alerts when celery is having an issue. Is this really the best way?

mlissner commented 5 years ago

Just looked into this for Solr. Same deal as for Celery, but it also reminds me that Grafana is needed, and the Solr exporter is built into Solr itself as a contrib module. So...no idea how to make it available for our old version of Solr.

Here's the docs for it: https://lucene.apache.org/solr/guide/7_3/monitoring-solr-with-prometheus-and-grafana.html

mlissner commented 5 years ago

I'd heard good things about prometheus from a few people, so I wanted to give it a solid shot. I took a look at monitoring hardware. That ought to be pretty simple, right? Nah: https://github.com/prometheus/node_exporter

Either there's a paradigm that I'm missing here, or prometheus is wildly complex. I can't spend a week or two or three setting up friggin monitoring.

mlissner commented 3 years ago

AWS seems to have hosted prometheus these days, but people say it's competing with datadog, which is maybe expensive?

mlissner commented 2 months ago

We have prometheus set up roughly, but not carefully. The task now is to write up how it works and optimize it to have all the right alerts.

@blancoramiro, can you please make a list of remaining tasks for this so we can make it as good as it should be?