bh2smith / dune-sync


Observability via Prometheus Metrics #92

Closed: mooster531 closed this 1 day ago

mooster531 commented 3 days ago

Closes #57

mooster531 commented 3 days ago

> This looks pretty cool! I think that most of the changes in `main` could be moved into the Job class, like a job computing its own ID and logging its own metrics. Otherwise I'm keen to give it a try!

Not sure if those fall under the responsibilities of the Job class as we've built it up so far, but anyway this needs more work and I will consider it in the meantime.

bh2smith commented 2 days ago

Actually, that may be the case. I was mostly worried about logs that say `main.main`. Maybe we can just rename the `main` function to `run`.

mooster531 commented 2 days ago

You could take a look at what we have now:

  1. Check out the branch
  2. Start the entire compose stack: `docker compose up -d --force-recreate`
  3. Open http://localhost:3000 in a browser
  4. Log in with `admin`/`admin`
  5. Go to the dashboards and find "Dune Sync"
  6. Run `main` a few times and watch what you get in the Grafana dashboard

[screenshot: Grafana "Dune Sync" dashboard]

What worries me, and I haven't been able to find a solution for: the metrics record job durations by name, but I can't find a way to include the individual run ID when submitting them, so that you could actually see which run took unexpectedly long, if that ever happens.

Prometheus metrics can carry custom labels, but if we put the run ID in a custom label we end up with a separate metric series per job run, which looks... not very helpful: [screenshot: one series per run ID]
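For reference, the push currently looks roughly like this (a minimal sketch assuming `prometheus_client` and the Pushgateway from the compose stack; the metric and label names are illustrative, not necessarily what's on the branch):

```python
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()

# One duration series per job name. Putting the run ID in a label would
# instead create a brand-new series for every single run.
duration = Gauge(
    "dune_sync_job_duration_seconds",
    "Wall-clock duration of a job run",
    ["job_name"],
    registry=registry,
)

start = time.monotonic()
# ... run the job here ...
duration.labels(job_name="dune-to-postgres-example").set(time.monotonic() - start)

# The Pushgateway groups pushes by the `job` grouping key and overwrites
# previous values within the same group.
push_to_gateway("localhost:9091", job="dune-sync", registry=registry)
```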

Honestly I'm a little lost on how to structure this.

bh2smith commented 2 days ago

Cool! I'll look into it. My first thought about the naming is to use the job name as the ID... or better yet, we could define a nice `repr` or hash function for the job class that gives a unique, informative label. If that's not an option, perhaps we can use "labels" on the recordings so the metrics can be grouped by label. 🏷️

From your screenshot it already looks like you have the job name in there. I would still suggest coming up with a more compact representation of the job, without the long sentence, and including details about the databases involved.
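Something along these lines, for instance (a hypothetical sketch; the real Job fields in this repo may differ):

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Illustrative stand-in for the real job class."""

    name: str
    source: str       # e.g. "dune"
    destination: str  # e.g. "postgres"

    def __repr__(self) -> str:
        # Compact label: job name plus the databases involved.
        return f"{self.name}[{self.source} -> {self.destination}]"


# repr(Job("my-query", "dune", "postgres")) == "my-query[dune -> postgres]"
```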

mooster531 commented 2 days ago

> We should ignore and delete the grafana directory.

We will; I only committed it here so you can have the dashboard ready to go without having to set it up yourself.

mooster531 commented 2 days ago

Grouping by run ID just gives us a 1 for each "duration"? I really don't get how this thing works. [screenshot: dashboard panel grouped by run ID]

bh2smith commented 2 days ago

Oh, well, it was changed, presumably when I renamed my jobs.

I got as far as seeing "no data" in the dashboard.

The Prometheus container kept dying with:

```
level=ERROR source=main.go:601 msg="Error loading config (--config.file=/etc/prometheus/config.yml)" file=/etc/prometheus/config.yml err="open /etc/prometheus/config.yml: no such file or directory"
```

Fixed this by adding `prometheus/config.yml` like so:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'pushgateway'
    static_configs:
      - targets: ['pushgateway:9091']
```

bh2smith commented 2 days ago

> Grouping by Run ID just gives us a 1 for each "duration"? I really don't get how this thing works

I am suggesting grouping by "label".

I will play around with this and see what we can do.

bh2smith commented 2 days ago

I would say that this all looks pretty good on my end. You can also change the poll frequency of the dune queries to see some different times.

[Screenshot 2024-11-27: Grafana dashboard with durations grouped by label]

These things are already grouped by label (aka job name).

The only thing missing is some way of counting failures. Basically we just need a counter that should always be zero; if it's ever not zero, someone would want an alert.
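Something like this would do it (a rough `prometheus_client` sketch; the metric name and the job runner are placeholders):

```python
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()

# Should always read zero; any increment is alert-worthy.
failures = Counter(
    "dune_sync_job_failures_total",
    "Number of job runs that raised an exception",
    ["job_name"],
    registry=registry,
)


def run_job() -> None:
    """Placeholder for the actual job runner."""


try:
    run_job()
except Exception:
    failures.labels(job_name="dune-to-postgres-example").inc()
    raise
finally:
    push_to_gateway("localhost:9091", job="dune-sync", registry=registry)
```

An alert rule could then fire on something like `increase(dune_sync_job_failures_total[1h]) > 0`.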

For large config files with multiple "Dune 2 Postgres" jobs, the user will have to be careful with the poll frequency configuration and potential rate limiting on the "getResults" requests. This is a likely place where a job could fail. We may want to return the execution ID on failures like this and implement a backoff-retry mechanism. However, this should probably live in the dune-client package.
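If we do add retries, I imagine something generic like the following (a sketch only, not the dune-client API; `fetch_results` is a hypothetical helper):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], attempts: int = 5, base_delay: float = 1.0) -> T:
    """Retry `call` with exponential backoff, re-raising after the last attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")


# e.g. with_backoff(lambda: fetch_results(execution_id))  # fetch_results is hypothetical
```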

bh2smith commented 2 days ago

So things are looking great here now and the tests are all passing.

My only remaining concern is that the metrics logic is now intertwined with the job run function: we collect metrics and then, at the very end, only actually post them if the env var is set. I am wondering if there is some kind of wrapper/decorator we could use to wrap the metrics around the job runner only if the optional env var is set and ready (i.e. it exists and can be pinged). Otherwise, the metrics wouldn't be collected at all.
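Roughly what I have in mind (a sketch only; the env var name and the push helper are made up):

```python
import functools
import os
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def _push_metrics(gateway: str, elapsed: float) -> None:
    """Hypothetical helper that pushes duration/failure metrics to `gateway`."""


def with_metrics(fn: Callable[..., T]) -> Callable[..., T]:
    """Wrap a job runner with metrics only when the push gateway is configured."""
    # Env var name is an assumption, not necessarily what the branch uses.
    gateway = os.environ.get("PROMETHEUS_PUSHGATEWAY_URL")
    if not gateway:
        return fn  # metrics disabled: the runner is returned untouched

    @functools.wraps(fn)
    def wrapper(*args, **kwargs) -> T:
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            _push_metrics(gateway, time.monotonic() - start)

    return wrapper
```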

This can also be saved for another issue, but if we don't implement it here, I would like to at least log it so it's not forgotten.

mooster531 commented 2 days ago

Made a counter for the failures and a decorator for metrics gathering. Feel free to merge if happy :)

bh2smith commented 1 day ago

Really nice work here. I think this is ready to squash! 🚢

✨ 🎉