mooster531 closed this 1 day ago
This looks pretty cool! I think that most of the changes in main could be moved into the job class. Like a job computing its own id and logging its own metrics. Otherwise I'm keen to give it a try!
Not sure if those fall under the responsibilities of the Job class as we've built it up so far, but this needs more work anyway, and I'll consider it in the meantime.
Actually, that may be the case. I was mostly worried about logs that say `main.main`. Maybe we can just rename the main function to "run".
You could take a look at what we have now:

- `docker compose up -d --force-recreate`
- log in to Grafana with admin/admin
- run `main` a few times and watch what you get in the Grafana dashboard

What worries me, and I haven't been able to find a solution for: the metrics record the durations of jobs by name, but I can't find a way to include the individual run ID when submitting them, so that you could actually see which job run took unexpectedly long, if that ever happens.
Prometheus metrics may contain custom labels, but if we put the run ID in a custom label then we get an individual metric per job run, which looks... not very helpful:

[screenshot: one time series per run ID]
Honestly I'm a little lost on how to structure this.
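For reference, here's roughly the shape of the push that runs into this (a minimal sketch using prometheus_client, not the actual code; the metric name, run ID format, and gateway address are placeholders):

```python
import time
import uuid

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge(
    "job_duration_seconds",
    "Duration of the last job run",
    labelnames=["job_name", "run_id"],  # putting run_id here is what explodes the series count
    registry=registry,
)

job_name = "dune2postgres-example"  # placeholder job name
run_id = uuid.uuid4().hex[:8]       # placeholder per-run identifier

start = time.monotonic()
# ... run the job ...
duration.labels(job_name=job_name, run_id=run_id).set(time.monotonic() - start)

# Every distinct run_id value becomes its own time series in Prometheus,
# which is why the dashboard ends up showing one metric per run.
push_to_gateway("localhost:9091", job="dune-sync", registry=registry)
```

If the run ID is dropped from the label names and only written to the logs alongside the push, the series stay grouped by job name.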
Cool! I'll look into it. My first thought about the naming is to use the job name as the id... or better yet, we could define a nice repr or hash function for the job class that gives a unique informative label. If that's not an option perhaps we can use "labels" on the recordings so the metrics can be grouped by label. 🏷️
From your screenshot it already looks like you have the job name in there. I would still suggest coming up with a more compact representation of the job, without the long sentence, that includes details about the databases involved.
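Something like this is what I have in mind for the compact representation (purely a hypothetical sketch; the field names are assumptions, not the actual Job class in this repo):

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    source: str       # e.g. "dune"
    destination: str  # e.g. "postgres"

    @property
    def id(self) -> str:
        # Compact, stable label like "my-query-dune2postgres-1a2b3c":
        # the job name, the databases involved, and a short hash for uniqueness.
        digest = hashlib.sha1(
            f"{self.name}|{self.source}|{self.destination}".encode()
        ).hexdigest()[:6]
        return f"{self.name}-{self.source}2{self.destination}-{digest}"

    def __repr__(self) -> str:
        return self.id
```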
We should add the grafana directory to .gitignore and delete it from the repo.
We will; I only committed it here so you can have the dashboard ready to go without having to set it up yourself.
Grouping by Run ID just gives us a 1 for each "duration"? I really don't get how this thing works
Oh, well, it was changed, presumably when I renamed my jobs.
I got as far as seeing "no data" in the dashboard.
The Prometheus container kept dying with:

```
level=ERROR source=main.go:601 msg="Error loading config (--config.file=/etc/prometheus/config.yml)" file=/etc/prometheus/config.yml err="open /etc/prometheus/config.yml: no such file or directory"
```
Fixed this by adding `prometheus/config.yml`, like so:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'pushgateway'
    static_configs:
      - targets: ['pushgateway:9091']
```
> Grouping by Run ID just gives us a 1 for each "duration"? I really don't get how this thing works

I am suggesting grouping by "label".
I will play around with this and see what we can do.
I would say that this all looks pretty good on my end. You can also change the poll frequency of the Dune queries to see some different times.
These things are already grouped by label (aka job name).
The only thing missing is some way of counting failures. Basically we just need a counter that should always be zero; if it's ever not zero, someone would want an alert.
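A sketch of what I mean, assuming prometheus_client and the same Pushgateway (the metric name and gateway address are placeholders):

```python
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()
failures = Counter(
    "job_failures_total",
    "Number of failed job runs; anything above zero should trigger an alert",
    labelnames=["job_name"],
    registry=registry,
)


def record_failure(job_name: str) -> None:
    # Called whenever the job runner catches an exception.
    failures.labels(job_name=job_name).inc()
    push_to_gateway("localhost:9091", job="dune-sync", registry=registry)
```

One Pushgateway subtlety: each push replaces the previously pushed values for that grouping key, so the alert would key off the pushed value being non-zero rather than off a rate.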
For large config files with multiple "Dune 2 Postgres" jobs, the user will have to be careful with the poll frequency configuration and potential rate limiting on the "getResults" requests. This is likely a place where a job could fail. We may want to return the execution ID on failures like this and implement a backoff-retry mechanism; however, that should probably live in the dune-client package.
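Roughly what I mean by backoff-retry (a generic sketch; the exception type and the call being retried are placeholders, not the real dune-client API):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(fn: Callable[[], T], retries: int = 5, base_delay: float = 1.0) -> T:
    """Retry fn with exponential backoff, re-raising after the final attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:  # e.g. a rate-limit error from the getResults request
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```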
So things are looking great here now and the tests are all passing.
My only last concern is that the metrics logic is now intertwined with the job run function: we collect metrics and then, at the very end, only actually post them if the env var is set. I am wondering if there is some kind of wrapper/decorator we can use to wrap the metrics around the job runner only if the optional env var is set and ready (i.e. exists and can be pinged). Otherwise, the metrics wouldn't be collected at all.
This can also be saved for another issue, but if we don't implement it here, I would like to at least log it so it's not forgotten.
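Something along these lines is what I was picturing (just a sketch; the env var name, the readiness check, and the metric name are assumptions, not the existing code):

```python
import functools
import os
import time
import urllib.request
from typing import Any, Callable

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def _gateway_reachable(url: str) -> bool:
    # Assumes the Pushgateway URL includes a scheme, e.g. "http://localhost:9091".
    try:
        urllib.request.urlopen(f"{url}/-/ready", timeout=2)
        return True
    except OSError:
        return False


def with_metrics(job_name: str) -> Callable:
    """Wrap a job runner with duration metrics only if the gateway is configured and up."""
    gateway = os.environ.get("PROMETHEUS_PUSHGATEWAY_URL")  # assumed env var name
    if not gateway or not _gateway_reachable(gateway):
        # Metrics disabled: return the function unchanged, nothing is collected.
        return lambda fn: fn

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            registry = CollectorRegistry()
            duration = Gauge(
                "job_duration_seconds",
                "Duration of the last job run",
                labelnames=["job_name"],
                registry=registry,
            )
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                duration.labels(job_name=job_name).set(time.monotonic() - start)
                push_to_gateway(gateway, job="dune-sync", registry=registry)

        return wrapper

    return decorator
```

Usage would be something like `@with_metrics("my-job")` on the run function, so the un-instrumented path never touches the metrics code at all.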
Made a counter for the fails, and a decorator for metrics gathering. Feel free to merge if happy :)
Really nice work here. I think this is ready to squash! 🚢
✨ 🎉
Closes #57