avaje / avaje-metrics

Core implementation of avaje metric api
http://avaje-metrics.github.io

9 years on, should avaje metrics continue? Yes #55

Open rbygrave opened 2 years ago

rbygrave commented 2 years ago

avaje-metrics as a project has been going for 9 years now. Fair to say it is unloved and unknown in the community, so the question is: should it continue to exist?

Dropwizard Metrics

The original reason avaje-metrics was created was that I didn't believe the benefit of a Histogram exceeded its cost and the difficulty of aggregating it (how do you aggregate a 50th percentile?) when we collect metrics often enough (like every minute) and can instead use a bucket timer, which is a lot less expensive than a Histogram. All of that still holds true. I also note that there looks to be some hesitation around maintaining Dropwizard Metrics going forward.
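
To make the aggregation point concrete: a bucket timer only keeps a counter per fixed time bucket, and counts from many one-minute intervals or many nodes combine by simple addition, which a 50th percentile cannot. The sketch below is not the avaje-metrics API, just a minimal illustration of the idea with made-up bucket bounds.

```java
import java.util.concurrent.atomic.LongAdder;

/** Illustrative bucket timer: counts timed events into fixed duration buckets. */
final class SimpleBucketTimer {

  // Upper bounds in milliseconds for each bucket; the last bucket catches everything above.
  private final long[] boundsMillis;
  private final LongAdder[] counts;

  SimpleBucketTimer(long... boundsMillis) {
    this.boundsMillis = boundsMillis;
    this.counts = new LongAdder[boundsMillis.length + 1];
    for (int i = 0; i < counts.length; i++) {
      counts[i] = new LongAdder();
    }
  }

  /** Record one timed event against the matching bucket. */
  void add(long durationMillis) {
    for (int i = 0; i < boundsMillis.length; i++) {
      if (durationMillis < boundsMillis[i]) {
        counts[i].increment();
        return;
      }
    }
    counts[boundsMillis.length].increment();
  }

  /** Snapshot the per-bucket counts; these aggregate across nodes and intervals by summing. */
  long[] snapshot() {
    long[] out = new long[counts.length];
    for (int i = 0; i < counts.length; i++) {
      out[i] = counts[i].sum();
    }
    return out;
  }
}
```

For example, `new SimpleBucketTimer(100, 500, 1000)` counts events into under-100ms, under-500ms, under-1s and over-1s buckets, and two snapshots can be merged by adding the arrays element-wise.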

Micrometer

The idea they are going for sounds good, but micrometer isn't very micro at all - over 600 KB - and looking at the API and internals I'm not overly convinced by it.

What should avaje-metrics be going forward?

So that is what I've done, and I've been brutal in the sense that I've "modernised" the API and types and modularized it better. Given the tiny number of existing users out there, I significantly changed the API without concern for backward compatibility - apologies for that if people are looking to upgrade.

Having done the big refactor I'm very happy with the result. A heck of a lot of avaje-metrics was pretty good, but I think this refactor gives it a lot more consistency with Dropwizard and Micrometer, and I think it's a lot more future oriented. So yes, happy.

agentgt commented 10 months ago

@rbygrave FWIW we have our own facade, and the reason is that other than dropwizard (aka codahale) most of them are massively over-complicated, with jar sizes in the megs.

Particularly micrometer. They lazily just put all the shit in one jar.

I just found this project and was unaware you basically did what I did internally.

My personal belief is that JFR, Prometheus, and Open Telemetry are the future, and that reporting is actually the more complicated part, because you either allow scraping (which means having some endpoint open) or you are pushing. Both can be, and often are, pretty painful in terms of dependency hell (Open Telemetry and Prometheus come to mind).

I'm curious how you are reporting. I assume you are using Graphite or Collectd.

Those aren't as popular anymore.

The newer guys seem to be Vector (think Collectd), Prometheus (basically graphite), and Open Telemetry (basically agents).

For storage I'm embarrassed to say we used TimescaleDB, but if Timescale gets more closed source, as so many of these investor-backed open source projects do, we may have to change.

Anyway I just wanted to give you some insight on what we do.

Like avaje-config I will look to see if we can use avaje-metrics instead of our internal stuff.

rbygrave commented 1 month ago

Reporting

Mostly Graphite and yes, some old Collectd. I might add Prometheus format, but I think it's more likely we continue with Graphite. This goes into ClickHouse storage with a Grafana front end.

Open Telemetry (basically agents).

Pretty much all the time I use metrics-agent / metrics-maven-plugin ... and enhance either:

... plus then some extra counters for various application needs. The result is pretty good APM observability with a stacked bar of "Sum of total time of every public component method" - this basically tells you where the time is going. Typically there is some "double counting" with components calling components, so you can add @NotTimed to some of those if you don't want the stacked time to "double up" (see the sketch after this paragraph).
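
As a rough illustration of that @NotTimed usage - assuming the enhancement times every public method of a component, and assuming the annotation lives in io.avaje.metrics.annotation (the package is my guess, check your avaje-metrics version):

```java
// Assumption: the package of @NotTimed; adjust the import to your avaje-metrics version.
import io.avaje.metrics.annotation.NotTimed;

// Outer component: its public methods are timed by the enhancement.
public class OrderService {

  private final PricingService pricingService = new PricingService();

  public void placeOrder(long customerId) {
    // Time spent here already includes the call below, which would otherwise
    // also show up under PricingService in the stacked "component time" chart.
    pricingService.price(customerId);
  }
}

// Component that is only ever called from already-timed components, so its
// public method is excluded to avoid double counting in the stacked total.
class PricingService {

  @NotTimed
  public void price(long customerId) {
    // ... pricing logic
  }
}
```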

reporting ... are pushing

I've generally preferred pushing. The only trick is pushing when running in AWS Lambdas, because you want to push in the background but don't want the lambda to be suspended halfway through pushing metrics via Graphite. I've pretty much solved that: avaje-metrics comes with a ScheduledTask to make it work well - the lambda has a try/finally, and inside the finally you call ScheduledTask.waitIfRunning() (see the sketch below).
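
A minimal sketch of that try/finally shape. Only waitIfRunning() is taken from the comment above; the ScheduledTask package, whether the call is on an instance, and the handler wiring are my assumptions, so treat this as illustrative rather than the actual API.

```java
// Assumption: ScheduledTask is in io.avaje.metrics; adjust to your avaje-metrics version.
import io.avaje.metrics.ScheduledTask;

public class LambdaMetricsExample {

  // Assumed: a ScheduledTask that pushes metrics (e.g. to Graphite) in the background.
  // How it is built and scheduled is not shown in the thread, so it is passed in here.
  private final ScheduledTask metricsPushTask;

  public LambdaMetricsExample(ScheduledTask metricsPushTask) {
    this.metricsPushTask = metricsPushTask;
  }

  /** Invoked per lambda request (the actual AWS handler wiring is omitted). */
  public String handle(String input) {
    try {
      // ... normal request handling
      return "ok";
    } finally {
      // Wait for any in-flight background push to finish before the lambda
      // returns and the runtime may suspend the environment mid-push.
      metricsPushTask.waitIfRunning();
    }
  }
}
```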

So I'm likely just going to stick with pushing style but I am interested in Prometheus format.

Just to say, Ebean ORM collects metrics on all queries executed and automatically gives them decent names, so I also tend to publish those and present them the same way. That way you see "Sum of total time by query" in a stacked bar per minute. I tend to view "Mean time by query" in a dot plot.

So with "Component Time" + "Query Time" in 2 stacked bars per minute it's pretty easy to see performance issues. The mean query time shows up the slow queries that aren't run often (so they are not prominent in total time) and I like to review and optimise those as well.


The docs for this are pretty much absent ... I need to fix that.