@leklund Would you be able to do a preliminary review, please?
@mikelorant Sorry this has been sitting for so long. I'll take a look at this early next week.
@leklund Thanks for all the feedback!
All your points make sense and I am happy to make all the changes you requested. I expect to have them implemented by the time you get back.
Breaking this pull request up into smaller pieces.
@mikelorant Thanks for the new PR! I spent some time looking at this today and had some thoughts and ideas. I've left a summary at the top because it got long.
I had a realization that instead of adding complexity to remove the datacenter label, you could just use a constant for the datacenter, which has the same desired effect of reducing the /metrics response size. It could also skip the datacenter iteration completely and just use the Aggregate field from the Response struct (that's the last sample code at the very bottom).
I was curious about performance with all the extra map allocations. I wrote a simple benchmark for Process():
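Roughly, the benchmark looks something like this (the Datum/Datacenter literal and the newTestMetrics helper here are simplified placeholders rather than the exact test code):

```go
// Rough benchmark sketch for Process(); run with:
//   go test -bench=Process -benchmem ./pkg/realtime/
// The Datum/Datacenter field names follow the snippets in this thread, and
// newTestMetrics stands in for however a *Metrics is normally constructed in
// the package's tests.
func BenchmarkProcess(b *testing.B) {
    response := &Response{
        Data: []Datum{{
            Datacenter: map[string]Datacenter{
                "SYD": {Status200: 10, ComputeRespStatus1xx: 1},
                "LAX": {Status200: 20, ComputeRespStatus1xx: 2},
            },
        }},
    }
    m := newTestMetrics(b) // placeholder for the package's metrics setup

    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        Process(response, "my-service-id", "my-service-name", "123", m)
    }
}
```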
main branch:

```
pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12    2840    419057 ns/op    1883 B/op      57 allocs/op
```

mikelorant:refactor/labels branch using mergeLabels:

```
pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12    1012    999219 ns/op    264528 B/op    1640 allocs/op
```
I haven't tested the real-world implications, but it could be noticeable if you happen to use the exporter with a large number of services.
I did a quick pass to use a slice instead of the map, and it's better:
```
pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12    2558    465463 ns/op    74551 B/op     815 allocs/op
```
That last benchmark used this code:
```go
func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics) {
    labels := make([]string, 3, 3)
    labels[0] = serviceID
    labels[1] = serviceName
    for _, d := range response.Data {
        for datacenter, stats := range d.Datacenter {
            labels[2] = datacenter
            process(labels, stats, m)
        }
    }
}

func process(labels []string, stats Datacenter, m *Metrics) {
    // ...
    m.StatusCodeTotal.WithLabelValues(append(labels, "200")...).Add(float64(stats.Status200))
    // ...
}
```
I also tried a version where I used the prometheus.Labels map and would add the extra labels, then delete the key after use. It's not very elegant, but it was slightly better memory-wise than the slices.
```
pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12    1513    787019 ns/op    31202 B/op     241 allocs/op
```
It looked like this:
labels["status_group"] = "1xx"
m.ComputeRespStatusTotal.With(labels).Add(float64(stats.ComputeRespStatus1xx))
... etc.
delete(labels, "status_group")
While exploring ways to optimize this I had a realization: this code is still iterating over the per-datacenter metrics and incrementing counters. Instead, I think it could just use the Aggregated metrics that are returned by the real-time API and only increment once. The datacenter label could still exist, just with the value of "aggregated". This means not adding complexity to the metric initialization, and we can retain the same Process() func. In fact, the simplest approach would be to just change the datacenter name to "aggregated" and keep the label.
```go
const aggdc = "aggregated"

func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics, aggOnly bool) {
    for _, d := range response.Data {
        for datacenter, stats := range d.Datacenter {
            if aggOnly {
                datacenter = aggdc
            }
            // increment all the metrics
        }
    }
}
```
or using Aggregate (this would be my suggestion, as we skip all the iteration):
```go
const aggdc = "aggregated"

func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics, aggOnly bool) {
    for _, d := range response.Data {
        if aggOnly {
            // use the pre-aggregated stats from the response instead of iterating datacenters
            // (field name assumed)
            process(serviceID, serviceName, aggdc, d.Aggregated, m)
        } else {
            for datacenter, stats := range d.Datacenter {
                process(serviceID, serviceName, datacenter, stats, m)
            }
        }
    }
}

func process(serviceID, serviceName, datacenter string, stats Datacenter, m *Metrics) {
    // increment all the things
    m.ComputeRespStatusTotal.WithLabelValues(serviceID, serviceName, datacenter, "1xx").Add(float64(stats.ComputeRespStatus1xx))
    // etc
}
```
Since my last comment was very long and rambling I thought it might help to put my idea into code:
https://github.com/fastly/fastly-exporter/compare/agg-only?expand=1
On first pass, I totally agree with you on the new approach. My initial implementation was just setting the datacenter to aggregated, but I expected it to be rejected because it seemed lazy, so I instead decided to do the proper implementation of removing the label.
How much refinement does your branch need to turn it into something functional? I assume we need to make it work for domain and origin metrics?
I really am hoping to convince you, based on the metrics, that defaulting to per-datacenter is a bad idea. The cardinality explosion is so major that if you have hundreds of active services you are going to swamp yourself in metrics: every per-service time series is multiplied by the number of datacenters reporting traffic. Add in the cost of using the domain and origin inspectors and you are multiplying out yet again.
This is the type of breaking change that I feel would be beneficial to all users. Make them opt in to datacenter metrics if they need that detail. I doubt most realise how costly it is, and it likely doesn't provide tangible benefits.
> On first pass, I totally agree with you on the new approach. My initial implementation was just setting the datacenter to aggregated, but I expected it to be rejected because it seemed lazy, so I instead decided to do the proper implementation of removing the label.
While it does seem a little off to use "aggregated" as the datacenter, I do like the simplicity of that implementation. I also like that it doesn't break existing dashboards that might rely on the datacenter label. I agree that the ideal case would probably be to remove the label completely, but I try to be pragmatic about these things.
> How much refinement does your branch need to turn it into something functional? I assume we need to make it work for domain and origin metrics?
We need to update the README/documentation, add support for domains/origins, and add any test cases I missed.
> I really am hoping to convince you, based on the metrics, that defaulting to per-datacenter is a bad idea. The cardinality explosion is so major that if you have hundreds of active services you are going to swamp yourself in metrics: every per-service time series is multiplied by the number of datacenters reporting traffic. Add in the cost of using the domain and origin inspectors and you are multiplying out yet again.
>
> This is the type of breaking change that I feel would be beneficial to all users. Make them opt in to datacenter metrics if they need that detail. I doubt most realise how costly it is, and it likely doesn't provide tangible benefits.
It's definitely a high-cardinality set of metrics, and I'm sure there are other users like yourself who would benefit from the smaller set of aggregate-only metrics. But I also know users who rely on the per-datacenter information, so I am reluctant to make this breaking change.
Even if this is not the default I think we can add a prominent note in the README on using the new flag to reduce cardinality.
> We need to update the README/documentation, add support for domains/origins, and add any test cases I missed.
Sounds like this is the path forward then.
Happy to help with copying the implementation and adding the same code to both origin and domain.
> Happy to help with copying the implementation and adding the same code to both origin and domain.
Thank you, that would be great! Feel free to copy/modify what I quickly wrote and add it to origins/domains.
I have created the initial pull requests (#168 and #169) to prepare for the feature.
I was then able to take your branch and extend it to add support for the domain and origin systems. I did some minor refactoring to reduce indentation and remove the global.
You can see the results on my feature/aggregate-only branch.
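To give a rough idea of the shape without digging through the branch, the origin and domain processors follow the same pattern as the realtime one. The sketch below is illustrative only; field names like Aggregated and the nested datacenter/origin maps are assumptions rather than the exact code on that branch.

```go
// Illustrative sketch of the same aggregate-only switch applied to the origin
// (and, analogously, domain) Process func. Aggregated and the nested
// datacenter -> origin -> stats maps are assumed names, not the actual types.
func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics, aggregateOnly bool) {
    for _, d := range response.Data {
        if aggregateOnly {
            // per-origin stats pre-summed across datacenters by the API
            for origin, stats := range d.Aggregated {
                process(serviceID, serviceName, "aggregated", origin, stats, m)
            }
            continue
        }
        for datacenter, origins := range d.Datacenter {
            for origin, stats := range origins {
                process(serviceID, serviceName, datacenter, origin, stats, m)
            }
        }
    }
}
```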
Be aware, this branch by itself does not pass unit tests until we merge #168. My plan is that once both #168 and #169 are merged, I will update this existing pull request with the code from that branch and NOT open a new pull request.
This is now the new implementation as we discussed.
Note that there is no documentation yet; if you are happy with this approach, I will add it.
Outstanding tasks:
Preliminary documentation (a modified version of what we previously had) has been added. I think it would be wise to educate users about the benefit of enabling this feature and how much of an impact it can have on reducing the number of time series emitted.
Wording has been changed per your recommendations.
It's been a long road to get this pull request to where we want it, but it seems we finally have something worthy of merging!
@mikelorant Thanks for your patience and all the work you did in getting this merged! I'll cut a new release asap.
Thanks so much for your efforts in getting this implemented. It's been a great learning process, and the solution implemented will really benefit users with a large number of services.
Add flag to use aggregated datacenter metrics instead of per-datacenter.
Closes: #152