fastly / fastly-exporter

A Prometheus exporter for the Fastly Real-time Analytics API
Apache License 2.0

Implement aggregated datacenter metrics #153

Closed mikelorant closed 5 months ago

mikelorant commented 1 year ago

Add flag to use aggregated datacenter metrics instead of per-datacenter.

Closes: #152
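For illustration only, a rough sketch of how such a flag might be wired up with the standard library; the flag name (-aggregate-only) and the plumbing are assumptions, not the exporter's actual implementation:

package main

import (
    "flag"
    "log"
)

func main() {
    // Hypothetical flag; the real exporter may use a different name and
    // flag-parsing setup.
    aggregateOnly := flag.Bool(
        "aggregate-only",
        false,
        "emit aggregated datacenter metrics instead of per-datacenter metrics",
    )
    flag.Parse()

    // The value would then be passed down to the per-service processing
    // code (realtime, origin, domain) to decide which metrics to emit.
    log.Printf("aggregate-only=%v", *aggregateOnly)
}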

mikelorant commented 12 months ago

@leklund Would you be able to do a preliminary review, please?

leklund commented 10 months ago

@mikelorant Sorry this has been sitting for so long. I'll take a look at this early next week.

mikelorant commented 8 months ago

@leklund Thanks for all the feedback!

All your points make sense and I am happy to make all the changes you requested. I expect to have them implemented by the time you get back.

mikelorant commented 6 months ago

Breaking this pull request up into smaller pieces.

leklund commented 6 months ago

@mikelorant Thanks for the new PR! I spent some time looking at this today and had some thoughts/ideas. I've left a summary at the top because this got long.

Summary

I had a realization that instead of adding complexity to remove the datacenter label, you could instead just use a constant for the datacenter, which has the same desired effect of reducing the /metrics response size. It could also skip the datacenter iteration completely and just use the Aggregated field from the Response struct (that's the last sample code at the very bottom).

Details

I was curious about performance with all the extra map allocations. I wrote a simple benchmark for Process():
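For reference, a minimal sketch of what such a benchmark might look like; the fixture helpers (testResponse, testMetrics) are assumptions standing in for whatever test data the package provides:

package realtime

import "testing"

func BenchmarkProcess(b *testing.B) {
    response := testResponse() // hypothetical helper: decoded real-time API payload
    m := testMetrics()         // hypothetical helper: *Metrics with registered collectors

    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        Process(response, "my-service-id", "my-service-name", "123", m)
    }
}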

main branch:

pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12         2840        419057 ns/op        1883 B/op         57 allocs/op

mikelorant:refactor/labels using mergeLabels:

pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12         1012        999219 ns/op      264528 B/op       1640 allocs/op

I haven't tested out the real-world implications but it could be noticeable if you happen to use the exporter with a large number of services.

I did a quick pass using a slice instead of the map and it's better:

pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12         2558        465463 ns/op       74551 B/op        815 allocs/op

That last benchmark used this code:

func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics) {
    labels := make([]string, 3)

    labels[0] = serviceID
    labels[1] = serviceName

    for _, d := range response.Data {
        for datacenter, stats := range d.Datacenter {
            labels[2] = datacenter
            process(labels, stats, m)
        }
    }
}

func process(labels []string, stats Datacenter, m *Metrics) {
    ...
    m.StatusCodeTotal.WithLabelValues(append(labels, "200")...).Add(float64(stats.Status200))
    ...
}

I also tried a version where I used the prometheus.Labels map, adding the extra labels and then deleting the key after use. It's not very elegant but was slightly better memory-wise than the slice approach.

pkg: github.com/fastly/fastly-exporter/pkg/realtime
BenchmarkProcess-12         1513        787019 ns/op       31202 B/op        241 allocs/op

It looked like this:

labels["status_group"] = "1xx"
m.ComputeRespStatusTotal.With(labels).Add(float64(stats.ComputeRespStatus1xx))
... etc.
delete(labels, "status_group")
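For context, a hedged sketch of how that shared prometheus.Labels map might be set up (it would need the github.com/prometheus/client_golang/prometheus import, and the label names here are assumptions, not necessarily the exporter's actual names):

// Hypothetical helper illustrating the reused label map.
func processWithLabels(serviceID, serviceName, datacenter string, stats Datacenter, m *Metrics) {
    labels := prometheus.Labels{
        "service_id":   serviceID,
        "service_name": serviceName,
        "datacenter":   datacenter,
    }

    // Temporarily add the extra label, observe, then delete it so the same
    // map can be reused for the next metric family.
    labels["status_group"] = "1xx"
    m.ComputeRespStatusTotal.With(labels).Add(float64(stats.ComputeRespStatus1xx))
    delete(labels, "status_group")
    // ... repeat for the remaining metrics ...
}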

While exploring ways to optimize this I had a realization: this code is still iterating over the per-datacenter metrics and incrementing counters. Instead, I think it could just use the aggregated metrics that are returned by the real-time API and only increment once. The datacenter label could still exist, just with the value "aggregated". This means not adding complexity to the metric initialization, and we can retain the same Process() func. In fact, the simplest approach would be to just change the datacenter name to "aggregated" and keep the label.

const aggdc = "aggregated"

func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics, aggOnly bool) {
    for _, d := range response.Data {
        for datacenter, stats := range d.Datacenter {
            if aggOnly {
                datacenter = aggdc
            }
            // increment all the metrics using datacenter and stats
        }
    }
}

Or using the Aggregated field (this would be my suggestion, as it skips all the iteration):

const aggdc = "aggregated"

func Process(response *Response, serviceID, serviceName, serviceVersion string, m *Metrics, aggOnly bool) {
    for _, d := range response.Data {
        if aggOnly {
            process(serviceID, serviceName, aggdc, d.Aggregated, m)
        } else {
            for datacenter, stats := range d.Datacenter {
                process(serviceID, serviceName, datacenter, stats, m)
            }
        }
    }
}

func process(serviceID, serviceName, datacenter string, stats Datacenter, m *Metrics) {
    // increment all the things
    m.ComputeRespStatusTotal.WithLabelValues(serviceID, serviceName, datacenter, "1xx").Add(float64(stats.ComputeRespStatus1xx))
    // etc
}
leklund commented 6 months ago

Since my last comment was very long and rambling I thought it might help to put my idea into code:

https://github.com/fastly/fastly-exporter/compare/agg-only?expand=1

mikelorant commented 6 months ago

On first pass, I totally agree with you on the new approach. My initial implementation was about just setting the datacenter to aggregated, but I was expecting it to be rejected because it seemed lazy, so I instead decided to do the proper implementation of removing the label.

How much refinement do we need to your branch to turn it into something functional? I assume we need to make it work for domain and origin metrics?

I really am hoping to convince you, based on the metrics, that defaulting to per-datacenter is a bad idea. The cardinality explosion is so major that if you have hundreds of active services you are going to swamp yourself in metrics. Add this to the cost of using the domain and origin inspectors and you are multiplying out yet again.

This is the type of breaking change that I feel would be beneficial to all users. Make them opt in to datacenter metrics if they need that detail. I doubt most realise how costly it is, and it likely doesn't provide tangible benefits.
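As a purely illustrative back-of-the-envelope (all counts here are assumptions, not measured values): 200 active services × ~100 datacenters × a few hundred per-datacenter series per service is on the order of millions of time series, whereas aggregated it is 200 services × the same few hundred series, on the order of tens of thousands.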

leklund commented 6 months ago

> On first pass, I totally agree with you on the new approach. My initial implementation was about just setting the datacenter to aggregated, but I was expecting it to be rejected because it seemed lazy, so I instead decided to do the proper implementation of removing the label.

While it does seem a little off to use "aggregated" as the datacenter, I do like the simplicity of that implementation. I also like that it doesn't break existing dashboards that might rely on the datacenter label. I agree that the ideal case would probably be to remove the label completely, but I try to be pragmatic about these things.

> How much refinement do we need to your branch to turn it into something functional? I assume we need to make it work for domain and origin metrics?

We need to update the README/documentation, add support for domains/origins, and add any test cases I missed.

> I really am hoping to convince you, based on the metrics, that defaulting to per-datacenter is a bad idea. The cardinality explosion is so major that if you have hundreds of active services you are going to swamp yourself in metrics. Add this to the cost of using the domain and origin inspectors and you are multiplying out yet again.

> This is the type of breaking change that I feel would be beneficial to all users. Make them opt in to datacenter metrics if they need that detail. I doubt most realise how costly it is, and it likely doesn't provide tangible benefits.

It's definitely a high-cardinality set of metrics and I'm sure there are other users like yourself that would benefit from the smaller set of aggregate-only metrics. But I also know users that rely on the per datacenter information so I am reluctant to make this breaking change.

Even if this is not the default I think we can add a prominent note in the README on using the new flag to reduce cardinality.

mikelorant commented 6 months ago

> We need to update the README/documentation, add support for domains/origins, and add any test cases I missed.

Sounds like this is the path forward then.

Happy to help with copying the implementation and adding the same code to both origin and domain.

leklund commented 6 months ago

> Happy to help with copying the implementation and adding the same code to both origin and domain.

Thank you, that would be great! Feel free to copy/modify what I quickly wrote and add it to origins/domains.

mikelorant commented 6 months ago

Have created the initial pull requests (#168 and #169) to prepare for the feature.

I was then able to take your branch and extend it to add support for the domain and origin systems. I did some minor refactoring to reduce indentation and remove the global.

You can see the results on my feature/aggregate-only branch.

Be aware, this branch by itself does not pass unit tests until we merge #168. My plan is, once both #168 and #169 are merged, to update this existing pull request with the code from that branch rather than open a new pull request.

mikelorant commented 5 months ago

This is now the new implementation as we discussed.

Note that there is no documentation yet; if you are happy with this approach, I will add it.

mikelorant commented 5 months ago

Outstanding tasks:

mikelorant commented 5 months ago

Preliminary documentation (a modified version of what we previously had) has been added. I think it would be wise to educate users about the benefit of enabling this feature and how much of an impact it can have on reducing the number of time series emitted.

mikelorant commented 5 months ago

Wording has been changed per your recommendations.

Been a long road to get this pull request to where we want it, but it seems we finally have something worthy of merging!

leklund commented 5 months ago

@mikelorant Thanks for your patience and all the work you did in getting this merged! I'll cut a new release asap.

mikelorant commented 5 months ago

Thanks so much for your efforts in getting this implemented. It's been a great learning process, and the solution implemented will really benefit users with a large number of services.