I have worked with @matthope to create a preliminary implementation of this feature. Before I take the final steps to turn this into an open pull request, I wanted to discuss whether this is a feature that would be beneficial to add to the Fastly exporter.
The initial implementation is based on the work done by @matthope in 2020 and required some effort to bring up to date with the head of the master branch.
I then took the opportunity to refine the implementation based on his feedback and our discussions.
This means the work will need to be combined into 2 commits, each attributed to the developer who added the code. As there are no contributing guidelines, clarity is needed on how this work should be submitted.
The current state of the implementation is a combination of 2 stacked branches layered on top of master.
The final diff report can be viewed here: https://github.com/fastly/fastly-exporter/compare/main...fairfaxmedia:fastly-exporter:feature/aggregate-datacenter-improve
Any feedback would be greatly appreciated.
Preliminary pull request #153 created.
This pull request is being split into multiple pull requests.
The first change is to refactor the way labels are implemented, allowing the default labels to be changed easily. See pull request #167.
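To illustrate the idea (this is only a hedged sketch; PR #167 may structure it differently, and the label names here are hypothetical), centralising the label set in one place makes it straightforward to change which labels are applied by default:

```go
package main

import "fmt"

// defaultLabels returns the label set used when registering metrics.
// Keeping this in a single function means one place controls whether the
// datacenter label is included. Hypothetical sketch only, not PR #167's code.
func defaultLabels(aggregateDatacenters bool) []string {
	labels := []string{"service_id", "service_name"}
	if !aggregateDatacenters {
		labels = append(labels, "datacenter")
	}
	return labels
}

func main() {
	fmt.Println(defaultLabels(false)) // [service_id service_name datacenter]
	fmt.Println(defaultLabels(true))  // [service_id service_name]
}
```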
Problem
Currently, when collecting stats for 201 services, after running the exporter for 13 days with 12 shards, the metrics endpoint output is approximately 424,910 KB per scrape.

With a scrape interval of 60 seconds, the bandwidth requirement becomes 7,082 KB/s. In terms of storage requirements, this is 424,910 KB × 60 mins × 24 hours = 584 GB of raw data per day.

This can have a considerable impact on Prometheus scraping performance as this is a very large payload.
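For reference, those figures follow directly from the payload size and scrape interval. A small sketch of the arithmetic, using the numbers from the example above:

```go
package main

import "fmt"

func main() {
	const (
		payloadKB      = 424_910.0 // metrics endpoint output per scrape, from the example above
		scrapeInterval = 60.0      // seconds
	)

	bandwidthKBps := payloadKB / scrapeInterval // ~7,082 KB/s
	dailyRawKB := payloadKB * 60 * 24           // one scrape per minute, all day
	dailyRawGB := dailyRawKB / (1024 * 1024)    // ~584 GB

	fmt.Printf("bandwidth: %.0f KB/s, raw data per day: %.0f GB\n", bandwidthKBps, dailyRawGB)
}
```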
Proposal
Currently, each datacenter is a label, which multiplies the number of series for each metric. When combined with a metric that has a `status_code` label, this can explode the number of metrics returned.

A possible solution to reduce the output size of the metrics endpoint would be to aggregate the datacenter metrics.
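Before looking at the numbers, here is a minimal sketch of what dropping the datacenter label means in practice. It uses the Prometheus Go client with illustrative metric and label names, not the exporter's actual code:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Per-datacenter metric: one series per (service, datacenter, status_code).
var requestsPerDC = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "fastly_rt_requests_total", // illustrative name
		Help: "Requests received, labelled per datacenter.",
	},
	[]string{"service_id", "datacenter", "status_code"},
)

// Aggregated metric: the datacenter label is dropped, so all datacenters
// collapse into a single series per (service, status_code).
var requestsAggregated = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "fastly_rt_requests_aggregated_total", // illustrative name
		Help: "Requests received, summed across all datacenters.",
	},
	[]string{"service_id", "status_code"},
)

// recordDatacenterStats shows the idea: when aggregation is enabled, every
// datacenter's count is added to the same (service, status_code) series
// instead of fanning out one series per datacenter.
func recordDatacenterStats(serviceID string, perDC map[string]float64, statusCode string, aggregate bool) {
	for dc, n := range perDC {
		if aggregate {
			requestsAggregated.WithLabelValues(serviceID, statusCode).Add(n)
		} else {
			requestsPerDC.WithLabelValues(serviceID, dc, statusCode).Add(n)
		}
	}
}

func main() {
	prometheus.MustRegister(requestsPerDC, requestsAggregated)
	recordDatacenterStats("svcABC", map[string]float64{"SYD": 10, "IAD": 25, "LHR": 7}, "200", true)
	fmt.Println("aggregated example recorded")
}
```

The per-datacenter form produces one series per datacenter for every (service, status_code) pair, which is where the multiplication described above comes from; the aggregated form keeps a single series per pair.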
Analysis of how this might impact the output size for the earlier example is as follows:
With a scrape interval of 60 seconds, the bandwidth requirement becomes 127 KB/s. In terms of storage requirements, this is 7,646 KB × 60 mins × 24 hours = 11 GB of raw data per day.

A comparison to the results with individual datacenter metrics shows the following improvements:

| | Individual datacenters | Aggregated |
|---|---|---|
| Output size per scrape | 424,910 KB | 7,646 KB |
| Bandwidth (60 s scrape interval) | 7,082 KB/s | 127 KB/s |
| Raw data per day | 584 GB | 11 GB |
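Derived purely from the two payload sizes above, the relative improvement works out as follows:

```go
package main

import "fmt"

func main() {
	const (
		perDatacenterKB = 424_910.0 // payload per scrape with per-datacenter labels
		aggregatedKB    = 7_646.0   // payload per scrape with aggregated datacenters
	)

	reduction := 1 - aggregatedKB/perDatacenterKB // fraction of the payload eliminated
	factor := perDatacenterKB / aggregatedKB      // how many times smaller the payload is

	fmt.Printf("payload reduced by %.1f%% (~%.0fx smaller)\n", reduction*100, factor)
}
```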
A side effect of aggregated datacenter metrics is that memory consumption should also be reduced. It is hard to determine the exact impact, but there should certainly be some improvement.
Conclusion
Aggregated datacenter metrics would provide an option for users who wish to reduce the metrics endpoint output size. By providing this as an option (not the default), users can decide whether the benefits of reducing the output size outweigh the loss of individual datacenter metrics.
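Purely as an illustration of "opt-in, not default", here is a hedged sketch using the standard library flag package; the flag name `-aggregate-datacenters` is hypothetical and not the exporter's actual interface:

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	// Hypothetical flag name; the real exporter may expose this differently.
	aggregate := flag.Bool("aggregate-datacenters", false,
		"Sum metrics across datacenters instead of emitting one series per datacenter (reduces output size, loses per-datacenter detail).")
	flag.Parse()

	if *aggregate {
		fmt.Println("datacenter aggregation enabled: per-datacenter series will not be exported")
	} else {
		fmt.Println("default behaviour: one series per datacenter")
	}
}
```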