fastly / fastly-exporter

A Prometheus exporter for the Fastly Real-time Analytics API
Apache License 2.0

Question: Strategy for large number of services #31

Closed: neufeldtech closed this issue 5 years ago

neufeldtech commented 5 years ago

Hi there, thanks again for your time developing this software; it's helped us out immensely thus far. We've been running an old version (0.x) for quite a while, and it's been good to us. Right now we're running two fastly_exporter instances, with approximately 150 services each, on two VMs to share the load. I'm interested in the auto-discovery feature you've implemented in the new versions of this exporter, but I have concerns about how I can manage a large number of Fastly services with it.

In total, we have approximately 900 Fastly services deployed to one Fastly account. As one can imagine, if I were to even attempt to boot the fastly_exporter with autodiscovery enabled, it would be a bad time. Up until this point, I've been manually curating our list of 'important services' to monitor with the exporter, filtering out our staging environments, etc.

I was wondering whether there are any existing strategies for dealing with this many Fastly properties in the exporter, and how one might go about architecting the exporter and the Prometheus ingestion to handle these volumes.

A couple of key things come to mind:

peterbourgon commented 5 years ago

Interesting. At a high level, I don't see an inherent reason that one fastly-exporter shouldn't be able to handle 1000 services, though I can imagine the current architecture may not be ideal. Have you tried? If so, how does it explode?

Even if it's possible to do everything in one process, I can certainly understand why you'd want to "shard" the services out over multiple processes. My intuition would be to do something really simplistic. As a strawman, maybe we could have -shard-total and -shard-identity integer flags. If they're set, and if no explicit -service is provided, then each instance will "own" the discovered service IDs whose hash, modulo total, is equal to identity. So, if you want to split across 3 fastly-exporter instances, you'd start them as

fastly-exporter ... -shard-total 3 -shard-identity 0
fastly-exporter ... -shard-total 3 -shard-identity 1
fastly-exporter ... -shard-total 3 -shard-identity 2

I would want to have less stupid names for those flags, or otherwise make it more intuitive to set up—but would something like that work?
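
To make the strawman concrete, here's a minimal sketch of what such an ownership check could look like; the flag names and the choice of FNV-1a as the hash are illustrative assumptions, not anything the exporter actually ships:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// owns reports whether this instance should monitor the given service ID
// under a hash-modulo scheme: hash(serviceID) % shardTotal == shardIdentity.
// FNV-1a is an arbitrary choice here; any stable hash would do.
func owns(serviceID string, shardTotal, shardIdentity uint64) bool {
	h := fnv.New64a()
	h.Write([]byte(serviceID))
	return h.Sum64()%shardTotal == shardIdentity
}

func main() {
	// Started as: fastly-exporter ... -shard-total 3 -shard-identity 1
	shardTotal, shardIdentity := uint64(3), uint64(1)
	for _, id := range []string{"SERVICEID1", "SERVICEID2", "SERVICEID3"} {
		fmt.Printf("%s owned by this shard: %v\n", id, owns(id, shardTotal, shardIdentity))
	}
}
```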

peterbourgon commented 5 years ago

Thinking further, a -service-name-regexp ... flag would also be a nice way to do things. Would that be equally effective for you?
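
As a rough illustration of the regexp idea, filtering discovered service names might look like the following; the flag name, the naming convention, and the choice of an exclude filter are all assumptions:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Hypothetical value passed via something like -service-name-regexp:
	// drop names that follow a CI-build or staging naming convention.
	exclude := regexp.MustCompile(`^(ci-build-|staging-)`)

	discovered := []string{"www.example.com", "api.example.com", "ci-build-1234", "staging-www"}

	var monitored []string
	for _, name := range discovered {
		if !exclude.MatchString(name) {
			monitored = append(monitored, name)
		}
	}
	fmt.Println(monitored) // [www.example.com api.example.com]
}
```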

neufeldtech commented 5 years ago

Did a few experiments with the docker-compose file included in the repo.

When running v2.2.0, I let the exporter attempt to discover all services dynamically.

When it boots, it discovers all 923 services successfully 👍

However, after letting it run for several minutes, it's only able to scrape 273 service IDs.

The exporter reports a constant stream of timeouts for the services it isn't able to scrape, similar to this:

fastly-exporter_1  | level=error component=monitors service_id=REDACTED service_name=redacted-service-name.example.com err="Get https://rt.fastly.com/v1/channel/REDACTED/ts/1563245983: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

I like the idea of using the hashmod approach that you've described for predictable 'sharding'.

I'm also a big fan of the regex approach, which I could see myself using to exclude large swaths of services that are generated for CI builds (and have convention-based names) but shouldn't be included in monitoring.

I think these two features together would provide the flexibility needed to easily support large deployments.

Let me know if you'd like me to run any further tests with different timeout settings, or if other logs would be helpful for looking into possible constraints of a single exporter.
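
For context on the error above: "Client.Timeout exceeded while awaiting headers" is Go's net/http client-level timeout firing. In a plain Go client that deadline is set roughly as below; the 60-second value is purely illustrative and says nothing about the exporter's actual flags or defaults, and the real request also needs a Fastly API token header, omitted here for brevity:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Timeout bounds the whole request: connect, TLS handshake, response
	// headers, and body. When it fires while waiting for headers, Go reports
	// "Client.Timeout exceeded while awaiting headers".
	client := &http.Client{Timeout: 60 * time.Second}

	resp, err := client.Get("https://rt.fastly.com/v1/channel/SERVICEID/ts/0")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```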

peterbourgon commented 5 years ago

Great, this is good food for thought. I'm pretty sure we can come up with something that will work, let me roll it around in my head for a little while. In the meantime, is it possible that you can build and test out the version in #32, under the maxconns branch? I have a hunch that it might eliminate the timeouts.

neufeldtech commented 5 years ago

I built the maxconns branch and tried it with docker-compose pointed at my local image, with the same results as before. As you can see, it is gradually able to scrape more services, but levels off at 275 this time (much the same as before).

[screenshot: graph showing the number of scraped services leveling off around 275]

I built and ran both the master branch and the maxconns branch locally on my OS X machine, and I only ever observed 2 connections to rt.fastly.com while it was running (even with all the services). Is this what you'd expect to see? 🤔

my-mac$ netstat -anv | grep `pidof fastly-exporter`
tcp4       0    113  10.0.0.145.54733       151.101.126.34.443     ESTABLISHED 566694 131072  24260      0 0x0102 0x00000020
tcp4       0      0  127.0.0.1.8080         *.*                    LISTEN      131072 131072  24260      0 0x0100 0x00000026
tcp4       0      0  10.0.0.145.54595       151.101.126.35.443     ESTABLISHED 243264 131768  24260      0 0x0102 0x00000028
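
Whether this applies depends on how the exporter builds its HTTP client, but for reference: Go's default net/http transport keeps at most two idle connections per host (http.DefaultMaxIdleConnsPerHost), and HTTP/2 multiplexes many requests over a small number of connections, either of which would be consistent with seeing only two sockets to rt.fastly.com. A sketch of a transport with raised per-host limits, with purely illustrative numbers:

```go
package main

import (
	"net/http"
	"time"
)

func main() {
	// The default transport allows only 2 idle connections per host
	// (http.DefaultMaxIdleConnsPerHost); raising the per-host limits lets
	// more concurrent requests to rt.fastly.com keep their own connections.
	transport := &http.Transport{
		MaxIdleConns:        0,                // 0 means no global idle cap
		MaxIdleConnsPerHost: 1000,             // illustrative, not a recommendation
		MaxConnsPerHost:     0,                // 0 means no per-host cap
		IdleConnTimeout:     90 * time.Second, // same as the stdlib default
	}
	client := &http.Client{Transport: transport, Timeout: 60 * time.Second}
	_ = client // hand this client to whatever issues the rt.fastly.com requests
}
```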

peterbourgon commented 5 years ago

Should be good now 🚀