fastly / fastly-exporter

A Prometheus exporter for the Fastly Real-time Analytics API
Apache License 2.0

Better logging for rt.fastly.com (Client.Timeout exceeded while awaiting headers) #114

Open mrnetops opened 2 years ago

mrnetops commented 2 years ago

Because fastly-exporter waits for new stats to be published for each service, we tend to get a ton of logging like this for services that are simply not handling requests, and so not generating stats:

level=error component=rt.fastly.com service_id=xxx during="execute request" err="Get \"https://rt.fastly.com/v1/channel/xxx/ts/1666656765\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"

This can make it hard to suss out whether there are in fact errors with, or connecting to, rt.fastly.com, vs. simply having a number of idle services. This is a problem that is going to scale with the number of services in play in the account in question (assuming more services overall is going to increase the incidence and volume of idle services).

Possibly these errors should be reclassified as info, since they are byproducts of the intended use case of connecting and listening for stat updates. And/or we should have better logging for when there are real issues (connection refused, non-2xx responses, etc.).

Short term, I have attempted to minimize the spurious errors with -rt-timeout 120s to increase the likelihood of a service request -> stat response.

Interestingly, that seems to have tentatively addressed all of the errors, which makes me wonder if there is an interaction with a maximum time to stat response from rt.fastly.com, even if stats are zero. So possibly, raise the default timeout above the maximum stat response time from rt.fastly.com (if that is in fact what is happening)?

leklund commented 1 year ago

@mrnetops I was trying to reproduce this issue and I'm unable to get request timeouts for new services or services without any data. The real-time stats API should return immediately if it doesn't have any data for a given service ID. It can wait up to 30 seconds for new data for a service that had some data previously, but that should still return well under the default 45 second timeout. Are you still able to reproduce this issue?

mrnetops commented 1 year ago

I don't think I have seen it come up recently, but I'll keep an eye out.