fastly / fastly-exporter

A Prometheus exporter for the Fastly Real-time Analytics API
Apache License 2.0

Metric Timestamps #77

Open SuperQ opened 3 years ago

SuperQ commented 3 years ago

While it's normally not best practice to expose timestamps, I think this exporter may need them.

I have been seeing issues with update vs scrape time alignment.

This may also be an artifact of the latency for ingesting updates from the real-time API. If the update comes in just slightly after the scrape, the update will not include some data. But the next scrape will catch up with the current value.

This causes artifacts in the graphs.

[Screenshot: Screen Shot 2021-10-04 at 18 04 12, graph showing the artifacts described above]
peterbourgon commented 3 years ago

~You can change the poll intervals.~

edit: Not the actual real-time stats interval, though — see below.

peterbourgon commented 3 years ago

Actually, just for my understanding...

The real-time API takes a Unix timestamp as input, and returns after an arbitrary period of time with, essentially, deltas of all the stats. So you could give ts=100 and get back ts=105 with 5s worth of data, or get back ts=113 with 13s of data, etc. What timestamp would you apply? The time when the response was received?

SuperQ commented 3 years ago

I'm thinking it should get the end timestamp of the delta accumulation.
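Concretely, given the API shape described above, the "end timestamp of the delta accumulation" would be the Timestamp field of each response. Below is a rough sketch of the polling pattern, not the exporter's actual subscriber code: the response struct is trimmed to the two fields mentioned in this thread, and the service ID and token are placeholders.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// rtResponse models only the two response fields discussed in this thread;
// the real payload also carries the per-second stats deltas.
type rtResponse struct {
	Timestamp      uint64 `json:"Timestamp"`      // end of the returned delta window
	AggregateDelay int    `json:"AggregateDelay"` // how far behind real time the data is
}

// poll sketches the long-poll loop: pass the previous Timestamp back as the
// cursor, and treat the returned Timestamp as the end of the window the
// deltas cover.
func poll(serviceID, apiToken string) error {
	var ts uint64 // 0 asks for the most recent window
	for {
		url := fmt.Sprintf("https://rt.fastly.com/v1/channel/%s/ts/%d", serviceID, ts)
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		req.Header.Set("Fastly-Key", apiToken)

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return err
		}
		var r rtResponse
		err = json.NewDecoder(resp.Body).Decode(&r)
		resp.Body.Close()
		if err != nil {
			return err
		}

		// r.Timestamp is the end of the delta accumulation: the window may be
		// 1s, 5s, 13s, etc. wide, but it always ends here.
		fmt.Printf("window ends at %s (aggregate delay ~%ds)\n",
			time.Unix(int64(r.Timestamp), 0).Format(time.RFC3339), r.AggregateDelay)
		ts = r.Timestamp
	}
}

func main() {
	// Placeholder service ID and token.
	if err := poll("SERVICE_ID", "FASTLY_API_TOKEN"); err != nil {
		log.Fatal(err)
	}
}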

SuperQ commented 3 years ago

Yes. My thinking is that, because there is some jitter both in the timing of each real-time poll and in when Prometheus performs the scrape, the samples should carry the end timestamp of the data.

peterbourgon commented 3 years ago

I dug around in the docs for about 15m and couldn't figure out how to do this. Given this is how metrics are updated, how should I apply a specific timestamp?
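For reference, client_golang can attach an explicit timestamp when a sample is emitted from a custom Collector, via prometheus.NewMetricWithTimestamp. A minimal, hypothetical sketch follows; this is not how the exporter currently wires its registered metrics, and the collector, metric name, and timestamp source are placeholders.

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// exampleCollector is a hypothetical custom collector, not the exporter's real one.
type exampleCollector struct {
	desc           *prometheus.Desc
	value          float64
	lastResponseTS uint64 // e.g. the real-time API's response_ts
}

func (c *exampleCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

func (c *exampleCollector) Collect(ch chan<- prometheus.Metric) {
	windowEnd := time.Unix(int64(c.lastResponseTS), 0)
	// NewMetricWithTimestamp wraps the const metric so the exposition carries
	// an explicit timestamp instead of defaulting to the scrape time.
	ch <- prometheus.NewMetricWithTimestamp(
		windowEnd,
		prometheus.MustNewConstMetric(c.desc, prometheus.CounterValue, c.value),
	)
}

func main() {
	c := &exampleCollector{
		desc:           prometheus.NewDesc("example_requests_total", "Hypothetical counter for illustration.", nil, nil),
		value:          42,
		lastResponseTS: uint64(time.Now().Unix()),
	}
	prometheus.MustRegister(c)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Samples exposed this way carry the given timestamp verbatim, so the scrape time no longer determines alignment; whether that's a good idea is what the rest of the thread debates.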

SuperQ commented 3 years ago

I'm going to continue to do some debugging on my end. I'll get back to you on what I find out.

bjk-reddit commented 3 years ago

Yup, exactly as I thought. The time to pull from the API is highly variable. This means that sometimes data arrives slightly on one side or the other of a scrape.

Test patch:

diff --git a/pkg/rt/subscriber.go b/pkg/rt/subscriber.go
index a0ec01d..8551018 100644
--- a/pkg/rt/subscriber.go
+++ b/pkg/rt/subscriber.go
@@ -131,6 +131,7 @@ func (s *Subscriber) Run(ctx context.Context) error {
 // Recoverable errors are logged internally and not returned, so any non-nil
 // error returned by this method should be considered fatal to the subscriber.
 func (s *Subscriber) query(ctx context.Context, ts uint64) (currentName string, result apiResult, delay time.Duration, newts uint64, fatal error) {
+       start := time.Now()
        name, ver, found := s.provider.Metadata(s.serviceID)
        version := strconv.Itoa(ver)
        if !found {
@@ -169,7 +170,7 @@ func (s *Subscriber) query(ctx context.Context, ts uint64) (currentName string,

        switch resp.StatusCode {
        case http.StatusOK:
-               level.Debug(s.logger).Log("status_code", resp.StatusCode, "response_ts", response.Timestamp, "err", apiErr)
+               level.Debug(s.logger).Log("status_code", resp.StatusCode, "response_ts", response.Timestamp, "err", apiErr, "took", time.Since(start).Seconds())
                if strings.Contains(apiErr, "No data available") {
                        result = apiResultNoData
                } else {

Results:

level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434059 err=<none> took=2.018836437
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434060 err=<none> took=0.971126704
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434061 err=<none> took=0.958338924
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434062 err=<none> took=1.061254601
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434063 err=<none> took=0.969232367
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434064 err=<none> took=0.979430087
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434065 err=<none> took=1.134385254
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434066 err=<none> took=0.829105108
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434067 err=<none> took=0.986448288
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434068 err=<none> took=0.988539304
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434069 err=<none> took=0.970611605
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434070 err=<none> took=1.12616397
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434071 err=<none> took=0.915297038
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434072 err=<none> took=1.005067603
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434073 err=<none> took=0.929945924
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434074 err=<none> took=1.074395758
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434075 err=<none> took=0.959273333
level=debug component=rt.fastly.com service_id=SECRET status_code=200 response_ts=1633434076 err=<none> took=0.995491316
peterbourgon commented 3 years ago

I guess I don't really understand how timestamps will help here. Are you saying to use the response_ts as the timestamp? Are you seeing that as substantially different from the local time?

edit: I ran it for a bit while logging the diff between the response_ts and the local timestamp when the response was received. It varied but only up to about 1100ms, which is exactly what you'd expect given the response_ts only has 1s granularity. It also seems to line up with your data. Given the API "refresh interval" is about 1s and the Prometheus scrape interval is 15s, I can't see how the dips and peaks in the graph are a function of any kind of data skew. Can you explain like I'm 5?

bjk-reddit commented 3 years ago

The issue is much more visible when you have a large amount of traffic and you're looking at the extrapolated rate between two scrapes. Any small amount of missing accumulation is greatly exaggerated by the extrapolation.

Think about it this way: if the update interval is 1s and the scrape interval is 15s, an individual scrape could be missing 1/15th of the data. This will lead to the rate being off by -6.7%. The next scrape includes the missing data, so it's now +6.7%.
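To put illustrative numbers on that (assumed, not measured here): at a steady 1,000 req/s, a 15s scrape window should cover 15,000 requests. If the latest 1s update lands just after the scrape, that window only sees 14,000, and rate() works out to roughly 14,000 / 15s ≈ 933 req/s (about -6.7%); the next window picks up 16,000 and works out to roughly 1,067 req/s (about +6.7%), even though the true rate never changed.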

By including the timestamp, we align the data with the actual time window it represents. Currently the scrape timestamps are misaligned with the end timestamp of the data, on top of the random lag in the API.

peterbourgon commented 3 years ago

By including the timestamp, we align the data with the actual time window it represents.

Ah, but the returned data always lags the truth to at least some degree. At the moment, by several seconds! See the AggregateDelay parameter described here, which we don't consider at all right now. That could be a feature request, but it's of arguable benefit, because there's always going to be the possibility of "data drift" where some increments arrive late to the system serving rt.fastly.com itself. So I don't think computing "the extrapolated rate between two scrapes" will ever be precise in the way you're assuming/hoping it is, just by the nature of the system.
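For what it's worth, a trivial sketch of what taking AggregateDelay into account could mean, assuming (based on the real-time stats docs, not anything this exporter currently does) that it reports how long the API waits before aggregating each second's messages:

package sketch

import "time"

// latestCompleteSecond returns the newest second that can be assumed complete,
// under the assumption that AggregateDelay reports how long rt.fastly.com
// waits before aggregating each second.
func latestCompleteSecond(now time.Time, aggregateDelaySeconds int) int64 {
	return now.Unix() - int64(aggregateDelaySeconds)
}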

Manipulating data timestamps also introduces risks related to those timestamps being incorrect, for whatever reason. That would require safeguards, and the semantics of those safeguards in the presence of unsound input data are non-obvious. Can of worms....

bjk-reddit commented 3 years ago

Lagging behind the truth is fine; what's important is that the timestamp of the truth is reasonable. We do this when importing data from other monitoring systems, for example AWS CloudWatch.

peterbourgon commented 3 years ago

I'm sure I'm being dense here but I still don't understand how a difference of less than a second — which is the granularity of the data itself — is meaningful enough to warrant correction. The delta never exceeds 1s, right? Would you not see exactly the same squiggles in the graph if we did the correction you're suggesting, but Prometheus happens to scrape on something other than precisely the .000 of the scrape interval? I'm honestly not trying to be difficult, I honestly don't understand how this would help. Or, I think maybe equivalently: I honestly don't understand why you think the extrapolated rate between two scrapes should be as precise as you're expecting.

bjk-reddit commented 3 years ago

The problem is that the API is crazy jittery, and aligns perfectly with second boundaries. So the delay wobbles the results before and after scrape alignment. It's fine if there is enough offset between the Prometheus scrape and the width of the API wobble. But when they're closely aligned, the data becomes excessively noisy.

I will work on a patch to see if I can demonstrate this.

peterbourgon commented 3 years ago

The problem is that the API is crazy jittery

In my tests it has a delta of like +/- 1s. Is that a lot? And any exporter which is a proxy to an API is surely going to have this issue, right? My intuition is that it would typically be much more "jitter" than 1s -- am I off base? Do all exporters like this need to provide timestamps?

SuperQ commented 3 years ago

Since it's basically polling another monitoring system, yes.

Another way we could handle this is to convert the exporter into a remote write streaming server. Rather than scrape -> scrape, it would be scrape -> stream.
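For illustration only, here is a minimal sketch of the scrape -> stream idea: push a sample, with an explicit timestamp, to a Prometheus remote write endpoint. This is not a design for the exporter; it assumes the prompb types from github.com/prometheus/prometheus plus gogo protobuf and golang/snappy, and the endpoint, metric name, and single-sample payload are placeholders.

package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gogo/protobuf/proto"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// pushSample streams one sample, with an explicit timestamp, to a Prometheus
// remote-write endpoint instead of waiting to be scraped.
func pushSample(endpoint, name string, value float64, ts time.Time) error {
	wr := &prompb.WriteRequest{
		Timeseries: []prompb.TimeSeries{{
			Labels:  []prompb.Label{{Name: "__name__", Value: name}},
			Samples: []prompb.Sample{{Value: value, Timestamp: ts.UnixMilli()}},
		}},
	}
	data, err := proto.Marshal(wr)
	if err != nil {
		return err
	}

	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(snappy.Encode(nil, data)))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Encoding", "snappy")
	req.Header.Set("Content-Type", "application/x-protobuf")
	req.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("remote write returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Placeholder endpoint and metric name.
	err := pushSample("http://localhost:9090/api/v1/write", "example_requests_total", 123, time.Now())
	if err != nil {
		log.Fatal(err)
	}
}

With remote write, every sample carries its own timestamp by construction, so the scrape-alignment question above disappears, at the cost of the exporter having to manage delivery itself.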