GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus

Improve error handling when translating classic histograms to distribution (e.g. parsing bucket "le" label edge cases) #982

Open bwplotka opened 4 months ago

bwplotka commented 4 months ago

We have one Cx hitting the following error from the CMP: One or more TimeSeries could not be written: Field timeSeries[0].points[0].distributionValue had an invalid value: Distribution |explicit_buckets.bounds| entry 1 has a value of 0.005 which is less than the value of entry 0 which is 0.005. This indicates one of the following:

A. The client exposes classic histogram series with two series for the same `le` value. This is impossible, as duplicated series are handled by the scraping logic.
B. Our conversion from string `le` labels to float buckets produced two identical 0.005 buckets. This can absolutely happen with extreme `le` values like "0.00500000000000000006" and "0.005" in subsequent classic series; see the repro and the sketch below. Furthermore, the check does not reject zero diffs between consecutive buckets, only negative ones.
C. A rounding error on the CMP side. Unlikely, as the CMP accepts double values as well.
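A minimal sketch of hypothesis B, assuming the `le` labels are converted with the standard `strconv.ParseFloat` (the actual conversion path in the collector may differ):

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Two distinct classic-histogram "le" label values...
	a, _ := strconv.ParseFloat("0.005", 64)
	b, _ := strconv.ParseFloat("0.00500000000000000006", 64)

	// ...collapse to the same float64, because they differ by less than
	// half a ULP at this magnitude.
	fmt.Println(a == b) // true
	fmt.Println(a, b)   // 0.005 0.005

	// If both end up in explicit_buckets.bounds, CMP rejects the
	// distribution because consecutive bounds must be strictly increasing.
}
```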

Action Items

TheSpiritXIII commented 3 months ago

I tested the equivalent in the standard math/big library and still can't get it to round up:

f, _, _ := big.ParseFloat("0.00500000000000000006", 10, 53, big.ToNearestAway) // f prints as 0.005

That value cannot be represented exactly as a float64 (as most decimal values can't).
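A quick sketch (not from the issue) showing that the two strings are distinct at higher precision but become identical once rounded to 53 bits of mantissa, i.e. float64 precision:

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// At 200 bits of precision the two values are distinguishable.
	hi1, _, _ := big.ParseFloat("0.005", 10, 200, big.ToNearestEven)
	hi2, _, _ := big.ParseFloat("0.00500000000000000006", 10, 200, big.ToNearestEven)
	fmt.Println(hi1.Cmp(hi2)) // -1

	// At 53 bits they collapse to the same value, which is exactly the
	// float64 nearest to 0.005.
	lo, _, _ := big.ParseFloat("0.00500000000000000006", 10, 53, big.ToNearestAway)
	f, _ := lo.Float64()
	fmt.Println(f) // 0.005
}
```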

What's fascinating is that the Golang client library does not validate the buckets, and other clients probably don't either. The restrictions are by contract in the options field comments. I wonder if these buckets are being created programmatically because I would be shocked if someone created buckets with that type of precision manually!

Further, how does Prometheus itself deal with these? If Prometheus simply stores these as string-typed labels, you would still be able to query them via PromQL.

The Golang client uses [sort.SearchFloat64s](https://pkg.go.dev/sort#SearchFloat64s) to pick the bucket, which is just a binary search. This means that whether an observation lands in the "0.00500000000000000006" bucket or the "0.005" bucket may vary between bucket sets, but it is always consistent within a given set, so one of the two buckets is always used and the other never is. We may be able to take advantage of the fact that one is never used when sending data to CMP and just ignore the unused one.

But, easier said than done.
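A rough sketch of what that dedup could look like, with hypothetical names and a simplified bucket model (counts[i] is the per-bucket count for the bucket whose upper bound is bounds[i]; the overflow bucket is left out), not the actual export code:

```go
package main

import "fmt"

// dedupBounds collapses equal consecutive bucket bounds (the result of two
// distinct "le" strings parsing to the same float64) and folds the count of
// the duplicated bucket into the previous one, so the total stays intact.
func dedupBounds(bounds []float64, counts []int64) ([]float64, []int64) {
	outBounds := make([]float64, 0, len(bounds))
	outCounts := make([]int64, 0, len(counts))
	for i := range bounds {
		if i > 0 && bounds[i] == outBounds[len(outBounds)-1] {
			// The duplicated bucket is never picked by the client's binary
			// search, so merging its count loses nothing.
			outCounts[len(outCounts)-1] += counts[i]
			continue
		}
		outBounds = append(outBounds, bounds[i])
		outCounts = append(outCounts, counts[i])
	}
	return outBounds, outCounts
}

func main() {
	// "0.00500000000000000006" and "0.005" both parse to 0.005.
	bounds := []float64{0.001, 0.005, 0.005, 0.01}
	counts := []int64{3, 7, 0, 2}
	b, c := dedupBounds(bounds, counts)
	fmt.Println(b, c) // [0.001 0.005 0.01] [3 7 2]
}
```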

Or, since the user can't send this metric anyway, maybe a simpler solution is to just treat it as invalid, drop it, and log a warning.

csapuntz commented 2 months ago

A big usability improvement would be knowing which metric is causing the problem.

csapuntz commented 1 month ago

FWIW, my org's software had a bug where it was emitting the same metrics twice, and that triggered these errors. Our collectors in GKE are using prometheus:v2.45.3-gmp.5-gke.0 (sha256:aa631b2f63f54f350887000fb9f1fc2cf1a4820652140e458751f27858bdf081)