elastic / integrations


[prometheus][remote_write] Failing to parse some histogram fields #7893

Closed tetianakravchenko closed 1 year ago

tetianakravchenko commented 1 year ago

Some documents are dropped due to:

"prometheus\":{\"apiserver_flowcontrol_priority_level_request_utilization\":{\"histogram\":{\"counts\":[5000945144],\"values\":[0.25]}},\"labels\":{\"instance\":\"10.128.0.10:443\",\"job\":\"kubernetes-apiservers\",\"phase\":\"waiting\",\"priority_level\":\"node-high\"}},\"service\":{\"type\":\"prometheus\"}}, Private:interface {}(nil), TimeSeries:true}, Flags:0x0, Cache:publisher.EventCache{m:mapstr.M(nil)}} (status=400): {\"type\":\"document_parsing_exception\",

    \"reason\":\"[1:2472] failed to parse field [prometheus.apiserver_flowcontrol_priority_level_request_utilization.histogram] of type [histogram]\",\"caused_by\":{\"type\":\"illegal_argument_exception\",

    "reason\":\"[1:2482] Numeric value (5000945144) out of range of int (-2147483648 - 2147483647)\\n at 
"reason":"[1:3039] failed to parse field [prometheus.go_gc_pauses_seconds_total.histogram] of type [histogram]","caused_by":{"type":"document_parsing_exception","reason":"[1:3039] error parsing field [prometheus.go_gc_pauses_seconds_total.histogram], [values] values must be in increasing order, got [-4.9E-324] but previous value was [0.0]"}}, dropping event!

This could be related to the fact that the data stream was deleted first in order to empty the index.
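For context, both messages come from the index-time validation of the Elasticsearch histogram field type and can be reproduced without the integration. Below is a minimal sketch in Go, assuming an unsecured local Elasticsearch on localhost:9200 and a throwaway index name histo-repro (both are assumptions for the reproduction, not from the report):

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// Assumes an unsecured Elasticsearch node on localhost:9200 and a throwaway
// index called histo-repro; adjust the URL and add auth for a real cluster.
const es = "http://localhost:9200"

// request sends a JSON request and prints the response, so the document
// parsing errors are visible on stdout.
func request(method, path, body string) {
	req, err := http.NewRequest(method, es+path, bytes.NewBufferString(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Printf("%s %s -> %s\n%s\n\n", method, path, resp.Status, out)
}

func main() {
	// Map a single field as the pre-aggregated "histogram" type, the same
	// type used for prometheus.<metric>.histogram by the integration.
	request("PUT", "/histo-repro", `{"mappings":{"properties":{"h":{"type":"histogram"}}}}`)

	// A count above 2147483647 triggers "Numeric value (5000945144) out of range of int".
	request("POST", "/histo-repro/_doc", `{"h":{"counts":[5000945144],"values":[0.25]}}`)

	// Non-increasing values trigger "[values] values must be in increasing order".
	request("POST", "/histo-repro/_doc", `{"h":{"counts":[0,1],"values":[0.0,-4.9e-324]}}`)
}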

tetianakravchenko commented 1 year ago

The second error ([values] values must be in increasing order, got [-4.9E-324] but previous value was [0.0]) is related to https://github.com/elastic/beats/issues/36317 and is going to be fixed soon.

pjbertels commented 1 year ago

job_name: 'kubernetes-apiservers' and job_name: 'kubernetes-cadvisor' are the two scrape targets that generate the histograms in my setup.

tetianakravchenko commented 1 year ago

I was able to reproduce the issue on my setup as well for multiple apiserver_flowcontrol_* histograms. It is actually just three metrics: apiserver_flowcontrol_priority_level_request_utilization, apiserver_flowcontrol_demand_seats, and apiserver_flowcontrol_read_vs_write_current_requests.

After some time I do see the histogram metric prometheus.apiserver_flowcontrol_priority_level_request_utilization.histogram, but it is empty - {"values":[],"counts":[]} - and I am not sure whether that is a correct value:

(screenshot attached, 2023-09-22)
tetianakravchenko commented 1 year ago

Opened an Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/99820. One thing I can think of for now is to add a check on the Beats side, so that the whole document with all the other metrics is not dropped.
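To illustrate the idea, here is a rough sketch of such a check (hypothetical names, not the actual Beats code): validate each histogram field before publishing and drop only the offending field, so the remaining metrics in the document can still be indexed.

package main

import (
	"fmt"
	"math"
)

// esHistogram is the counts/values pair that ends up in an Elasticsearch
// "histogram" field, e.g. prometheus.<metric>.histogram.
type esHistogram struct {
	Values []float64
	Counts []int64
}

// fitsIntRange reports whether all counts can be parsed as an int by
// Elasticsearch, which is the constraint the first error above violates.
func (h esHistogram) fitsIntRange() bool {
	for _, c := range h.Counts {
		if c > math.MaxInt32 || c < math.MinInt32 {
			return false
		}
	}
	return true
}

// dropInvalidHistograms is a hypothetical pre-publish check: it removes only
// the offending histogram fields instead of letting Elasticsearch reject the
// whole document.
func dropInvalidHistograms(fields map[string]esHistogram) {
	for name, h := range fields {
		if !h.fitsIntRange() {
			fmt.Printf("dropping histogram field %q, keeping the rest of the event\n", name)
			delete(fields, name)
		}
	}
}

func main() {
	fields := map[string]esHistogram{
		// The failing field from the report.
		"apiserver_flowcontrol_priority_level_request_utilization": {
			Values: []float64{0.25},
			Counts: []int64{5000945144},
		},
		// A hypothetical well-formed histogram that should survive.
		"some_other_histogram": {
			Values: []float64{0.1, 0.5},
			Counts: []int64{3, 7},
		},
	}
	dropInvalidHistograms(fields)
	fmt.Println(len(fields), "histogram field(s) left in the event")
}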

tetianakravchenko commented 1 year ago

regarding the error: reason":"[1:2805] failed to parse field [prometheus.go_gc_pauses_seconds_total.histogram] of type [histogram]","caused_by":{"type":"document_parsing_exception","reason":"[1:2805] error parsing field [prometheus.go_gc_pauses_seconds_total.histogram], [values] values must be in increasing order, got [-4.9E-324] but previous value was [0.0]"}}, dropping event!

All similar errors seem to be coming from the kubernetes-nodes job.

The actual metric looks like:

curl -s localhost:10249/metrics | grep go_gc_pauses_seconds_total
# HELP go_gc_pauses_seconds_total Distribution individual GC-related stop-the-world pause latencies.
# TYPE go_gc_pauses_seconds_total histogram
go_gc_pauses_seconds_total_bucket{le="-5e-324"} 0
go_gc_pauses_seconds_total_bucket{le="9.999999999999999e-10"} 0
go_gc_pauses_seconds_total_bucket{le="9.999999999999999e-09"} 0
go_gc_pauses_seconds_total_bucket{le="9.999999999999998e-08"} 0
go_gc_pauses_seconds_total_bucket{le="1.0239999999999999e-06"} 0
go_gc_pauses_seconds_total_bucket{le="1.0239999999999999e-05"} 24575
go_gc_pauses_seconds_total_bucket{le="0.00010239999999999998"} 25754
go_gc_pauses_seconds_total_bucket{le="0.0010485759999999998"} 51322
go_gc_pauses_seconds_total_bucket{le="0.010485759999999998"} 51579
go_gc_pauses_seconds_total_bucket{le="0.10485759999999998"} 51628
go_gc_pauses_seconds_total_bucket{le="+Inf"} 51628
go_gc_pauses_seconds_total_sum NaN
go_gc_pauses_seconds_total_count 51628

The first bucket upper bound is actually a negative value: le="-5e-324".

The same behavior occurs for some other metrics, e.g. go_sched_latencies_seconds.

This will be fixed by https://github.com/elastic/beats/pull/36647.
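For illustration, here is a simplified sketch of turning the cumulative Prometheus buckets above into the counts/values pair expected by the Elasticsearch histogram field. This is not the actual Beats conversion code and not necessarily what the linked PR does: it assumes the bucket upper bound is used as the representative value, de-cumulates the counts, and simply skips buckets with a negative upper bound such as le="-5e-324". The error above suggests the real metricset derives the values differently (0.0 followed by -4.9E-324), but the trigger is the same negative first bucket bound.

package main

import "fmt"

// bucket is one cumulative Prometheus histogram bucket (le = upper bound).
type bucket struct {
	le    float64
	count uint64
}

// toESHistogram de-cumulates the bucket counts and skips buckets with a
// negative upper bound, so the resulting values array stays increasing.
func toESHistogram(buckets []bucket) (values []float64, counts []uint64) {
	var prev uint64
	for _, b := range buckets {
		if b.le < 0 {
			continue // guard against bounds like le="-5e-324"
		}
		values = append(values, b.le)
		counts = append(counts, b.count-prev)
		prev = b.count
	}
	return values, counts
}

func main() {
	// A subset of the go_gc_pauses_seconds_total dump above (+Inf omitted).
	buckets := []bucket{
		{-5e-324, 0},
		{9.999999999999999e-10, 0},
		{1.0239999999999999e-05, 24575},
		{0.00010239999999999998, 25754},
		{0.0010485759999999998, 51322},
		{0.010485759999999998, 51579},
		{0.10485759999999998, 51628},
	}
	values, counts := toESHistogram(buckets)
	fmt.Println(values)
	fmt.Println(counts)
}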

tetianakravchenko commented 1 year ago

The first error - Numeric value (5000945144) out of range of int (-2147483648 - 2147483647) - should be fixed via https://github.com/elastic/elasticsearch/issues/99820.

The second error - [values] values must be in increasing order, got [-4.9E-324] but previous value was [0.0] - should be fixed by https://github.com/elastic/beats/pull/36647.

Both fixes have been merged and will be available in 8.11.0.