Closed angusscott closed 11 months ago
Just to add to the range of issues in these metrics.
example:
# TYPE cache_total counter
cache_total{target="acc_read"} 0
# TYPE cache_total counter
cache_total{target="st_read"} 0
# TYPE cache_total counter
cache_total{target="write"} 0
should be
# TYPE cache_total counter
cache_total{target="acc_read"} 0
cache_total{target="st_read"} 0
cache_total{target="write"} 0
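A minimal sketch of the cleanup described above, in Python for illustration only (this is not Erigon code, and `dedupe_type_lines` is a hypothetical helper name). It assumes the samples for a metric are already contiguous, as they are in this example, so dropping the repeated `# TYPE` lines is enough:

```python
def dedupe_type_lines(exposition: str) -> str:
    """Keep only the first '# TYPE <name> <type>' line for each metric name.

    Assumption: samples of the same metric already sit next to each other,
    so no regrouping is needed once the repeated TYPE lines are dropped.
    """
    seen = set()
    kept = []
    for line in exposition.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "TYPE":
            name = parts[2]
            if name in seen:
                continue  # repeated TYPE line for a name we already declared
            seen.add(name)
        kept.append(line)
    return "\n".join(kept)


before = """\
# TYPE cache_total counter
cache_total{target="acc_read"} 0
# TYPE cache_total counter
cache_total{target="st_read"} 0
# TYPE cache_total counter
cache_total{target="write"} 0"""

after = dedupe_type_lines(before)
print(after)
```

If samples of one metric were interleaved with other metrics, they would also need to be regrouped under their single TYPE line, which this sketch does not attempt.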
# TYPE rpc_duration_seconds{method="eth_call",success="success"} summary
should be
# TYPE rpc_duration_seconds summary
# TYPE rpc_duration_seconds{method="eth_call",success="success"} summary
rpc_duration_seconds{method="eth_call",success="success"} {quantile="0.5"} 0
rpc_duration_seconds{method="eth_call",success="success"} {quantile="0.9"} 0
rpc_duration_seconds{method="eth_call",success="success"} {quantile="0.97"} 0
rpc_duration_seconds{method="eth_call",success="success"} {quantile="0.99"} 0
rpc_duration_seconds{method="eth_call",success="success"} {quantile="1"} 0
rpc_duration_seconds{method="eth_call",success="success"}_time 300.000000
rpc_duration_seconds{method="eth_call",success="success"}_sum 0.000000
rpc_duration_seconds{method="eth_call",success="success"}_count 5
should be
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{method="eth_call",success="success", quantile="0.5"} 0
rpc_duration_seconds{method="eth_call",success="success", quantile="0.9"} 0
rpc_duration_seconds{method="eth_call",success="success", quantile="0.97"} 0
rpc_duration_seconds{method="eth_call",success="success", quantile="0.99"} 0
rpc_duration_seconds{method="eth_call",success="success", quantile="1"} 0
rpc_duration_seconds_time{method="eth_call",success="success"} 300.000000
rpc_duration_seconds_sum{method="eth_call",success="success"} 0.000000
rpc_duration_seconds_count{method="eth_call",success="success"} 5
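The three rewrites above (labels removed from the TYPE line, the `quantile` label merged into the main label set, and the `_time`/`_sum`/`_count` suffix moved in front of the labels) can be sketched as a line-by-line transform. This is only an illustration in Python under stated assumptions: `normalize_line` is a hypothetical helper, the patterns are fitted to the lines quoted in this issue, and label values containing `}` are not handled.

```python
import re

# Metric name pattern from the Prometheus text format.
NAME = r"([a-zA-Z_:][a-zA-Z0-9_:]*)"

# "# TYPE name{labels} type"       -> "# TYPE name type"
TYPE_WITH_LABELS = re.compile(r"^# TYPE " + NAME + r"\{[^}]*\}\s+(\w+)$")
# "name{labels} {extra} value"     -> "name{labels, extra} value"
SPLIT_LABELS = re.compile("^" + NAME + r"\{([^}]*)\}\s*\{([^}]*)\}\s+(\S+)$")
# "name{labels}_suffix value"      -> "name_suffix{labels} value"
TRAILING_SUFFIX = re.compile("^" + NAME + r"\{([^}]*)\}(_\w+)\s+(\S+)$")


def normalize_line(line: str) -> str:
    """Rewrite one malformed line; return it unchanged if nothing matches."""
    m = TYPE_WITH_LABELS.match(line)
    if m:
        return f"# TYPE {m.group(1)} {m.group(2)}"
    m = SPLIT_LABELS.match(line)
    if m:
        name, labels, extra, value = m.groups()
        return f"{name}{{{labels}, {extra}}} {value}"
    m = TRAILING_SUFFIX.match(line)
    if m:
        name, labels, suffix, value = m.groups()
        return f"{name}{suffix}{{{labels}}} {value}"
    return line


broken = [
    '# TYPE rpc_duration_seconds{method="eth_call",success="success"} summary',
    'rpc_duration_seconds{method="eth_call",success="success"} {quantile="0.5"} 0',
    'rpc_duration_seconds{method="eth_call",success="success"}_sum 0.000000',
    'rpc_duration_seconds{method="eth_call",success="success"}_count 5',
]
fixed = [normalize_line(line) for line in broken]
print("\n".join(fixed))
```

A post-processing pass like this only shows what the target shape is; the real fix belongs in the code that writes the metrics.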
@mh0lt tagging you since you worked on this the other day
Also looks like some of these are captured by https://github.com/ledgerwatch/erigon/issues/8155
I'm currently looking at pushing a fix into devel tomorrow am.
Hi everyone, I've discovered another bug.
Using https://github.com/ledgerwatch/erigon/pull/8186, the output is:
...
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success"} {quantile="0.5"} 0
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success"} {quantile="0.9"} 0
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success"} {quantile="0.97"} 0
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success"} {quantile="0.99"} 0
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success"} {quantile="1"} 0
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success"}_time 300.000000
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success"}_sum 0.000000
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success"}_count 5
...
Should be
...
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success", quantile="0.5"} 0
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success", quantile="0.9"} 0
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success", quantile="0.97"} 0
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success", quantile="0.99"} 0
rpc_duration_seconds{method="engine_exchangeTransitionConfigurationV1",success="success", quantile="1"} 0
# TYPE rpc_duration_seconds_time gauge
rpc_duration_seconds_time{method="engine_exchangeTransitionConfigurationV1",success="success"} 300.000000
# TYPE rpc_duration_seconds_sum counter
rpc_duration_seconds_sum{method="engine_exchangeTransitionConfigurationV1",success="success"} 0.000000
# TYPE rpc_duration_seconds_count counter
rpc_duration_seconds_count{method="engine_exchangeTransitionConfigurationV1",success="success"} 5
...
I have some confusion here. Why not use github.com/prometheus/client_golang directly instead of manually concatenating strings? Unless there's a specific reason, I'm planning to rewrite this part of the code using github.com/prometheus/client_golang.
@f0rmatting no reason, just unfinished migration. see https://github.com/ledgerwatch/erigon/issues/8206
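To illustrate why building the output from structured data avoids these bugs, here is a small sketch in Python. It is not the client_golang API and not Erigon code; `Registry`, `add`, and `render` are hypothetical names. The idea is that samples are stored per metric family, so the TYPE line is written exactly once per family and labels always end up inside the braces. Label values are not escaped here.

```python
class Registry:
    """Toy registry: collects samples per metric family, renders text once."""

    def __init__(self):
        # family name -> (metric type, list of (suffix, labels, value))
        self._families = {}

    def add(self, family, metric_type, labels, value, suffix=""):
        kind, samples = self._families.setdefault(family, (metric_type, []))
        if kind != metric_type:
            raise ValueError(f"{family} already registered as {kind}")
        samples.append((suffix, dict(labels), value))

    def render(self):
        lines = []
        for family, (kind, samples) in self._families.items():
            # One TYPE line per family, always before its samples.
            lines.append(f"# TYPE {family} {kind}")
            for suffix, labels, value in samples:
                label_text = ",".join(f'{k}="{v}"' for k, v in labels.items())
                lines.append(f"{family}{suffix}{{{label_text}}} {value}")
        return "\n".join(lines) + "\n"


registry = Registry()
for target in ("acc_read", "st_read", "write"):
    registry.add("cache_total", "counter", {"target": target}, 0)
registry.add("rpc_duration_seconds", "summary",
             {"method": "eth_call", "success": "success", "quantile": "0.5"}, 0)
registry.add("rpc_duration_seconds", "summary",
             {"method": "eth_call", "success": "success"}, 5, suffix="_count")

output = registry.render()
print(output)
```

A real client library additionally handles escaping, HELP lines, and concurrency, which is a good reason to finish the migration rather than maintain a hand-written formatter.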
This issue is stale because it has been open for 40 days with no activity. Remove stale label or comment, or this will be closed in 7 days.
This issue was closed because it has been stalled for 7 days with no activity.
System information
Erigon version: erigon version 2.49.1-stable-205eeda0
OS & Version: Linux
Erigon Command (with flags/config):
Chain/Network: Ethereum
Expected behaviour
Expect metrics to not be defined more than once
Actual behaviour
Metric TYPE lines are still being duplicated, and the output fails this linter: https://o11y.tools/metricslint/
Steps to reproduce the behaviour
Start an instance of Erigon, with --metrics enabled, and make a request.
Following on from https://github.com/ledgerwatch/erigon/issues/8053, the metrics endpoint (/debug/metrics/prometheus) still produces invalid metrics. The metrics were tested against this linter, https://o11y.tools/metricslint/, which reports the following error:
"text format parsing error in line 150: second TYPE line for metric name "cache_total", or TYPE reported after samples"
I'm verifying the same with Telegraf currently, and will include the output of that once done.
Result
(Metrics can be passed to linter to produce same result)
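For a quick local check without the linter, the two conditions named in the error message (a second TYPE line, or a TYPE line after samples) can be approximated with a short script. This is a rough sketch in Python, not the linter's actual implementation; `find_type_errors` is a hypothetical name, and it matches sample names exactly, so it does not map suffixed samples such as `_sum` or `_count` back to their summary family.

```python
def find_type_errors(exposition: str):
    """Return (line_number, message) pairs for repeated or late TYPE lines."""
    typed = set()    # metric names that already have a TYPE line
    sampled = set()  # metric names that already have at least one sample
    errors = []
    for number, raw in enumerate(exposition.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split()
            if len(parts) >= 3 and parts[0] == "#" and parts[1] == "TYPE":
                name = parts[2]
                if name in typed:
                    errors.append(
                        (number, f"second TYPE line for metric name {name!r}"))
                elif name in sampled:
                    errors.append(
                        (number, f"TYPE for {name!r} reported after samples"))
                typed.add(name)
            continue
        # Sample line: the name is everything before '{' or the first space.
        name = line.split("{", 1)[0].split()[0]
        sampled.add(name)
    return errors


scrape = """\
# TYPE cache_total counter
cache_total{target="acc_read"} 0
# TYPE cache_total counter
cache_total{target="st_read"} 0
"""

problems = find_type_errors(scrape)
for number, message in problems:
    print(f"line {number}: {message}")
```

In practice the input would be the body saved from /debug/metrics/prometheus rather than the inline string used here.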