Open lcmgh opened 1 year ago
Hello,

Thanks for the detailed bug report. I don't think I can take a deeper look at it before the end of the week, but a preliminary check I can do is: the `function_calls_duration` metric is a summary instead of a histogram (not your fault, and since we set the exporter using buckets here, it actually might be a bug/issue in the `metrics-exporter-prometheus` lib we use for you here). To be clear, that's why the dashboard is failing; the queries are made for a histogram metric, not a summary one.

Is there any specific reason you chose the `metrics` feature? Can you try to use the `prometheus` feature instead and tell us if you have the same issue? If you don't have the issue anymore, I'll create an issue from here to autometrics-rs, as it's a bug with the `metrics` feature. If you still have the issue, can you show the data sent by the Prometheus endpoint again?
Hi @gagbo! Thanks for checking back. I have used the `metrics` feature because my app already has a Prometheus exporter endpoint:
```rust
use metrics_exporter_prometheus::PrometheusBuilder;

let prom = PrometheusBuilder::new();
prom.install()
    .expect("failed to install recorder/exporter");
```
I was not sure if that feature exposes the Prometheus endpoint on a port and didn't want any interference here.

Switching to `prometheus` makes the autometrics metrics no longer appear on my existing exporter endpoint. After switching back to `metrics`, the output is similar to before:
```
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.5"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.9"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.95"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.99"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.999"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="1"} 0
function_calls_duration_sum{function="batch_insert",module="myapp::mod::dao"} 0.012257653
function_calls_duration_count{function="batch_insert",module="myapp::mod::dao"} 3
```
That's weird too (that using the `prometheus` feature alone doesn't work with the existing metrics setup). Are you using the default registry of the Prometheus client library, or a separate custom one?

For the record, the `prometheus` feature doesn't try to expose a scraping endpoint; the `prometheus-exporter` feature does. It's currently documented under "For projects already using custom Prometheus metrics" in the README; not sure yet what we can do to make that clearer.
To go back on topic, can you instead try to change your Prometheus exporter initialization:

```diff
 use metrics_exporter_prometheus::PrometheusBuilder;

 let prom = PrometheusBuilder::new();
 prom
+    .set_buckets(&[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0])
+    .expect("Failed to set histogram buckets")
     .install()
     .expect("failed to install recorder/exporter");
```
and see if it solves at least[^compat] the "`function_calls_duration` as `summary`" issue?

[^compat]: I think this might be incompatible with your existing metrics (because this apparently forces all histograms to be exported as `histogram` instead of `summary`), but if that's the case I'd rather handle it in a separate issue.
I have applied your changes and the metrics endpoint now gives:

```
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.005"} 2
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.01"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.025"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.05"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.075"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.1"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.25"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.75"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="1"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="2.5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="7.5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="10"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="+Inf"} 3
function_calls_duration_sum{function="batch_insert",module="myapp::mod::dao"} 0.013500508000000001
function_calls_duration_count{function="batch_insert",module="myapp::mod::dao"} 3
```
Yeah, that's better; at least the duration-related metrics should work. I still have 2 questions: is the dashboard showing data now, and do you also have a `function_calls_count` metric that is exposed?

* Here is the `function_calls_count` metric (forgot to include it in my previous posts):

```
# HELP function_calls_count Autometrics counter for tracking function calls
# TYPE function_calls_count counter
function_calls_count{function="batch_insert",module="myapp::mod::dao",caller="",result="ok"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.005"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.01"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.025"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.05"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.075"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.1"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.25"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.75"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="1"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="2.5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="7.5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="10"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="+Inf"} 1
function_calls_duration_sum{function="batch_insert",module="myapp::mod::dao"} 0.002298264
function_calls_duration_count{function="batch_insert",module="myapp::mod::dao"} 1
```

* Dashboard still shows no data in all three panels (with the code from https://github.com/autometrics-dev/autometrics-shared/issues/66#issuecomment-1611255460 applied).
I see that the output of the scraping endpoint looks more like what I'm expecting, so this part is solved.

I'm trying to understand what happened to the dashboard, but I think I'll have to test and reproduce this to understand the issue. Seeing the `batch_insert` function name as input shows that at least the part extracting valid function labels is working fine (so the dashboard is well connected to the Prometheus data source, and can see the autometrics metrics). The fact that none of the graphs show data makes me think that maybe the issue is with our `build_info` metric and how we use it in the query to generate the graph.

Also, the sampling interval used to show the graph is dynamic, and changes with the timespan shown. Do you still have the issue if you change the "Last 6 hours" in the top-right corner to something significantly shorter like "Last 5 minutes"? I hope that it will give enough points (depending on your `scrape_interval`) to show a curve.
> Do you still have the issue if you change the "Last 6 hours" in the top-right corner to something significantly shorter like "Last 5 minutes"?
Oh, actually I now have found some lines in the Latency (95th and 99th Percentile) panel, but still no data in the other ones. The other ones even show an error, but only when going back to 7 days:

```
execution: found duplicate series for the match group {instance="xxxxx:10250", job="kubelet"} on the right hand-side of the operation: [{__name__="up", endpoint="https-metrics", instance="xxxx:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="kube-system", node="xxxxxxxxxxx", service="prometheus-operator-kubelet"}, {__name__="up", endpoint="https-metrics", instance="xxxx", job="kubelet", metrics_path="/metrics/probes", namespace="kube-system", node="xxxxx", service="prometheus-operator-kubelet"}]; many-to-many matching not allowed: matching labels must be unique on one side
```
> The fact that none of the graphs show data makes me think that maybe the issue is with our build_info metric and how we use it in the query to generate the graph.

```
build_info{commit="",version="1.1.7",branch=""} 1
```
The issue might be specifically triggered by your setup then; it will be hard for me to reproduce it.

Can you try to edit the queries' `on (instance, job)` clauses, replacing them with `on (instance, job, metrics_path)`, and see if that works? I don't understand how you can have `(instance, job)` pairs that are not unique, but it looks like that's what the error message is saying.
> and see if that works?
Yes, this made it work! The legend says `{function="batch_insert", module="myapp::mod::dao", version="1.1.7"}`
Thanks for bearing with me! We always assumed that `(instance, job)` would be enough to uniquely identify scrape jobs, but it looks like that assumption was wrong :) Now we'll just need to make sure that the `metrics_path` label is a built-in default label and present in all versions of Prometheus that matter to our use cases. Thanks again :)
It's me who needs to be thankful! Now I'll just adapt my dashboard and my teammates will be ready to enjoy the new dashboard :) awesome!
Huh, I noticed another thing: the duration of my `fn` is always shown as 0. I'd expect it to be > 0 when there are data points (hovering over the yellow line also shows 0):

Oh right, that's because your function calls probably always last less than the smallest bucket, which is `0.005s` (on average you have `0.0045` according to this comment).
So if you want to have more granularity there, you should change the buckets in your exporter config:

```diff
 use metrics_exporter_prometheus::PrometheusBuilder;

 let prom = PrometheusBuilder::new();
 prom
-    .set_buckets(&[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0])
+    .set_buckets(&[0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5])
     .expect("Failed to set histogram buckets")
     .install()
     .expect("failed to install recorder/exporter");
```
There's no hard requirement on the values in the buckets (no specific value like `0.02` must appear in the buckets); just pick boundaries that cover the latencies you expect to see.

Thanks for making this clear to me. In production I will have longer runtimes, so this should be fine :).
I stumbled over another issue: metrics data is only shown when selecting "Last 7 days" in Grafana, even though there are data points for the last days/hours. For another fn call I could not get it working at all.
> I stumbled over another issue: metrics data is only shown when selecting "Last 7 days" in Grafana, even though there are data points for the last days/hours. For another fn call I could not get it working at all.
What is the `scrape_interval` of your project? Can you try to change the query in the dashboard to replace the `last_over_time(build_info[$__rate_interval])` bits with `last_over_time(build_info[1h])` (`1h`, or anything longer than your `scrape_interval` really) and tell me if that works?
I'm running into a similar issue just now (haven't had time to look into that before, sorry), and it might be the solution. It also looks like a commit tried to solve that kind of issue recently, but I don't know whether you tried running that version of the dashboard or if it was indeed the problem.
My scrape interval is 1m :).

Using https://github.com/autometrics-dev/autometrics-shared/blob/62f4dfd0a52e23874b3925039c17c879ca7aea49/dashboards/Autometrics%20Function%20Explorer.json directly makes my dashboard show no data at all.
That's too bad; especially if you get no data at all, I'd guess there's something we didn't think of when making the queries.
For the record, I just learnt that using `$__rate_interval` as we do in the dashboard is perfectly fine, if you configured the scrape interval of your Prometheus data source to match the `1m` you have:

Source of the screenshot with the explanation
Other than that, I wasn't able to reproduce the issue locally, so I'll watch our staging environment to see if the issue shows up there, so I can debug it hands-on.

EDIT: the only way I was able to reproduce it was with a too-short timespan for the ranges in the queries. So I think that configuring the scrape interval in your Grafana "Prometheus data source" settings is going to fix most of the issues.
Gory details of my current root-cause hypothesis below:

`$__rate_interval` is the maximum of `4 * scrape_interval` and `interval` (the automatic interval given the time range of the graph). When you use "Last 7 days" as the range of the graph, `interval` became 5 minutes in my Grafana, which is typically more than the recommended interval for a 1 minute scrape interval. When you go lower than this, `interval` decreases and `4 * scrape_interval` is supposed to be the anchor value. With the default of `15s`, the default "anchor" is 1 minute, which means all the `rate` computations the dashboard does only have 1 point in time to guess a rate; therefore, the dashboard can't compute rates and shows "No data". (And we internally never hit that issue because we're setting the interval programmatically.)
All this to say: if your Prometheus data source already has `1m` as the scrape interval in the Grafana settings, I'm stumped again.
I stand corrected: the issue with no data being shown seems to be fixed with the updated query mentioned earlier in this thread.

Still, for one function I have a datapoint with count zero. Not sure if this is because I forgot to set the Prometheus buckets. The actual function call duration is around 12 sec, so I thought the default settings would be fine.
Metrics endpoint output:

```
# HELP function_calls_count Autometrics counter for tracking function calls
# TYPE function_calls_count counter
function_calls_count{function="receive_and_send_rows",module="myservice_sl_server::server",caller=""} 1
# HELP build_info Autometrics info metric for tracking software version and build details
# TYPE build_info gauge
build_info{commit="",version="0.1.0",branch=""} 1
# TYPE health gauge
health{target="Total"} 1
# HELP function_calls_duration Autometrics histogram for tracking function call duration
# TYPE function_calls_duration summary
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.5"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.9"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.95"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.99"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.999"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="1"} 0
function_calls_duration_sum{function="receive_and_send_rows",module="myservice_sl_server::server"} 13.730152187
function_calls_duration_count{function="receive_and_send_rows",module="myservice_sl_server::server"} 1
```
Prometheus exporter setup:

```rust
PrometheusBuilder::new()
    .install()
    .expect("failed to install recorder/exporter");
```
When going with the following, no data is shown but the function appears in my dropdown.

```rust
PrometheusBuilder::new()
    .set_buckets(&[
        0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0,
    ])
    .expect("Failed to set histogram buckets")
    .install()
    .expect("failed to install recorder/exporter");
```
```
# HELP function_calls_count Autometrics counter for tracking function calls
# TYPE function_calls_count counter
function_calls_count{function="read_and_stream_rows",module="myservice_sl_server::db",caller=""} 1
# TYPE health gauge
health{target="Total"} 1
# HELP build_info Autometrics info metric for tracking software version and build details
# TYPE build_info gauge
build_info{commit="",version="0.1.0",branch=""} 1
# HELP function_calls_duration Autometrics histogram for tracking function call duration
# TYPE function_calls_duration histogram
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.005"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.01"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.025"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.05"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.075"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.1"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.25"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.75"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="1"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="2.5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="7.5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="10"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="+Inf"} 1
function_calls_duration_sum{function="read_and_stream_rows",module="myservice_sl_server::db"} 15.281207034
function_calls_duration_count{function="read_and_stream_rows",module="myservice_sl_server::db"} 1
```
> Still for one function I have a datapoint with count zero. Not sure if this is because I have forgotten to set the Prometheus buckets.

No, don't worry about this. To help with metrics discovery, the library initializes the counters of autometricized functions at startup in debug builds. That allows the dashboards to show data even if nothing has happened.
> When going with the following, no data is shown but the function appears in my dropdown.

Did you check the settings for the Prometheus source in Grafana? I still think that changing the scrape interval to `1m` (or `60s`, if the field only takes seconds) like in the screenshot above would help with the "No data" issue.
Our Prometheus uses a scrape interval of 60s.
First of all, thanks for making life easier :) I just gave it a first try and annotated a `fn` in my Rust app with autometrics. But in the dashboard no data is shown. Also, all quantile metrics are set to 0 even though the fn got some calls.

Version:

```toml
autometrics = { version = "0.5.0", features = ["metrics"] }
```

1) Data from my app's Prometheus endpoint
2) My data is scraped
3) Dashboard shows no data