autometrics-dev / autometrics-shared

Resources used by all of the autometrics implementations
https://autometrics.dev
Apache License 2.0

no data on Functions explorer dashboard with Rust app #66

Open lcmgh opened 1 year ago

lcmgh commented 1 year ago

First of all, thanks for making life easier :) I just gave it a first try and annotated a fn in my Rust app with autometrics, but the dashboard shows no data. Also, all quantile metrics are set to 0 even though the fn got some calls.

Version: autometrics = { version = "0.5.0", features = ["metrics"] }
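
For context, the annotation itself looks roughly like this; this is a sketch only, since the real batch_insert signature in myapp::mymod::dao isn't shown in this issue, and the argument and return types below are placeholders:

use autometrics::autometrics;

// Hypothetical stand-in for myapp::mymod::dao::batch_insert.
#[autometrics]
pub fn batch_insert(rows: &[String]) -> Result<usize, std::io::Error> {
    // ... actual insert logic would go here ...
    Ok(rows.len())
}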

1) Data from my app's Prometheus endpoint

# TYPE function_calls_duration summary
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.5"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.9"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.95"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.99"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="0.999"} 0
function_calls_duration{function="batch_insert",module="myapp::mymod::dao",quantile="1"} 0
function_calls_duration_sum{function="batch_insert",module="myapp::mymod::dao"} 0.01188134
function_calls_duration_count{function="batch_insert",module="myapp::mymod::dao"} 3

2) My data is scraped

3) Dashboard shows no data

(screenshot: Functions Explorer dashboard panels showing no data)

gagbo commented 1 year ago

Hello,

Thanks for the detailed bug report. I don't think I can take a deeper look at it before the end of the week, but here are a couple of preliminary checks:

Is there any specific reason you chose the metrics feature?

Can you try to use the prometheus feature instead and tell us if you have the same issue? If you don't have the issue anymore, I'll open an issue in autometrics-rs, since it would be a bug in the metrics feature. If you still have the issue, can you show the data sent by the Prometheus endpoint again?
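
For reference, if no other autometrics features are in use, the switch should just be the feature flag in Cargo.toml:

autometrics = { version = "0.5.0", features = ["prometheus"] }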

lcmgh commented 1 year ago

Hi @gagbo! Thanks for checking back. I used the metrics feature because my app already has a Prometheus exporter endpoint.

use metrics_exporter_prometheus::PrometheusBuilder;

let prom = PrometheusBuilder::new();
prom
    .install()
    .expect("failed to install recorder/exporter");

I was not sure whether that feature exposes a Prometheus endpoint on a port of its own, and I didn't want any interference there.

Switching to prometheus makes the autometrics metrics disappear from my existing exporter endpoint.

After switching back to metrics, the output looks much like before:

function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.5"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.9"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.95"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.99"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="0.999"} 0
function_calls_duration{function="batch_insert",module="myapp::mod::dao",quantile="1"} 0
function_calls_duration_sum{function="batch_insert",module="myapp::mod::dao"} 0.012257653
function_calls_duration_count{function="batch_insert",module="myapp::mod::dao"} 3
gagbo commented 1 year ago

That's weird too (that using the prometheus feature alone doesn't work with the existing metrics setup). Are you using the default registry of the Prometheus client library, or a custom one?

For the record, the prometheus feature doesn't try to expose a scraping endpoint; the prometheus-exporter feature does. It's currently documented under "For projects already using custom Prometheus metrics" in the README, not sure what we can do yet to make that clearer.
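
In Cargo.toml terms, the difference would look roughly like this (a sketch; check the crate docs for the exact feature set of your autometrics version):

# records metrics, but does not expose a scraping endpoint itself
autometrics = { version = "0.5.0", features = ["prometheus"] }

# also exposes a scraping endpoint for you
autometrics = { version = "0.5.0", features = ["prometheus-exporter"] }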

To get back on topic, can you instead try changing your Prometheus exporter initialization:

use metrics_exporter_prometheus::PrometheusBuilder;

let prom = PrometheusBuilder::new();
prom
+    .set_buckets(&[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0])
+    .expect("Failed to set histogram buckets")
    .install()
    .expect("failed to install recorder/exporter");

and see if it solves at least[^compat] the "function_calls_duration as summary" issue?

[^compat]: I think this might be incompatible with your existing metrics (because this apparently forces all histograms to be exported as histograms instead of summaries), but if that's the case I'd rather handle it in a separate issue.

lcmgh commented 1 year ago

I have applied your changes and the metrics endpoint now gives:

function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.005"} 2
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.01"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.025"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.05"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.075"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.1"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.25"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.75"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="1"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="2.5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="7.5"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="10"} 3
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="+Inf"} 3
function_calls_duration_sum{function="batch_insert",module="myapp::mod::dao"} 0.013500508000000001
function_calls_duration_count{function="batch_insert",module="myapp::mod::dao"} 3
gagbo commented 1 year ago

Yeah, that's better; at least the duration-related metrics should work. I still have two questions:

lcmgh commented 1 year ago

# HELP function_calls_duration Autometrics histogram for tracking function call duration
# TYPE function_calls_duration histogram
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.005"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.01"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.025"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.05"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.075"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.1"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.25"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="0.75"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="1"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="2.5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="7.5"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="10"} 1
function_calls_duration_bucket{function="batch_insert",module="myapp::mod::dao",le="+Inf"} 1
function_calls_duration_sum{function="batch_insert",module="myapp::mod::dao"} 0.002298264
function_calls_duration_count{function="batch_insert",module="myapp::mod::dao"} 1


* Dashboard still shows no data in all three panels

This is with the code from https://github.com/autometrics-dev/autometrics-shared/issues/66#issuecomment-1611255460 applied.
gagbo commented 1 year ago

I see that the output of the scraping endpoint looks more like what I'm expecting, so this part is solved.

I'm trying to understand what happened to the dashboard, but I think I'll have to test this and reproduce it to understand the issue.

Also, the sampling interval used to draw the graph is dynamic and changes with the timespan shown. Do you still have the issue if you change the "Last 6 hours" in the top-right corner to something significantly shorter like "Last 5 minutes"? I hope that will give enough points (depending on your scrape_interval) to show a curve.

lcmgh commented 1 year ago

Do you still have the issue if you change the "Last 6 hours" in the top-right corner to something significantly shorter like "Last 5 minutes"?

Oh, actually I have now found some lines in the Latency (95th and 99th Percentile) panel, but still no data in the other ones. The other panels even show an error, but only when going back to 7 days:

execution: found duplicate series for the match group {instance="xxxxx:10250", job="kubelet"} on the right hand-side of the operation: [{__name__="up", endpoint="https-metrics", instance="xxxx:10250", job="kubelet", metrics_path="/metrics/cadvisor", namespace="kube-system", node="xxxxxxxxxxx", service="prometheus-operator-kubelet"}, {__name__="up", endpoint="https-metrics", instance="xxxx", job="kubelet", metrics_path="/metrics/probes", namespace="kube-system", node="xxxxx", service="prometheus-operator-kubelet"}];many-to-many matching not allowed: matching labels must be unique on one side

The fact that none of the graphs show data makes me think that maybe the issue is with our build_info metric and how we use it in the query to generate the graph.

build_info{commit="",version="1.1.7",branch=""} 1

gagbo commented 1 year ago

The issue might be specifically triggered by your setup then; it will be hard for me to reproduce.

Can you try to:

and see if that works? I don't understand how you can have (instance, job) pairs that are not unique, but it looks like that's what the error message is saying.

lcmgh commented 1 year ago

and see if that works? I

Yes this made it work! The legend says {function="batch_insert", module="myapp::mod::dao", version="1.1.7"}

gagbo commented 1 year ago

Thanks for bearing with me! We always assumed that (instance, job) would be enough to uniquely identify scrape jobs, but it looks like that assumption was wrong :) Now we'll just need to make sure that the metrics_path label is a built-in default label and present in all versions of Prometheus that matter to our use cases. Thanks again :)
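
For anyone landing here later: the dashboard-side fix presumably boils down to including metrics_path in the labels used to match against build_info. A simplified sketch of the query shape (not the exact Functions Explorer query):

# sketch: join the function counter with build_info on an extra metrics_path label
sum by (function, module, version) (
  rate(function_calls_count[$__rate_interval])
  * on (instance, job, metrics_path) group_left(version)
  last_over_time(build_info[$__rate_interval])
)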

lcmgh commented 1 year ago

It's me who needs to be thankful! I'll just adapt my dashboard and my teammates will be ready to enjoy the new dashboard :) Awesome!

lcmgh commented 1 year ago

Huh, I noticed another thing: the duration of my fn is always shown as 0. I'd expect it to be > 0 when there are data points (hovering over the yellow line also shows 0):

(screenshot: latency panel where hovering over the yellow line shows a duration of 0)
gagbo commented 1 year ago

Oh right: that's because your function calls probably always last less than the smallest bucket, which is 0.005s (on average you're at about 0.0045s according to this comment).

So if you want to have more granularity there, you should change the buckets in your exporter config:

use metrics_exporter_prometheus::PrometheusBuilder;

let prom = PrometheusBuilder::new();
prom
-    .set_buckets(&[0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0])
+    .set_buckets(&[0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5])
    .expect("Failed to set histogram buckets")
    .install()
    .expect("failed to install recorder/exporter");

There's no hard requirement on the values in the buckets, just:

lcmgh commented 1 year ago

Thanks for making this clear to me. In production I will have longer runtimes so this should be fine :).

lcmgh commented 1 year ago

I stumbled over another issue. Metrics data is only shown when selecting "Last 7 days" in Grafana, even though there are data points for the last days/hours. For another fn call I could not get it to work at all.

gagbo commented 1 year ago

I stumbled over another issue. Metrics data is only shown when selecting "Last 7 days" in Grafana, even though there are data points for the last days/hours. For another fn call I could not get it to work at all.

What is the scrape_interval of your project? Can you try to change the query in the dashboard to replace the last_over_time(build_info[$__rate_interval]) bits with last_over_time(build_info[1h]) (1h, or anything longer than your scrape_interval really) and tell me if that works?
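
Concretely, only the range selector inside the query changes, e.g.:

last_over_time(build_info[$__rate_interval])   # before
last_over_time(build_info[1h])                 # after: any range comfortably longer than the scrape_interval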

I'm running into a similar issue just now (I haven't had time to look into that before, sorry), and it might be the solution. It also looks like a commit tried to solve that kind of issue recently, but I don't know whether you tried running this version of the dashboard or if it was indeed the problem.

lcmgh commented 1 year ago

My scrape interval is 1m :).

Using https://github.com/autometrics-dev/autometrics-shared/blob/62f4dfd0a52e23874b3925039c17c879ca7aea49/dashboards/Autometrics%20Function%20Explorer.json directly makes my dashboard show no data at all.

gagbo commented 1 year ago

That's too bad, especially the part about getting no data at all; there's something we didn't think of when making the queries, I'd guess.

For the record, I just learnt that using $__rate_interval as we do in the dashboard is perfectly fine, provided you configure the scrape interval of your Prometheus data source to match the 1m you have:

(screenshot: settings screen for a Prometheus data source in Grafana, showing the scrape interval field)

Source of the screenshot with the explanation

Other than that, I wasn't able to reproduce the issue locally, so I'll keep an eye on our staging environment to see if the issue shows up there, so I can debug it hands-on.

EDIT: the only way I was able to reproduce it was to have too short a timespan for the ranges in the queries. So I think that configuring the scrape interval in your Grafana "Prometheus data source" settings is going to fix most of the issues.

Gory details of my current root cause hypothesis below:

$__rate_interval is the maximum of 4 * scrape_interval and interval (the automatic interval given the time range of the graph). When you use "Last 7 days" as the range of the graph, interval became 5 minutes in my Grafana, which is comfortably more than the recommended interval for a 1-minute scrape interval. When you go lower than this, interval decreases and 4 * scrape_interval is supposed to become the anchor value. With the default of 15s, that anchor is 1 minute, which means all the rate computations the dashboard does only have a single point in time to guess a rate from; therefore the dashboard can't compute rates and shows "no data". (And we internally never hit that issue because we're setting the interval programmatically.)
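
As a rough sanity check with the numbers from this thread, using the approximation above (actual scrape_interval of 1m, Grafana data source left at its 15s default):

anchor with the 15s default: 4 * 15s = 1m  -> about one 1m sample per rate window -> "No data"
anchor with 1m configured:   4 * 1m  = 4m  -> about four samples per rate window -> rates can be computed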

All this to say: if your Prometheus data source already has 1 minute as its scrape interval in the Grafana settings, I'm stumped again.

lcmgh commented 1 year ago

I stand corrected: the issue with no data being shown seems to be fixed with the updated query mentioned earlier in this thread.

Still, for one function I have a data point with a count of zero. Not sure if this is because I forgot to set the Prometheus buckets. The actual function call duration is around 12s, so I thought the default settings would be fine.

(screenshot: dashboard panel showing a data point with a count of 0)

metrics endpoint output

# HELP function_calls_count Autometrics counter for tracking function calls
# TYPE function_calls_count counter
function_calls_count{function="receive_and_send_rows",module="myservice_sl_server::server",caller=""} 1

# HELP build_info Autometrics info metric for tracking software version and build details
# TYPE build_info gauge
build_info{commit="",version="0.1.0",branch=""} 1

# TYPE health gauge
health{target="Total"} 1

# HELP function_calls_duration Autometrics histogram for tracking function call duration
# TYPE function_calls_duration summary
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.5"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.9"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.95"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.99"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="0.999"} 0
function_calls_duration{function="receive_and_send_rows",module="myservice_sl_server::server",quantile="1"} 0
function_calls_duration_sum{function="receive_and_send_rows",module="myservice_sl_server::server"} 13.730152187
function_calls_duration_count{function="receive_and_send_rows",module="myservice_sl_server::server"} 1

Prometheus exporter setup

PrometheusBuilder::new()
    .install()
    .expect("failed to install recorder/exporter");
lcmgh commented 1 year ago

When going with the following, no data is shown but the function appears in my dropdown.

PrometheusBuilder::new()
    .set_buckets(&[
        0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0,
    ])
    .expect("Failed to set histogram buckets")
    .install()
    .expect("failed to install recorder/exporter");
# HELP function_calls_count Autometrics counter for tracking function calls
# TYPE function_calls_count counter
function_calls_count{function="read_and_stream_rows",module="myservice_sl_server::db",caller=""} 1

# TYPE health gauge
health{target="Total"} 1

# HELP build_info Autometrics info metric for tracking software version and build details
# TYPE build_info gauge
build_info{commit="",version="0.1.0",branch=""} 1

# HELP function_calls_duration Autometrics histogram for tracking function call duration
# TYPE function_calls_duration histogram
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.005"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.01"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.025"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.05"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.075"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.1"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.25"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="0.75"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="1"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="2.5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="7.5"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="10"} 0
function_calls_duration_bucket{function="read_and_stream_rows",module="myservice_sl_server::db",le="+Inf"} 1
function_calls_duration_sum{function="read_and_stream_rows",module="myservice_sl_server::db"} 15.281207034
function_calls_duration_count{function="read_and_stream_rows",module="myservice_sl_server::db"} 1
gagbo commented 1 year ago

Still for one function I have a datapoint with count zero. Not sure if this is because I have forgotten to set the Prometheus buckets.

No, don't worry about this. To help with metrics discovery, the library initializes the counters of autometricized functions at startup in Debug builds. That allows the dashboards to show data even if nothing has happened yet.

When going with the following, no data is shown but the function appears in my dropdown.

Did you check the settings for the Prometheus data source in Grafana? I still think that changing the scrape interval to 1m (or 60s if it only takes seconds) like in the screenshot above would help with the "No data" issue.

lcmgh commented 11 months ago

Our Prometheus uses a scrape interval of 60s.