SigNoz / signoz

SigNoz is an open-source, OpenTelemetry-native observability platform with logs, traces and metrics in a single application. An open-source alternative to DataDog, New Relic, etc.
https://signoz.io

Can't find counter metrics in dashboard #1707

Closed nappa85 closed 1 year ago

nappa85 commented 2 years ago

I'm testing SigNoz 0.11.2 using docker-compose clickhouse environment (minus hotrod containers), along with other products, firing metrics to a single otel-collector instance with different exporters.

In my Rust code I've got those metrics:

// Assumed imports for opentelemetry 0.17 and once_cell:
use once_cell::sync::Lazy;
use opentelemetry::{metrics::{Counter, Unit, ValueRecorder}, KeyValue};

static HTTP_COUNTER: Lazy<Counter<u64>> = Lazy::new(|| {
    get_meter() // retrieve common meter
        .u64_counter("http.hits")
        .with_description("Request hit counter")
        .with_unit(Unit::new("r"))
        .init()
});
static HTTP_REQ_HISTOGRAM: Lazy<ValueRecorder<f64>> = Lazy::new(|| {
    get_meter() // retrieve common meter
        .f64_value_recorder("service.duration")
        .with_description("Service request latencies")
        .with_unit(Unit::new("s"))
        .init()
});
static MYSQL_QUERY_HISTOGRAM: Lazy<ValueRecorder<f64>> = Lazy::new(|| {
    get_meter() // retrieve common meter
        .f64_value_recorder("mysql.duration")
        .with_description("MySQL query latencies")
        .with_unit(Unit::new("s"))
        .init()
});

fn metric_web(info: warp::log::Info<'_>) {
    debug!("served {} {} {:?}", info.method(), info.path(), info.status());

    let attributes = &[
        KeyValue::new("method", info.method().to_string()),
        KeyValue::new("path", info.path().to_owned()),
        KeyValue::new("status", info.status().as_u16() as i64),
        KeyValue::new("failed", !info.status().is_success()),
    ];

    HTTP_COUNTER.add(1, attributes);
    HTTP_REQ_HISTOGRAM.record(info.elapsed().as_secs_f64(), attributes);
}

fn metric_mysql(info: &sea_orm::metric::Info<'_>, pool: &'static str) {
    debug!(
        "mysql query{} on {} took {}s: {}",
        if info.failed { " failed" } else { "" },
        pool,
        info.elapsed.as_secs_f64(),
        info.statement.sql
    );

    MYSQL_QUERY_HISTOGRAM.record(
        info.elapsed.as_secs_f64(),
        &[
            KeyValue::new("query", info.statement.sql.clone()),
            KeyValue::new("pool", pool),
            KeyValue::new("failed", info.failed),
        ],
    );
}

On SigNoz I can find the mysql.duration (shown as mysql_duration) and service.duration (shown as service_duration) metrics, but I can't find http.hits, while other products show all of them.

Is there some kind of filter on metric names? Am I doing something wrong?

welcome[bot] commented 2 years ago

Thanks for opening this issue. A team member should give feedback soon. In the meantime, feel free to check out the contributing guidelines.

pranay01 commented 2 years ago

@srikanthccv Would you have any insights on this?

srikanthccv commented 2 years ago

@nappa85 Please provide a fully reproducible example. I tried a basic OTLP example with an HTTP counter and could see the metric name http_hits.

(screenshot: http_hits metric visible in SigNoz, 2022-11-12)
nappa85 commented 1 year ago

You can find an MVP here: https://github.com/nappa85/otlp-test/ Start the service, then make requests on port 3000 to populate the metrics. (screenshots: SigNoz, Uptrace)

ankitnayan commented 1 year ago

@nappa85 the 2nd image is not from SigNoz's UI? How are you visualizing the data?

nappa85 commented 1 year ago

> @nappa85 the 2nd image is not from SigNoz's UI? How are you visualizing the data?

The second image is from another product ingesting the very same OTLP data (I have an otel-collector exporting to different products for comparison).

srikanthccv commented 1 year ago

Ah, I see you are using delta temporality in the code. We currently support cumulative temporality to remain compatible with Prometheus, its query language, and its ecosystem. So if possible, I would suggest you use cumulative (which is also the default in OTel). Otherwise, there would need to be an intermediate step converting delta points to cumulative, which makes the pipeline stateful and difficult to scale horizontally.
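To illustrate the point about statefulness, here is a minimal plain-Rust sketch (no SDK types, names are illustrative) of what that intermediate delta-to-cumulative step would have to do: keep a running total per series, which means every point for a series must reach the same converter instance.

```rust
use std::collections::HashMap;

/// Converts delta counter points into cumulative ones by keeping a
/// running total per series key. This per-series state is what makes
/// horizontal scaling of such a converter difficult.
struct DeltaToCumulative {
    totals: HashMap<String, u64>,
}

impl DeltaToCumulative {
    fn new() -> Self {
        Self { totals: HashMap::new() }
    }

    /// `series` identifies the metric + attribute set; `delta` is the
    /// increase reported since the previous export.
    fn push(&mut self, series: &str, delta: u64) -> u64 {
        let total = self.totals.entry(series.to_string()).or_insert(0);
        *total += delta;
        *total // the cumulative value a Prometheus-style backend expects
    }
}

fn main() {
    let mut conv = DeltaToCumulative::new();
    // Three delta exports of 1 hit each become a monotonic series 1, 2, 3.
    let points: Vec<u64> = (0..3).map(|_| conv.push("http.hits{path=/a}", 1)).collect();
    assert_eq!(points, vec![1, 2, 3]);
    println!("{:?}", points);
}
```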

nappa85 commented 1 year ago

I ended up using delta after this issue: https://github.com/open-telemetry/opentelemetry-rust/issues/677. I'll try other configs and let you know.

nappa85 commented 1 year ago

I went with the default values (aggregator selector exact and export kind stateless), and I see the same behavior. (screenshots)

nappa85 commented 1 year ago

Adding some info: with export_kind cumulative it works, but the values accumulate over time. In this dashboard we can see http.hits going up over time, even though the request rate was constant. At the halfway point of the chart I switched back to export_kind stateless and the value disappeared. (screenshot)

srikanthccv commented 1 year ago

> In this dashboard we can see http.hits going up in time, but the request rate was constant

Is it constant? Or is it just not big enough to tell the difference in the same chart because the y-axis is skewed by the noop line?

> At half chart I switched back to export_kind stateless and the value disappeared

What is export kind stateless?

nappa85 commented 1 year ago

> Is it constant? or is it not big enough to tell the difference in the same chart because y axis is skewed by noop line?

I'm sending a request every 2 seconds, so it's constant.

> What is export kind stateless?

It's an opentelemetry crate concept; I thought it was standard outside Rust too.

nappa85 commented 1 year ago

Just to be fully transparent: I've made an MVP with opentelemetry 0.18.0, the latest version. I wasn't using it before because it's a big breaking change from 0.17.0 and it lacks documentation about metrics. You can find the code here: https://github.com/nappa85/otlp-test/tree/v0.18.0

It fires every possible metric type ((u64/f64) x ((observable_)counter | histogram | observable_gauge | (observable_)up_down_counter)), and aggregates each one in 3 different ways: sum, last_value and histogram. In total 30 different metrics are sent: all the integer metrics send a constant value of 1, and all the float metrics send the same random value between 0 and 1 on every call. The export kind is stateless, the default, so I'm no longer using the delta temporality that you don't support (it was a limitation of 0.17.0).

I've configured SigNoz and the other product with a graph for every metric, grouping the 3 different aggregations. Data is generated by calling my program with curl under watch, which makes a call every 2 seconds.

watch curl 127.0.0.1:3000/a

In SigNoz, histogram metrics are split into 3 entries: _bucket, _count and _sum, so there are 5 lines per graph. All metrics in SigNoz appear to grow monotonically, even though I specified NOOP for all values, except for the observable_up_down_counter, where I really can't say what's going on. (screenshots)
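The three series are the standard Prometheus/OpenMetrics decomposition of a cumulative histogram: per-bound _bucket counts, a total _count, and a running _sum. All three are themselves cumulative counters, which is why they grow monotonically even when the recorded values stay between 0 and 1. A minimal sketch of that decomposition (bucket bounds are illustrative):

```rust
/// Cumulative histogram in the Prometheus/OpenMetrics style: each
/// recorded value bumps every bucket whose upper bound covers it,
/// plus the total count and sum. All fields only ever increase.
struct CumulativeHistogram {
    bounds: Vec<f64>,  // upper bounds, e.g. 0.25, 0.5, 1.0
    buckets: Vec<u64>, // exported as name_bucket{le="..."}
    count: u64,        // exported as name_count
    sum: f64,          // exported as name_sum
}

impl CumulativeHistogram {
    fn new(bounds: Vec<f64>) -> Self {
        let n = bounds.len();
        Self { bounds, buckets: vec![0; n], count: 0, sum: 0.0 }
    }

    fn record(&mut self, v: f64) {
        for (i, b) in self.bounds.iter().enumerate() {
            if v <= *b {
                self.buckets[i] += 1;
            }
        }
        self.count += 1;
        self.sum += v;
    }
}

fn main() {
    let mut h = CumulativeHistogram::new(vec![0.25, 0.5, 1.0]);
    for v in [0.1, 0.4, 0.9] {
        h.record(v);
    }
    // Bucket counts are cumulative across bounds:
    // le=0.25 -> 1, le=0.5 -> 2, le=1.0 -> 3.
    assert_eq!(h.buckets, vec![1, 2, 3]);
    assert_eq!(h.count, 3);
    println!("buckets={:?} count={} sum={}", h.buckets, h.count, h.sum);
}
```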

With the exact same data, the other product gives the expected output. (screenshots)

I don't know if I'm doing something wrong with SigNoz, let me know

nappa85 commented 1 year ago

Just to be clear: someone pointed out that it may sound like I'm saying SigNoz doesn't work. No, I'm just pointing out some things that are unclear or unfriendly. I wrote this report at the end of my work day and I was quite exhausted, so maybe I didn't choose the best words... As a developer, I think this kind of feedback is really helpful.

srikanthccv commented 1 year ago

@nappa85 I am not sure I fully understand what you are trying to convey. We appreciate any and all feedback. Our underlying data model closely follows the OpenMetrics (~Prometheus) exposition format, so just using NOOP gives you the raw data, which is not very helpful. You need to choose the aggregate operator based on the metric type (usually RATE or SUM_RATE for counters, combined with histogram_quantile for histogram types, etc.). Please go through these docs https://signoz.io/docs/userguide/create-a-custom-query/ and try to use the appropriate operator to plot the graphs. You also mentioned some Rust-SDK-specific things I am not fully aware of. If you are unsure about something and need help, please join our Slack channel and ask questions: https://signoz.io/docs/community/#slack
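For intuition on why RATE is the right operator for counters: a RATE-style operator essentially differentiates the cumulative series, plotting the per-second increase between samples, which recovers a constant request rate from a monotonically growing counter. A rough sketch of the idea (not SigNoz's actual implementation):

```rust
/// Per-second rate between consecutive cumulative counter samples,
/// given as (timestamp_seconds, cumulative_value) pairs. A drop in
/// value is treated as a counter reset, as Prometheus-style rate
/// functions do.
fn rate(samples: &[(f64, f64)]) -> Vec<f64> {
    samples
        .windows(2)
        .map(|w| {
            let (t0, v0) = w[0];
            let (t1, v1) = w[1];
            // On reset (v1 < v0), the counter restarted from zero,
            // so the whole new value counts as the increase.
            let increase = if v1 >= v0 { v1 - v0 } else { v1 };
            increase / (t1 - t0)
        })
        .collect()
}

fn main() {
    // One request every 2 seconds: the cumulative counter reads 1, 2, 3, 4.
    let samples = [(0.0, 1.0), (2.0, 2.0), (4.0, 3.0), (6.0, 4.0)];
    let r = rate(&samples);
    // The growing counter flattens back into a constant 0.5 req/s.
    assert!(r.iter().all(|&x| (x - 0.5).abs() < 1e-9));
    println!("{:?}", r);
}
```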

nappa85 commented 1 year ago

An example of what I'm trying to express:

From my application I produce this histogram; it's variable data (response timings), and I need to see something like the average time, not only the last value in the time frame. I'm producing a random float value between 0 and 1, so with a uniform distribution the average will be near 0.5, and I'm sending the exact same data to both products.

With the other product, I simply put the histogram data in the graph; it forces me to use an aggregation and defaults to p50, and the result is already acceptable. (screenshot)

With SigNoz I find 3 different possible metrics: histogram_bucket, histogram_count and histogram_sum. I'm discarding count and sum because they aren't what I want. NOOP is obviously wrong, but plotting it is already a warning sign: why are all the values growing like a count? Even with P50 I get a growing graph. With RATE I get all the single values distinct, but strangely they are all under 0.5, while looking at histogram_last I see many values over 0.5. Maybe I'm using the wrong aggregator? SUM_RATE goes too high, RATE_SUM even higher, and RATE_AVG is suspiciously always under 0.5. (screenshots)
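For what it's worth, a Prometheus-style P50 over _bucket data works roughly like this: take the bucket counts for the window, find the bucket that contains the target rank, and linearly interpolate inside it. A simplified sketch over a single window's cumulative bucket counts (bounds illustrative):

```rust
/// Prometheus-style quantile estimate from cumulative bucket counts
/// within one time window. `buckets` are (upper_bound, cumulative_count)
/// pairs sorted by bound, with the last bound standing in for +Inf.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> f64 {
    let total = buckets.last().unwrap().1;
    let rank = q * total;
    let mut prev_bound = 0.0;
    let mut prev_count = 0.0;
    for &(bound, count) in buckets {
        if count >= rank {
            // Linear interpolation inside the bucket that crosses the rank.
            let fraction = (rank - prev_count) / (count - prev_count);
            return prev_bound + (bound - prev_bound) * fraction;
        }
        prev_bound = bound;
        prev_count = count;
    }
    buckets.last().unwrap().0
}

fn main() {
    // Uniform values in [0, 1] observed in buckets 0.25 / 0.5 / 0.75 / 1.0.
    let buckets = [(0.25, 25.0), (0.5, 50.0), (0.75, 75.0), (1.0, 100.0)];
    let p50 = histogram_quantile(0.5, &buckets);
    // With a uniform distribution the estimated median lands at 0.5.
    assert!((p50 - 0.5).abs() < 1e-9);
    println!("p50 = {p50}");
}
```

Note the result is an estimate bounded by the bucket layout, which is one reason a quantile over buckets can differ from the raw last values seen in histogram_last.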

I can accept that I'm using SigNoz the wrong way; I've only been testing it for a few days and I'm no expert in metrics, but it doesn't seem user friendly.

srikanthccv commented 1 year ago

I think I understand the challenge you are facing, especially since you don't have prior experience with Prometheus or PromQL. You were expecting the aggregation to be applied directly based on the metric type, instead of having to figure out what to do with the underlying raw data.

nappa85 commented 1 year ago

> I think I understand the challenge you are facing, especially since you don't have prior experience with Prometheus or PromQL. You were expecting the aggregation to be applied directly based on the metric type, instead of having to figure out what to do with the underlying raw data.

Can you elaborate more? Given that the data is the same, how can I achieve with SigNoz the same result I get with the other product?