armory-plugins / armory-observability-plugin

Spinnaker plugin for enabling, configuring, and customizing observability features.
Apache License 2.0

text format parsing error in line 289: second TYPE line for metric name "stage_invocations_total" #26

Closed bitsofdave closed 3 years ago

bitsofdave commented 4 years ago

Armory Observability Plugin version: v1.1.1-RC2
Spinnaker version: 1.22.1

I'm using nri-prometheus, New Relic's Prometheus OpenMetrics integration, to scrape Prometheus metrics from endpoints.

The integration is unable to parse metrics from the Orca endpoint because of this error:

text format parsing error in line 289: second TYPE line for metric name "stage_invocations_total", or TYPE reported after samples

Manually inspecting the metrics endpoint confirms that stage_invocations_total is defined multiple times, possibly once per application.

This appears to start happening when you trigger a pipeline on 2 different applications.

I also confirmed this issue exists in plugin version v1.0.0.

Mahito commented 4 years ago

I also got this error:

E1015 02:29:50.039995       1 main.go:215] Could not build time series for component spin-clouddriver: text format parsing error in line 1038: second TYPE line for metric name "kubernetes_api_seconds", or TYPE reported after samples

I checked the response and found the same kind of duplication (kubernetes_api_seconds x2, kubernetes_api_seconds_max x3).

$ curl localhost:7002/aop-prometheus | grep TYPE | grep kubernetes_api_seconds
# TYPE kubernetes_api_seconds summary
# TYPE kubernetes_api_seconds_max gauge
# TYPE kubernetes_api_seconds summary
# TYPE kubernetes_api_seconds_max gauge
# TYPE kubernetes_api_seconds summary
# TYPE kubernetes_api_seconds_max gauge

Armory Observability Plugin version: v1.0.0
Spinnaker version: 1.22.1

jasonmcintosh commented 4 years ago

FYI: you can publish natively to New Relic, bypassing the need for nri-prometheus. We ended up doing that because nri-prometheus couldn't read OpenMetrics (version 2 format, I think). There are a number of long threads about this on the NR side.

On the multiple lines... yeah SOUNDS like a bug off hand... have to dig to confirm.

Mahito commented 4 years ago

Sorry, I don't use nri-prometheus. I'm trying to use prometheus-to-sd (https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/prometheus-to-sd) and it shows the above error logs as "text format parsing error".

jasonmcintosh commented 4 years ago

Yeah need to check but PRETTY sure there's a duplicate handling bug... thanks for the report, have an idea where this likely is...

karlskewes commented 4 years ago

I wonder if this is somehow related to the NaN values for the Orca JVM memory metrics. Other services report correct values for the same metric name.

# HELP jvm_memory_used
# TYPE jvm_memory_used gauge
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="Metaspace",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="G1 Survivor Space",lib="aop",libVer="v1.1.3",memtype="HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="Compressed Class Space",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="CodeHeap 'non-profiled nmethods'",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="G1 Eden Space",lib="aop",libVer="v1.1.3",memtype="HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="CodeHeap 'non-nmethods'",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="CodeHeap 'profiled nmethods'",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="G1 Old Gen",lib="aop",libVer="v1.1.3",memtype="HEAP",spinSvc="orca",version="1.0.0",} NaN

Mahito commented 4 years ago

The stage_invocations_duration_* metrics are also duplicated in Orca.

$ curl localhost:8083/aop-prometheus | grep TYPE | grep stage | sort
# TYPE stage_invocations_duration_seconds_max gauge
# TYPE stage_invocations_duration_seconds_max gauge
# TYPE stage_invocations_duration_seconds summary
# TYPE stage_invocations_duration_seconds summary
# TYPE stage_invocations_duration_total counter
# TYPE stage_invocations_duration_total counter
# TYPE stage_invocations_total counter
# TYPE stage_invocations_total counter

Some of the task_*_duration_* metrics have a withType variant and some do not.

$ curl localhost:8083/aop-prometheus | grep TYPE | grep task | sort
# TYPE orca_task_result_total counter
# TYPE task_completions_duration_seconds_max gauge
# TYPE task_completions_duration_seconds summary
# TYPE task_completions_duration_withType_seconds_max gauge
# TYPE task_completions_duration_withType_seconds summary
# TYPE task_invocations_duration_seconds_max gauge
# TYPE task_invocations_duration_seconds summary
# TYPE task_invocations_duration_withType_seconds_max gauge
# TYPE task_invocations_duration_withType_seconds summary

Shouldn't stage_invocations_duration_* also have withType variants?

jasonmcintosh commented 4 years ago

Distinctly possible - created a separate ticket on the NaN values - but I'm not seeing these in NewRelic. Will do some digging...

jasonmcintosh commented 4 years ago

JUST to confirm the issue:

The REAL issue is that you CANNOT have multiple TYPE/HELP definitions for the same metric name. So

# TYPE stage_invocations_duration_total counter
# TYPE stage_invocations_duration_total counter

is illegal. However,

stage_invocations_duration_total{bob="uncle"} 0
stage_invocations_duration_total{bob="your"} 0

is perfectly legal. My guess is that the problem is in the translation of types on the rule. I've been swamped, so I haven't had a chance to debug.
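
To check an endpoint for this condition, something along these lines may help. It is only a minimal sketch, not part of the plugin: the function name duplicate_type_names and the SAMPLE text are made up for illustration. It counts TYPE lines per metric name and reports any name declared more than once, while leaving repeated sample lines with different labels alone.

```python
import re
from collections import Counter
from typing import Dict

# Matches lines such as "# TYPE stage_invocations_total counter".
TYPE_LINE = re.compile(r"^#\s*TYPE\s+(\S+)\s+(\S+)\s*$")


def duplicate_type_names(exposition: str) -> Dict[str, int]:
    """Return metric names whose TYPE line appears more than once.

    Hypothetical helper for illustration; it only looks at TYPE lines,
    so several samples that share a name but differ in labels are fine.
    """
    counts = Counter()
    for line in exposition.splitlines():
        match = TYPE_LINE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Made-up exposition text: the first metric is declared twice (illegal),
# the second is declared once with two label sets (legal).
SAMPLE = """\
# TYPE stage_invocations_duration_total counter
stage_invocations_duration_total{bob="uncle"} 0
# TYPE stage_invocations_duration_total counter
stage_invocations_duration_total{bob="your"} 0
# TYPE stage_invocations_total counter
stage_invocations_total{bob="uncle"} 0
stage_invocations_total{bob="your"} 0
"""

if __name__ == "__main__":
    print(duplicate_type_names(SAMPLE))
    # prints {'stage_invocations_duration_total': 2}
```

The same function could be fed the saved output of the curl commands earlier in this thread instead of SAMPLE.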

gdziwoki commented 3 years ago

This issue still exists.
Spinnaker version: 1.26.4
nri-prometheus version: 2.7.0

jasonmcintosh commented 3 years ago

FYI: We have finally started coming back to this and have a pretty good idea of what's up. Basically, the way this operates is a touch tricky, since SOME registries allow their labels to change. What we're seeing is that in one registry (e.g. the Spectator one) there's a metric, say "memory", and in another meter registry (e.g. the default Micrometer one for Spring) there's a metric with the same name but DIFFERENT labels. This causes the duplicate TYPE lines and the errors we're seeing. At least that's the team's theory at the moment ;). The trick now is fixing it and deciding on the right solution.
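
As an illustration of that theory only, here is a rough sketch of one way the output could be grouped so that each metric name gets a single TYPE line. This is not the plugin's code and not a decided fix: render_merged, the tuple layout, and the "registry" label values are all assumptions made up for the example.

```python
from collections import OrderedDict
from typing import Iterable, List, Tuple

# (metric name, metric type, sample lines); a hypothetical shape, not a real API.
Family = Tuple[str, str, List[str]]


def render_merged(families: Iterable[Family]) -> str:
    """Group samples by metric name and emit one TYPE line per name."""
    merged = OrderedDict()
    for name, metric_type, samples in families:
        if name not in merged:
            merged[name] = (metric_type, [])
        existing_type, existing_samples = merged[name]
        if existing_type != metric_type:
            # Same name with conflicting types cannot be merged safely.
            raise ValueError(
                f"{name}: conflicting types {existing_type} and {metric_type}"
            )
        existing_samples.extend(samples)

    lines = []
    for name, (metric_type, samples) in merged.items():
        lines.append(f"# TYPE {name} {metric_type}")
        lines.extend(samples)
    return "\n".join(lines) + "\n"


# Made-up input: the same metric name arriving from two registries
# with different label sets.
EXAMPLE_FAMILIES = [
    ("memory", "gauge", ['memory{registry="spectator"} 1']),
    ("memory", "gauge", ['memory{registry="micrometer"} 2']),
]

if __name__ == "__main__":
    print(render_merged(EXAMPLE_FAMILIES), end="")
```

Raising on a type conflict is a deliberate choice in this sketch: silently picking one type would hide the mismatch between registries rather than surface it.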