Closed bitsofdave closed 3 years ago
I also got this error:
E1015 02:29:50.039995 1 main.go:215] Could not build time series for component spin-clouddriver: text format parsing error in line 1038: second TYPE line for metric name "kubernetes_api_seconds", or TYPE reported after samples
I checked the endpoint for the same kind of duplication and found it there too (kubernetes_api_seconds x2, kubernetes_api_seconds_max x3).
$ curl localhost:7002/aop-prometheus | grep TYPE | grep kubernetes_api_seconds
# TYPE kubernetes_api_seconds summary
# TYPE kubernetes_api_seconds_max gauge
# TYPE kubernetes_api_seconds summary
# TYPE kubernetes_api_seconds_max gauge
# TYPE kubernetes_api_seconds summary
# TYPE kubernetes_api_seconds_max gauge
Armory Observability Plugin version: v1.0.0
Spinnaker version: 1.22.1
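The grep check above can also be scripted. This is a minimal sketch, assuming the scraped endpoint body has already been saved into a string; the function name `duplicate_type_lines` and the sample text are made up for illustration and are not part of the plugin or of prometheus-to-sd.

```python
import re
from collections import Counter
from typing import Dict

# Matches lines such as "# TYPE kubernetes_api_seconds summary".
TYPE_LINE = re.compile(r"^# TYPE (\S+) (\S+)\s*$")


def duplicate_type_lines(exposition: str) -> Dict[str, int]:
    """Return {metric_name: count} for names with more than one TYPE line."""
    counts = Counter()
    for line in exposition.splitlines():
        match = TYPE_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical sample, shaped like the curl output in this thread.
sample = """\
# TYPE kubernetes_api_seconds summary
# TYPE kubernetes_api_seconds_max gauge
# TYPE kubernetes_api_seconds summary
# TYPE kubernetes_api_seconds_max gauge
# TYPE jvm_memory_used gauge
"""

print(duplicate_type_lines(sample))
```

Any metric name that shows up in the result is one a strict text-format parser is likely to reject.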
FYI: you can publish natively to NR, bypassing the need for nri-prometheus. We ended up doing that because nri-prometheus couldn't read OpenMetrics (the version 2 format, I think). There are a number of long threads about this on the NR side.
On the multiple lines... yeah SOUNDS like a bug off hand... have to dig to confirm.
Sorry, I don't use nri-prometheus. I'm trying to use prometheus-to-sd (https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/prometheus-to-sd) and it shows the above error logs as "text format parsing error".
Yeah need to check but PRETTY sure there's a duplicate handling bug... thanks for the report, have an idea where this likely is...
I wonder if this is somehow related to the NaN value for Orca JVM memory metrics. Other services report correct values for the same metric name.
# HELP jvm_memory_used
# TYPE jvm_memory_used gauge
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="Metaspace",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="G1 Survivor Space",lib="aop",libVer="v1.1.3",memtype="HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="Compressed Class Space",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="CodeHeap 'non-profiled nmethods'",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="G1 Eden Space",lib="aop",libVer="v1.1.3",memtype="HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="CodeHeap 'non-nmethods'",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="CodeHeap 'profiled nmethods'",lib="aop",libVer="v1.1.3",memtype="NON_HEAP",spinSvc="orca",version="1.0.0",} NaN
jvm_memory_used{hostname="orca-795df68cb8-5jvjs",id="G1 Old Gen",lib="aop",libVer="v1.1.3",memtype="HEAP",spinSvc="orca",version="1.0.0",} NaN
stage_invocations_duration_* metrics are also duplicated in Orca.
$ curl localhost:8083/aop-prometheus | grep TYPE | grep stage | sort
# TYPE stage_invocations_duration_seconds_max gauge
# TYPE stage_invocations_duration_seconds_max gauge
# TYPE stage_invocations_duration_seconds summary
# TYPE stage_invocations_duration_seconds summary
# TYPE stage_invocations_duration_total counter
# TYPE stage_invocations_duration_total counter
# TYPE stage_invocations_total counter
# TYPE stage_invocations_total counter
Some task_completions_duration_* metrics have a WithType variant and some do not.
$ curl localhost:8083/aop-prometheus | grep TYPE | grep task | sort
# TYPE orca_task_result_total counter
# TYPE task_completions_duration_seconds_max gauge
# TYPE task_completions_duration_seconds summary
# TYPE task_completions_duration_withType_seconds_max gauge
# TYPE task_completions_duration_withType_seconds summary
# TYPE task_invocations_duration_seconds_max gauge
# TYPE task_invocations_duration_seconds summary
# TYPE task_invocations_duration_withType_seconds_max gauge
# TYPE task_invocations_duration_withType_seconds summary
Shouldn't stage_invocations_duration_* also have metrics with a WithType?
Distinctly possible - created a separate ticket on the NaN values - but I'm not seeing these in NewRelic. Will do some digging...
JUST to confirm the issue:
The REAL issue is that you CANNOT have multiple TYPE/HELP definitions for the same metric name. So
# TYPE stage_invocations_duration_total counter
# TYPE stage_invocations_duration_total counter
Is illegal. However
stage_invocations_duration_total{bob="uncle"} 0
stage_invocations_duration_total{bob="your"} 0
Is perfectly legal. My guess is that it's the translation of types on the rule. I've been swamped, so I haven't had a chance to debug.
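To make the legal/illegal distinction concrete, here is a rough sketch of the two conditions named in the parser error (a second TYPE line for a name, or a TYPE line after samples for that name). It is not the real parser: `check_type_lines` is a hypothetical helper, and it matches sample names literally, so it does not map suffixed samples such as `_sum` or `_count` back to their summary.

```python
from typing import List


def check_type_lines(exposition: str) -> List[str]:
    """Return a list of problems; an empty list means both rules hold."""
    typed = set()    # metric names that already have a TYPE line
    sampled = set()  # metric names that already have at least one sample
    problems = []
    for number, raw in enumerate(exposition.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# TYPE "):
            name = line.split()[2]
            if name in typed:
                problems.append(f"line {number}: second TYPE line for {name}")
            elif name in sampled:
                problems.append(f"line {number}: TYPE for {name} reported after samples")
            typed.add(name)
        elif not line.startswith("#"):
            # Sample line: the metric name ends at the first '{' or space.
            name = line.split("{", 1)[0].split(" ", 1)[0]
            sampled.add(name)
    return problems


legal = (
    "# TYPE stage_invocations_duration_total counter\n"
    'stage_invocations_duration_total{bob="uncle"} 0\n'
    'stage_invocations_duration_total{bob="your"} 0\n'
)
illegal = (
    "# TYPE stage_invocations_duration_total counter\n"
    "# TYPE stage_invocations_duration_total counter\n"
)

print(check_type_lines(legal))
print(check_type_lines(illegal))
```

The first call reports nothing because several samples with different labels under one TYPE line are fine; the second reports the repeated TYPE line.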
This issue still exists. Spinnaker version: 1.26.4, nri-prometheus: 2.7.0
FYI: We have finally started coming back to this and have a pretty good idea of what's up. Basically... the way this operates is a TOUCH tricky, since SOME registries allow their labels to change. What we're seeing is that in one registry (e.g. the Spectator one) there's a metric, say "memory", and in another registry (e.g. the default Micrometer one for Spring) there's a metric with the same name but DIFFERENT labels. This causes the duplicate types and errors we're seeing. At least that's the team's theory at the moment ;). The trick is fixing it now and working out what the right solution is.
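One possible direction, purely as an illustration of the theory above and not the plugin's actual code (the plugin is Java; this is a Python sketch): collect samples from every registry into one family per metric name before rendering, so each name gets a single TYPE line however its labels differ. The `Sample` tuple, `render_merged`, and the two example registries are all hypothetical, and label values are not escaped.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical data model: (metric name, metric type, labels, value).
Sample = Tuple[str, str, Dict[str, str], float]


def render_merged(registries: Iterable[List[Sample]]) -> str:
    """Render samples from several registries with one TYPE line per name."""
    families = {}  # name -> (type, [rendered sample lines])
    for registry in registries:
        for name, kind, labels, value in registry:
            if name not in families:
                families[name] = (kind, [])
            elif families[name][0] != kind:
                # Same name with conflicting types cannot be merged safely.
                raise ValueError(f"conflicting types for {name}")
            label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            families[name][1].append(f"{name}{{{label_text}}} {value}")
    lines = []
    for name, (kind, samples) in families.items():
        lines.append(f"# TYPE {name} {kind}")
        lines.extend(samples)
    return "\n".join(lines) + "\n"


# Two made-up registries exposing "memory" with different label sets.
spectator_like = [("memory", "gauge", {"id": "heap"}, 1.0)]
micrometer_like = [("memory", "gauge", {"area": "heap", "id": "G1 Eden Space"}, 2.0)]

print(render_merged([spectator_like, micrometer_like]))
```

Whether merging like this is the right fix, versus renaming or dropping one of the colliding meters, is exactly the open question in the comment above.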
Armory Observability Plugin version: v1.1.1-RC2
Spinnaker version: 1.22.1
I'm using nri-prometheus, New Relic's OpenMetrics Prometheus integration, to scrape Prometheus metrics from endpoints.
The integration is unable to parse metrics from the orca endpoint, due to this error:
Manually inspecting the metrics endpoint confirms that stage_invocations_total is defined multiple times, possibly once per application. This appears to start happening when you trigger a pipeline on 2 different applications.
I also confirmed this issue exists in plugin version v1.0.0.