Closed: sophieyfang closed this issue 2 years ago.
One way to fix this is to use the following operation, so that all values are doubles for consistency and to avoid conflicts:
operations:
- action: toggle_scalar_data_type
More info on this toggle_scalar_data_type operation in the metricstransform processor: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#toggle-datatype
@dehaansa @Mrod1598, likely there is an issue in the JMX Metric Gatherer that applies more broadly than just this one case. Will one of you please take this issue?
Yeah, we found a few other metrics that are being ingested as DOUBLE when they're supposed to be INT.
Am I understanding correctly that this has only been observed when the first data point is 0.00?
The problem is with metricstransform/jvm__pipeline_jvm__metrics_1
sophieyfang_google_com@debian10-jvm:~$ cat otel.yaml
exporters:
  googlecloud:
    metric:
      prefix: ""
    user_agent: Google-Cloud-Ops-Agent-Metrics/latest (BuildDistro=build_distro;Platform=linux;ShortName=linux_platform;ShortVersion=linux_platform_version)
  logging:
    loglevel: debug
processors:
  filter/jvm:
    metrics:
      include:
        match_type: strict
        metric_names:
        - jvm.memory.pool.used
  resourcedetection/_global_0:
    detectors:
    - gce
  metricstransform/jvm__pipeline_jvm__metrics_1:
    transforms:
    - action: update
      include: ^(.*)$$
      match_type: regexp
      new_name: workload.googleapis.com/$${1}
  normalizesums/jvm__pipeline_jvm__metrics_0: {}
receivers:
  jmx/jvm__pipeline_jvm__metrics:
    collection_interval: 30s
    endpoint: service:jmx:rmi:///jndi/rmi://127.0.0.1:9010/jmxrmi
    jar_path: /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/opentelemetry-java-contrib-jmx-metrics.jar
    target_system: jvm
service:
  pipelines:
    metrics/jvm__pipeline_jvm__metrics:
      exporters:
      - googlecloud
      - logging
      processors:
      - filter/jvm
      - metricstransform/jvm__pipeline_jvm__metrics_1 # with this line: 0.0000, without this line: no 0.00000
      receivers:
      - jmx/jvm__pipeline_jvm__metrics
One way to fix this is to use the following operation, so that all values are doubles for consistency and to avoid conflicts:
operations:
- action: toggle_scalar_data_type
More info on this toggle_scalar_data_type operation in the metricstransform processor: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#toggle-datatype
This method won't help, as it toggles int to double and double to int, which results in:
2021-12-20T16:53:25.510Z DEBUG loggingexporter/logging_exporter.go:66 ResourceMetrics #0
Resource labels:
-> service.name: STRING(unknown_service:java)
-> telemetry.sdk.version: STRING(1.7.0)
-> telemetry.sdk.language: STRING(java)
-> telemetry.sdk.name: STRING(opentelemetry)
InstrumentationLibraryMetrics #0
InstrumentationLibrary
Metric #0
Descriptor:
-> Name: workload.googleapis.com/jvm.memory.pool.used
-> Description: current memory pool usage
-> Unit: by
-> DataType: Gauge
NumberDataPoints #0
Data point attributes:
-> name: STRING(Eden Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #1
Data point attributes:
-> name: STRING(CodeHeap 'profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #2
Data point attributes:
-> name: STRING(Metaspace)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #3
Data point attributes:
-> name: STRING(Survivor Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #4
Data point attributes:
-> name: STRING(Tenured Gen)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #5
Data point attributes:
-> name: STRING(CodeHeap 'non-nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #6
Data point attributes:
-> name: STRING(CodeHeap 'non-profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #7
Data point attributes:
-> name: STRING(Compressed Class Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
2021-12-20T16:53:30.081Z INFO loggingexporter/logging_exporter.go:56 MetricsExporter {"#metrics": 1}
2021-12-20T16:53:30.081Z DEBUG loggingexporter/logging_exporter.go:66 ResourceMetrics #0
Resource labels:
-> service.name: STRING(unknown_service:java)
-> telemetry.sdk.version: STRING(1.7.0)
-> telemetry.sdk.language: STRING(java)
-> telemetry.sdk.name: STRING(opentelemetry)
InstrumentationLibraryMetrics #0
InstrumentationLibrary
Metric #0
Descriptor:
-> Name: workload.googleapis.com/jvm.memory.pool.used
-> Description: current memory pool usage
-> Unit: by
-> DataType: Gauge
NumberDataPoints #0
Data point attributes:
-> name: STRING(Eden Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 515480.000000
NumberDataPoints #1
Data point attributes:
-> name: STRING(CodeHeap 'profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 2459520.000000
NumberDataPoints #2
Data point attributes:
-> name: STRING(Metaspace)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 6651576.000000
NumberDataPoints #3
Data point attributes:
-> name: STRING(Survivor Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 1966072.000000
NumberDataPoints #4
Data point attributes:
-> name: STRING(Tenured Gen)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 519896.000000
NumberDataPoints #5
Data point attributes:
-> name: STRING(CodeHeap 'non-nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 1213312.000000
NumberDataPoints #6
Data point attributes:
-> name: STRING(CodeHeap 'non-profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 539136.000000
NumberDataPoints #7
Data point attributes:
-> name: STRING(Compressed Class Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 766512.000000
Sorry for assigning the task to @quentinmit; for one second I thought this was the agentmetricsprocessor, which we own, but it's actually the metricstransform processor.
Per discussion offline, @dehaansa is looking into this now.
OK, so I've been able to replicate the issue reliably using the above instructions. I found that the 1.7.0 version of the opentelemetry-java-contrib-jmx-metrics.jar (the one currently packaged with the Ops Agent) exhibits this issue. The latest version, 1.9.0, does not have this issue.
From what I can tell, for JMX metrics that are not currently available, 1.7.0 returns a nil value, which ends up translating to a none-type data point in the metric transformer (here) instead of an Int. So the transformer assumes that if it's a gauge metric and not all of its data points are ints, it must be a double, shown here.
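To make that inference concrete, here is a minimal, self-contained Go sketch. The ValueType enum, NumberDataPoint struct, and gaugeValueType function are simplified stand-ins invented for this example, not the collector's actual pdata types or the metricstransform processor's real code; it only shows how a single none-type point flips an otherwise all-int gauge to double.

package main

import "fmt"

// Simplified stand-in types for illustration only.
type ValueType int

const (
	ValueTypeNone ValueType = iota // no recorded value (what 1.7.0 produced)
	ValueTypeInt
	ValueTypeDouble
)

type NumberDataPoint struct {
	Type      ValueType
	IntVal    int64
	DoubleVal float64
}

// gaugeValueType mirrors the inference described above: the gauge is treated
// as Int only if every data point is an Int; any None or Double point makes
// the whole gauge Double.
func gaugeValueType(points []NumberDataPoint) ValueType {
	for _, p := range points {
		if p.Type != ValueTypeInt {
			return ValueTypeDouble
		}
	}
	return ValueTypeInt
}

func main() {
	// One unavailable (None) point is enough to flip the gauge to Double,
	// which is how the 0.000000 values in the logs above come about.
	points := []NumberDataPoint{
		{Type: ValueTypeNone},
		{Type: ValueTypeInt, IntVal: 515480},
	}
	fmt.Println(gaugeValueType(points) == ValueTypeDouble) // true
}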
In the 1.9.0 version it returns an int 0 for metrics that aren't available, so the gauge is reported as an int. I couldn't find an entry in the 1.7.0...1.9.0 changelog that might explain this. My guess is it was something in the upgrade from opentelemetry-java 1.7.0 to 1.9.0, but nothing stood out to me from that changelog either.
@sophieyfang I would recommend updating to v1.9.0 of the jmx agent. I could not reproduce this issue at all with that version and could reliably do so with 1.7.0.
The JMX component should be updated by https://github.com/GoogleCloudPlatform/ops-agent/pull/315, so we should be able to validate either in builds based on that PR or once it has been merged in.
From what I can tell, for JMX metrics that are not currently available, 1.7.0 returns a nil value, which ends up translating to a none-type data point in the metric transformer (here) instead of an Int. So the transformer assumes that if it's a gauge metric and not all of its data points are ints, it must be a double, shown here.
What does "not currently available" mean in this context? I don't understand how either none or 0 makes sense; if the metric doesn't exist, it doesn't exist. I'm worried that a 0 in a metric, whether it's a double or an int, could cause alerts to fire or graphs to be wrong.
Yeah, let me elaborate. The metric that caused this issue was related to garbage collection. When I connected jconsole to the JMX endpoint, the metric in question showed its value as UNAVAILABLE. Eventually a value did appear. My guess is that the metric doesn't have a value until a GC occurs.
If the metric doesn't exist, it doesn't exist. I'm worried that a 0 in a metric, whether it's a double or an int, could cause alerts to fire or graphs to be wrong.
I agree with this, and it's likely a larger issue with how the java-contrib jmx collector works. I haven't dug in fully, but from what I've observed the jmx collector will always report all the metrics it's supposed to collect, even if they don't have a value.
So I think there are two larger issues:
My proposal is to update the jmx jar now to mitigate this issue. I know it's not a perfect solution, but it's better than the current behavior. Also, we should file issues for the two points above. I'll take on writing up issues for those two items unless someone feels that's not the right course of action.
Just to follow up on the 0 vs nil for no data: looking at jconsole, it also reports 0 for this metric during the times that it has no value. I still agree we should figure out the no-data behavior, but 1.9.0 reporting a 0 in that instance is at least consistent with what a user would see in jconsole for this metric.
This is a good discussion for our metric data model (cc @jmacd).
Effectively, if we don't have a known value, we should report "not a known value" rather than 0 or another magic number. Specifically, we can model this in the latest OTLP: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L325
There's a bit-flag field on all number data points that can report "no value seen", and that's what this should be using. I don't think there's a way to hook this up in the API yet, but we should open a bug against BOTH the Java SDK and the Metrics Specification to allow this (cc @reyang).
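As a rough illustration, here is a small Go sketch of checking that bit. The constant mirrors FLAG_NO_RECORDED_VALUE = 1 from the DataPointFlags enum in the proto linked above; the bare uint32 flags value and the hasRecordedValue helper are simplifications for this example rather than the generated protobuf API.

package main

import "fmt"

// Value from the OTLP metrics proto's DataPointFlags enum:
// FLAG_NO_RECORDED_VALUE = 1.
const flagNoRecordedValue uint32 = 1

// hasRecordedValue reports whether a data point's flags say it carries a
// real observation, as opposed to "no value seen" for this interval.
func hasRecordedValue(flags uint32) bool {
	return flags&flagNoRecordedValue == 0
}

func main() {
	fmt.Println(hasRecordedValue(0))                   // true: normal point
	fmt.Println(hasRecordedValue(flagNoRecordedValue)) // false: no value seen
}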
Just to follow up on the 0 vs nil for no data: looking at jconsole, it also reports 0 for this metric during the times that it has no value. I still agree we should figure out the no-data behavior, but 1.9.0 reporting a 0 in that instance is at least consistent with what a user would see in jconsole for this metric.
Does that suggest there's a distinction between "metric doesn't exist" and "metric exists but has no value" in JMX?
In the latter case we could use the metric's existence to create a metric descriptor, but without a value I'm not sure we know what data type to use. In the absence of that I think we would have to treat both cases as "metric doesn't exist".
@jsuereth thanks for adding me here. I don't have the full context but I guess the question for me is whether we should allow the API to report "no value seen", especially for asynchronous instruments.
My current answer is no. I think if there is no value, the API shouldn't report anything and here is one example/reason:
An IoT (Internet of Things) device is reporting room temperature (in Fahrenheit) from a house; normally the data would look like {"room1": 68, "room2": 69, "room3": 72, "room4": 86}. Now the sensor in room4 has a hardware failure, so the temperature information is not available anymore.
If we don't report "no value", we will end up with {"room1": 68, "room2": 69, "room3": 72}; we can still deduce min, max, and average, and we also retain the information that room4 didn't provide data.
If we report "no value", we might end up with {"room1": 68, "room2": 69, "room3": 72, "room4": NoValue/null/nil/NaN/etc.}; the value for room4 will poison all the aggregations and makes it harder to define the behavior (and whether there are options to tweak it). For example, should "max" be 72 or Unknown/NaN?
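To make the poisoning concrete, here is a small Go sketch of that room-temperature scenario. The readings are the made-up numbers above, and NaN stands in for whatever "no value" placeholder might be reported; omitting room4 keeps min, max, and average well defined, while a placeholder value propagates into every aggregate.

package main

import (
	"fmt"
	"math"
)

// aggregate computes min, max, and average over the reported readings.
func aggregate(readings map[string]float64) (min, max, avg float64) {
	min, max = math.Inf(1), math.Inf(-1)
	var sum float64
	for _, v := range readings {
		min = math.Min(min, v)
		max = math.Max(max, v)
		sum += v
	}
	return min, max, sum / float64(len(readings))
}

func main() {
	// Option 1: room4's failed sensor simply isn't reported.
	omitted := map[string]float64{"room1": 68, "room2": 69, "room3": 72}
	// Option 2: room4 is reported with a "no value" placeholder such as NaN.
	withNaN := map[string]float64{"room1": 68, "room2": 69, "room3": 72, "room4": math.NaN()}

	fmt.Println(aggregate(omitted)) // 68 72 69.666...
	fmt.Println(aggregate(withNaN)) // NaN propagates into min, max, and avg
}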
Looks like @dehaansa ran into the issue with the latest release of the jmx jar on a GCE instance. We could not reproduce the issue locally on our Macs. I have a suspicion something is going wrong in the data transfer from jmx to collector-contrib. The issue is that one of the data points has a pdata.MetricValueType of nil when it reaches the metric transform processor.
If I capture the logging output of the jmx collection straight from the jar, the data point seems fine, with a metric value of 0. I think something is going wrong in the IPC handoff of the data. I'm going to continue investigating.
Upon looking back into this issue, I've found some weird behavior. I followed the metric collection in a debugger from the jmx jar to the gRPC export call, and then picked it up on the Go side when the gRPC receive handler is called. Here's what I saw: the data point's value was -1 on the Java side, and it was still -1 all the way down to where it's serialized into bytes in NumberDataPointMarshaler::writeTo. This causes the metric transform further down the pipeline to think the gauge is not all ints, so it turns them into doubles.
I could not replicate this behavior with a v1.9.0+ jmx jar. I did notice the numeric value was still -1 in the jmx v1.9.0 jar, but it ended up being 0 when it came out of the gRPC server on the Go side.
I'm not actually sure what's causing this, and I couldn't pinpoint exactly where things were changing. There are a few more Java calls beyond what I looked at; they handle the serialization to proto, but the data is all bytes at that point, so it was hard to figure out what was going on.
Verified that it's fixed in ops-agent 2.9.0, which uses a v1.9.0+ jmx jar.
Thanks @cpheps for investigating this so deeply!!
Description
jvm.memory.pool.used emits Double values and then Integer values. The metric descriptor is created automatically the first time the time series is received. In this case, the initial points have the value 0.00, which creates the metric descriptor with a valueType of Double (see below). However, the points that follow are integers, which get denied by the GCM API with the error messages below. Note: I saved all points before they were sent out; see Additional context for the behavior of the initial 0.00 points followed by integer values.
Metrics Descriptor - jvm.memory.pool.used
Steps to reproduce
Inside the GCE VM
Expectation: it should only generate jvm.memory.pool.used with integer values:
What applicable config did you use? Config: (e.g. the yaml config file)
Relevant Environment Information
sophieyfang_google_com@debian10-meow:~$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Additional context