GoogleCloudPlatform / ops-agent


jvm.memory.pool.used emits values in Double then Integer #288

Closed. sophieyfang closed this issue 2 years ago.

sophieyfang commented 2 years ago

Description: jvm.memory.pool.used emits Double and then Integer values. The metric descriptor is created automatically the first time GCM receives the time series. In this case, the initial points have the value 0.00, which creates the metric descriptor with valueType DOUBLE (see below). The points that follow are integers, which the GCM API rejects with the error messages below:

Nov  2 14:36:39 app-admin-01 otelopscol[5791]: 2021-11-02T14:36:39.337Z#011info#011exporterhelper/queued_retry.go:231#011Exporting failed. Will retry the request after interval.#011{"kind": "exporter", "name": "googlecloud", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older end time than the most recent point.: timeSeries[0-31]; Value type for metric workload.googleapis.com/jvm.memory.pool.used must be DOUBLE, but is INT64.: timeSeries[32-37]\nerror details: name = Unknown  desc = total_point_count:38 errors:{status:{code:3} point_count:38}", "interval": "22.291635104s"}

Note: I saved all the points before they were sent out; see Additional context for the behavior of the initial 0.00 points followed by integer values.

Metrics Descriptor - jvm.memory.pool.used

{
  "name": "projects/sophieyfang-test/metricDescriptors/workload.googleapis.com/jvm.memory.pool.used",
  "labels": [
    {
      "key": "name"
    }
  ],
  "metricKind": "GAUGE",
  "valueType": "DOUBLE",
  "type": "workload.googleapis.com/jvm.memory.pool.used",
  "monitoredResourceTypes": [
    "anthos_l4lb",
    "apigee.googleapis.com/Environment",
    "apigee.googleapis.com/Proxy",
    "apigee.googleapis.com/ProxyV2",
    "apigee.googleapis.com/TargetV2",
    "aws_alb_load_balancer",
    "aws_cloudfront_distribution",
    ........
}
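
To make the failure mode concrete, here is a minimal, dependency-free Go sketch (hypothetical; not the exporter's or GCM's actual code) of how the first point's value type pins the auto-created descriptor, after which points of the other type are rejected:

package main

import "fmt"

// valueType mimics the valueType field of a GCM metric descriptor.
type valueType string

const (
	typeDouble valueType = "DOUBLE"
	typeInt64  valueType = "INT64"
)

// descriptors simulates GCM auto-creating a metric descriptor from the
// first point it receives for a metric type.
var descriptors = map[string]valueType{}

// writePoint pins the descriptor to the first point's type and rejects
// later points of a different type, mirroring the InvalidArgument error
// quoted above.
func writePoint(metric string, vt valueType) error {
	pinned, ok := descriptors[metric]
	if !ok {
		descriptors[metric] = vt // the first point decides the descriptor's valueType
		return nil
	}
	if pinned != vt {
		return fmt.Errorf("value type for metric %s must be %s, but is %s", metric, pinned, vt)
	}
	return nil
}

func main() {
	const m = "workload.googleapis.com/jvm.memory.pool.used"
	fmt.Println(writePoint(m, typeDouble)) // initial 0.00 point: descriptor created as DOUBLE
	fmt.Println(writePoint(m, typeInt64))  // later integer point: rejected
}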

Steps to reproduce

Inside the GCE VM

sophieyfang_google_com@ubuntu-1804-jvm:~$ sudo apt update
sophieyfang_google_com@ubuntu-1804-jvm:~$ sudo apt install -y default-jdk
sophieyfang_google_com@ubuntu-1804-jvm:~$ cat <<EOF > hello.java
> class HelloWorld {
>   public static void main(String args[]) throws InterruptedException {
>     while (true) {
>       Thread.sleep(1000);
>     }
>   }
> }
> EOF
sophieyfang_google_com@ubuntu-1804-jvm:~$ javac hello.java
sophieyfang_google_com@ubuntu-1804-jvm-clean:~$ java -ea -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=127.0.0.1 HelloWorld > /dev/null 2>&1 &
sophieyfang_google_com@ubuntu-1804-jvm:~$ curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sophieyfang_google_com@ubuntu-1804-jvm:~$ sudo bash add-google-cloud-ops-agent-repo.sh --also-install  --version=2.7.0
sophieyfang_google_com@ubuntu-1804-jvm:~$ sudo systemctl stop google-cloud-ops-agent.service
sophieyfang_google_com@debian10-meow:~$ /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --add-instance-id=false --config=otel.yaml 

Expectation: it should only emit jvm.memory.pool.used with integer values.

What applicable config did you use? Config (e.g. the YAML config file):

sophieyfang@sophieyfang:~$ cat /tmp/otel.yaml 
exporters:
  logging:
    loglevel: debug
processors:
  filter/jvm:
    metrics:
      include:
        match_type: strict
        metric_names:
        - workload.googleapis.com/jvm.memory.pool.used
  metricstransform/jvm__pipeline_jvm__metrics_1:
    transforms:
    - action: update
      include: ^(.*)$$
      match_type: regexp
      new_name: workload.googleapis.com/$${1}
  normalizesums/jvm__pipeline_jvm__metrics_0: {}
receivers:
  jmx/jvm__pipeline_jvm__metrics:
    collection_interval: 30s
    endpoint: service:jmx:rmi:///jndi/rmi://127.0.0.1:9010/jmxrmi
    jar_path: /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/opentelemetry-java-contrib-jmx-metrics.jar
    target_system: jvm
service:
  pipelines:
    metrics/jvm__pipeline_jvm__metrics:
      exporters:
      - logging
      processors:
      - normalizesums/jvm__pipeline_jvm__metrics_0
      - metricstransform/jvm__pipeline_jvm__metrics_1
      - filter/jvm
      receivers:
      - jmx/jvm__pipeline_jvm__metrics

Relevant Environment Information

sophieyfang_google_com@debian10-meow:~$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Additional context

sophieyfang_google_com@debian10-meow:~$ /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/otelopscol --add-instance-id=false --config=otel.yaml 
2021-12-09T22:37:55.137Z    info    service/collector.go:174    Applying configuration...
2021-12-09T22:37:55.138Z    info    builder/exporters_builder.go:259    Exporter was built. {"kind": "exporter", "name": "logging"}
2021-12-09T22:37:55.138Z    info    filterprocessor@v0.37.1-0.20211028205244-e6fab4102b84/filter_processor.go:73    Metric filter configured    {"kind": "processor", "name": "filter/jvm", "include match_type": "strict", "include expressions": [], "include metric names": ["workload.googleapis.com/jvm.memory.pool.used"], "include metrics with resource attributes": null, "exclude match_type": "", "exclude expressions": [], "exclude metric names": [], "exclude metrics with resource attributes": null}
2021-12-09T22:37:55.138Z    info    builder/pipelines_builder.go:220    Pipeline was built. {"pipeline_name": "metrics/jvm__pipeline_jvm__metrics", "pipeline_datatype": "metrics"}
2021-12-09T22:37:55.139Z    info    builder/receivers_builder.go:228    Receiver was built. {"kind": "receiver", "name": "jmx/jvm__pipeline_jvm__metrics", "datatype": "metrics"}
2021-12-09T22:37:55.139Z    info    service/service.go:86   Starting extensions...
2021-12-09T22:37:55.139Z    info    service/service.go:91   Starting exporters...
2021-12-09T22:37:55.139Z    info    builder/exporters_builder.go:40 Exporter is starting... {"kind": "exporter", "name": "logging"}
2021-12-09T22:37:55.140Z    info    builder/exporters_builder.go:48 Exporter started.   {"kind": "exporter", "name": "logging"}
2021-12-09T22:37:55.140Z    info    service/service.go:96   Starting processors...
2021-12-09T22:37:55.140Z    info    builder/pipelines_builder.go:52 Pipeline is starting... {"pipeline_name": "metrics/jvm__pipeline_jvm__metrics", "pipeline_datatype": "metrics"}
2021-12-09T22:37:55.140Z    info    builder/pipelines_builder.go:63 Pipeline is started.    {"pipeline_name": "metrics/jvm__pipeline_jvm__metrics", "pipeline_datatype": "metrics"}
2021-12-09T22:37:55.141Z    info    service/service.go:101  Starting receivers...
2021-12-09T22:37:55.141Z    info    builder/receivers_builder.go:68 Receiver is starting... {"kind": "receiver", "name": "jmx/jvm__pipeline_jvm__metrics"}
2021-12-09T22:37:55.141Z    info    otlpreceiver/otlp.go:68 Starting GRPC server on endpoint 0.0.0.0:41229  {"kind": "receiver", "name": "jmx/jvm__pipeline_jvm__metrics"}
2021-12-09T22:37:55.142Z    info    builder/receivers_builder.go:73 Receiver started.   {"kind": "receiver", "name": "jmx/jvm__pipeline_jvm__metrics"}
2021-12-09T22:37:55.143Z    info    service/telemetry.go:92 Setting up own telemetry...
2021-12-09T22:37:55.144Z    info    service/telemetry.go:116    Serving Prometheus metrics  {"address": ":8888", "level": "basic", "service.instance.id": "", "service.version": "latest"}
2021-12-09T22:37:55.144Z    info    service/collector.go:230    Starting google-cloud-metrics-agent...  {"Version": "latest", "NumCPU": 1}
2021-12-09T22:37:55.146Z    info    service/collector.go:132    Everything is ready. Begin running and processing data.
2021-12-09T22:38:00.200Z    INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 1}
2021-12-09T22:38:00.200Z    DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service:java)
     -> telemetry.sdk.version: STRING(1.7.0)
     -> telemetry.sdk.language: STRING(java)
     -> telemetry.sdk.name: STRING(opentelemetry)
InstrumentationLibraryMetrics #0
InstrumentationLibrary  
Metric #0
Descriptor:
     -> Name: workload.googleapis.com/jvm.memory.pool.used
     -> Description: current memory pool usage
     -> Unit: by
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> name: STRING(Eden Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:37:59.731 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> name: STRING(CodeHeap 'profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:37:59.731 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
     -> name: STRING(Metaspace)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:37:59.731 +0000 UTC
Value: 0.000000
NumberDataPoints #3
Data point attributes:
     -> name: STRING(Survivor Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:37:59.731 +0000 UTC
Value: 0.000000
NumberDataPoints #4
Data point attributes:
     -> name: STRING(Tenured Gen)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:37:59.731 +0000 UTC
Value: 0.000000
NumberDataPoints #5
Data point attributes:
     -> name: STRING(CodeHeap 'non-nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:37:59.731 +0000 UTC
Value: 0.000000
NumberDataPoints #6
Data point attributes:
     -> name: STRING(CodeHeap 'non-profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:37:59.731 +0000 UTC
Value: 0.000000
NumberDataPoints #7
Data point attributes:
     -> name: STRING(Compressed Class Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:37:59.731 +0000 UTC
Value: 0.000000
2021-12-09T22:38:25.922Z    INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 1}
2021-12-09T22:38:25.922Z    DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service:java)
     -> telemetry.sdk.version: STRING(1.7.0)
     -> telemetry.sdk.language: STRING(java)
     -> telemetry.sdk.name: STRING(opentelemetry)
InstrumentationLibraryMetrics #0
InstrumentationLibrary  
Metric #0
Descriptor:
     -> Name: workload.googleapis.com/jvm.memory.pool.used
     -> Description: current memory pool usage
     -> Unit: by
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> name: STRING(Eden Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:38:25.913 +0000 UTC
Value: 0.000000
NumberDataPoints #1
Data point attributes:
     -> name: STRING(CodeHeap 'profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:38:25.913 +0000 UTC
Value: 0.000000
NumberDataPoints #2
Data point attributes:
     -> name: STRING(Metaspace)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:38:25.913 +0000 UTC
Value: 0.000000
NumberDataPoints #3
Data point attributes:
     -> name: STRING(Survivor Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:38:25.913 +0000 UTC
Value: 0.000000
NumberDataPoints #4
Data point attributes:
     -> name: STRING(Tenured Gen)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:38:25.913 +0000 UTC
Value: 0.000000
NumberDataPoints #5
Data point attributes:
     -> name: STRING(CodeHeap 'non-nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:38:25.913 +0000 UTC
Value: 0.000000
......

2021-12-09T22:44:40.955Z    info    service/collector.go:146    Received signal from OS {"signal": "interrupt"}
2021-12-09T22:44:40.955Z    info    service/collector.go:242    Starting shutdown...
2021-12-09T22:44:40.955Z    info    service/service.go:121  Stopping receivers...
2021-12-09T22:44:40.978Z    INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 1}
2021-12-09T22:44:40.978Z    DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service:java)
     -> telemetry.sdk.version: STRING(1.7.0)
     -> telemetry.sdk.language: STRING(java)
     -> telemetry.sdk.name: STRING(opentelemetry)
InstrumentationLibraryMetrics #0
InstrumentationLibrary  
Metric #0
Descriptor:
     -> Name: workload.googleapis.com/jvm.memory.pool.used
     -> Description: current memory pool usage
     -> Unit: by
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> name: STRING(Eden Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:44:40.963 +0000 UTC
Value: 2420528
NumberDataPoints #1
Data point attributes:
     -> name: STRING(CodeHeap 'profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:44:40.963 +0000 UTC
Value: 2686720
NumberDataPoints #2
Data point attributes:
     -> name: STRING(Metaspace)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:44:40.963 +0000 UTC
Value: 6660568
NumberDataPoints #3
Data point attributes:
     -> name: STRING(Survivor Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:44:40.963 +0000 UTC
Value: 1966080
NumberDataPoints #4
Data point attributes:
     -> name: STRING(Tenured Gen)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:44:40.963 +0000 UTC
Value: 504768
NumberDataPoints #5
Data point attributes:
     -> name: STRING(CodeHeap 'non-nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:44:40.963 +0000 UTC
Value: 1212928
NumberDataPoints #6
Data point attributes:
     -> name: STRING(CodeHeap 'non-profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:44:40.963 +0000 UTC
Value: 597504
NumberDataPoints #7
Data point attributes:
     -> name: STRING(Compressed Class Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-09 22:44:40.963 +0000 UTC
Value: 761120
sophieyfang commented 2 years ago

One possible fix is to use the following so that all values are doubles, for consistency and to avoid conflicts:

operations:
  - action: toggle_scalar_data_type

More info on the toggle_scalar_data_type operation: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#toggle-datatype
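
For illustration, here is roughly where the operation would slot into the transform from the config above (a sketch under that assumption, not a verified fix; the $$ escaping matches the generated config shown earlier):

processors:
  metricstransform/jvm__pipeline_jvm__metrics_1:
    transforms:
    - action: update
      include: ^(.*)$$
      match_type: regexp
      new_name: workload.googleapis.com/$${1}
      operations:
      - action: toggle_scalar_data_type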

djaglowski commented 2 years ago

@dehaansa @Mrod1598, likely there is an issue in the JMX Metric Gatherer that applies more broadly than just this one case. Will one of you please take this issue?

qingling128 commented 2 years ago

Yeah, we found a few other metrics that are ingested as DOUBLE when they're supposed to be INT.

djaglowski commented 2 years ago

Am I understanding correctly that this has only been observed when the first data point is 0.00?

sophieyfang commented 2 years ago

The problem is with metricstransform/jvm__pipeline_jvm__metrics_1

sophieyfang_google_com@debian10-jvm:~$ cat otel.yaml 
exporters:
  googlecloud:
    metric:
      prefix: ""
    user_agent: Google-Cloud-Ops-Agent-Metrics/latest (BuildDistro=build_distro;Platform=linux;ShortName=linux_platform;ShortVersion=linux_platform_version)
  logging:
    loglevel: debug
processors:
  filter/jvm:
    metrics:
      include:
        match_type: strict
        metric_names:
        - jvm.memory.pool.used
  resourcedetection/_global_0:
    detectors:
    - gce
  metricstransform/jvm__pipeline_jvm__metrics_1:
    transforms:
    - action: update
      include: ^(.*)$$
      match_type: regexp
      new_name: workload.googleapis.com/$${1}
  normalizesums/jvm__pipeline_jvm__metrics_0: {}
receivers:
  jmx/jvm__pipeline_jvm__metrics:
    collection_interval: 30s
    endpoint: service:jmx:rmi:///jndi/rmi://127.0.0.1:9010/jmxrmi
    jar_path: /opt/google-cloud-ops-agent/subagents/opentelemetry-collector/opentelemetry-java-contrib-jmx-metrics.jar
    target_system: jvm
service:
  pipelines:
    metrics/jvm__pipeline_jvm__metrics:
      exporters:
      - googlecloud
      - logging
      processors:
      - filter/jvm
      - metricstransform/jvm__pipeline_jvm__metrics_1 # with this line the 0.00 values appear; without it they don't
      receivers:
      - jmx/jvm__pipeline_jvm__metrics
sophieyfang commented 2 years ago

One possible fix is to use the following so that all values are doubles, for consistency and to avoid conflicts:

operations:
  - action: toggle_scalar_data_type

More info on the toggle_scalar_data_type operation: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/metricstransformprocessor#toggle-datatype

This method won't help, as it toggles int to double and double to int, which results in:

2021-12-20T16:53:25.510Z    DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service:java)
     -> telemetry.sdk.version: STRING(1.7.0)
     -> telemetry.sdk.language: STRING(java)
     -> telemetry.sdk.name: STRING(opentelemetry)
InstrumentationLibraryMetrics #0
InstrumentationLibrary  
Metric #0
Descriptor:
     -> Name: workload.googleapis.com/jvm.memory.pool.used
     -> Description: current memory pool usage
     -> Unit: by
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> name: STRING(Eden Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #1
Data point attributes:
     -> name: STRING(CodeHeap 'profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #2
Data point attributes:
     -> name: STRING(Metaspace)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #3
Data point attributes:
     -> name: STRING(Survivor Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #4
Data point attributes:
     -> name: STRING(Tenured Gen)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #5
Data point attributes:
     -> name: STRING(CodeHeap 'non-nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #6
Data point attributes:
     -> name: STRING(CodeHeap 'non-profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0
NumberDataPoints #7
Data point attributes:
     -> name: STRING(Compressed Class Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:25.505 +0000 UTC
Value: 0

2021-12-20T16:53:30.081Z    INFO    loggingexporter/logging_exporter.go:56  MetricsExporter {"#metrics": 1}
2021-12-20T16:53:30.081Z    DEBUG   loggingexporter/logging_exporter.go:66  ResourceMetrics #0
Resource labels:
     -> service.name: STRING(unknown_service:java)
     -> telemetry.sdk.version: STRING(1.7.0)
     -> telemetry.sdk.language: STRING(java)
     -> telemetry.sdk.name: STRING(opentelemetry)
InstrumentationLibraryMetrics #0
InstrumentationLibrary  
Metric #0
Descriptor:
     -> Name: workload.googleapis.com/jvm.memory.pool.used
     -> Description: current memory pool usage
     -> Unit: by
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> name: STRING(Eden Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 515480.000000
NumberDataPoints #1
Data point attributes:
     -> name: STRING(CodeHeap 'profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 2459520.000000
NumberDataPoints #2
Data point attributes:
     -> name: STRING(Metaspace)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 6651576.000000
NumberDataPoints #3
Data point attributes:
     -> name: STRING(Survivor Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 1966072.000000
NumberDataPoints #4
Data point attributes:
     -> name: STRING(Tenured Gen)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 519896.000000
NumberDataPoints #5
Data point attributes:
     -> name: STRING(CodeHeap 'non-nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 1213312.000000
NumberDataPoints #6
Data point attributes:
     -> name: STRING(CodeHeap 'non-profiled nmethods')
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 539136.000000
NumberDataPoints #7
Data point attributes:
     -> name: STRING(Compressed Class Space)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2021-12-20 16:53:30.071 +0000 UTC
Value: 766512.000000
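
In other words, toggle_scalar_data_type flips each point's type unconditionally: the initial double 0.00 batches become int 0, and the later int batches become doubles, so the stream stays mixed with the types merely swapped. A minimal, dependency-free Go sketch of that reasoning (hypothetical, simplified):

package main

import "fmt"

type pointType string

const (
	typeInt    pointType = "INT"
	typeDouble pointType = "DOUBLE"
)

// toggle mirrors toggle_scalar_data_type: every point's type is swapped,
// so a mixed stream can never become homogeneous.
func toggle(t pointType) pointType {
	if t == typeInt {
		return typeDouble
	}
	return typeInt
}

func main() {
	// Double 0.00 batches first, then integer batches with real values.
	stream := []pointType{typeDouble, typeDouble, typeInt, typeInt}
	for i, t := range stream {
		stream[i] = toggle(t)
	}
	fmt.Println(stream) // [INT INT DOUBLE DOUBLE]: still mixed, types merely swapped
}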
sophieyfang commented 2 years ago

Sorry for assigning the task to @quentinmit; for a second I thought it was the agentmetricsprocessor owned by us, but it's actually the metricstransform processor.

qingling128 commented 2 years ago

Per discussion offline, @dehaansa is looking into this now.

cpheps commented 2 years ago

OK, so I've been able to replicate the issue reliably using the above instructions. I found that the 1.7.0 version of the opentelemetry-java-contrib-jmx-metrics.jar (the one currently packaged with the Ops Agent) exhibits this issue. The latest version, 1.9.0, does not.

From what I can tell, for JMX metrics that are not currently available, 1.7.0 returns a nil value, which ends up translating to a none-type metric in the metric transformer (here) instead of an int. The transformer then assumes that if a gauge metric doesn't have all-int data points, it must be a double, shown here.
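
A minimal, dependency-free Go sketch of that decision (hypothetical; the actual processor code is what's linked above):

package main

import "fmt"

type pointType string

const (
	typeNone pointType = "NONE" // what a nil value from the 1.7.0 jar turns into
	typeInt  pointType = "INT"
)

// gaugeIsInt mirrors the described check: the gauge counts as int only if
// every data point is an int, so a single none-type point flips the whole
// gauge (including the points that do have values) to double.
func gaugeIsInt(points []pointType) bool {
	for _, p := range points {
		if p != typeInt {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(gaugeIsInt([]pointType{typeInt, typeNone})) // false: exported as DOUBLE (1.7.0)
	fmt.Println(gaugeIsInt([]pointType{typeInt, typeInt}))  // true: exported as INT (1.9.0)
}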

In the 1.9.0 version, it returns an int 0 for metrics that aren't available, so the gauge is reported as an int. I couldn't find a changelog entry that might explain this in the 1.7.0...1.9.0 changelog. My guess is it was something in the upgrade from opentelemetry-java 1.7.0 to 1.9.0, but nothing stood out to me from that changelog either.

@sophieyfang I would recommend updating to v1.9.0 of the JMX agent. I could not reproduce this issue at all with that version, and could do so reliably with 1.7.0.

dehaansa commented 2 years ago

The JMX component is being updated in https://github.com/GoogleCloudPlatform/ops-agent/pull/315, so we should be able to validate either in builds based on that PR or once it has been merged.

quentinmit commented 2 years ago

From what I can tell, for JMX metrics that are not currently available, 1.7.0 returns a nil value, which ends up translating to a none-type metric in the metric transformer (here) instead of an int. The transformer then assumes that if a gauge metric doesn't have all-int data points, it must be a double, shown here.

What does "not currently available" mean in this context? I don't understand how either none or 0 make sense; if the metric doesn't exist it doesn't exist. I'm worried that a 0 in a metric, whether it's a double or an int, could cause alerts to fire/graphs to be wrong.

cpheps commented 2 years ago

Yeah, let me elaborate. The metric that caused this issue was related to garbage collection. When I connected jconsole to the JMX endpoint, the metric in question showed its value as UNAVAILABLE. Eventually a value did appear; my guess is that the metric doesn't have a value until a GC occurs.

if the metric doesn't exist, it doesn't exist. I'm worried that a 0 in a metric, whether it's a double or an int, could cause alerts to fire or graphs to be wrong.

I agree with this, and it's likely a larger issue with how the java-contrib JMX collector works. I haven't dug in fully, but from what I've observed, the JMX collector will always report all the metrics it's supposed to collect, even if they don't have a value.

So I think there's two larger issues:

  1. The JMX jar reports metrics that don't have values (1.7.0 and 1.9.0 do so in different ways).
  2. The metricstransform processor turns all data points for a metric into 0.000 when it determines that not all data points for the gauge are ints.

My proposal is to update the JMX jar now to mitigate this issue. I know it's not a perfect solution, but it's better than the current behavior. Also, we should file issues for the two points above. I'll take on writing them up unless someone feels that's not the right course of action.

cpheps commented 2 years ago

Just to follow up on the 0 vs. nil question for missing data: jconsole also reports 0 for this metric during the times that it has no value. I still agree we should figure out the no-data behavior, but 1.9.0 reporting a 0 in that instance is at least consistent with what a user would see in jconsole for this metric.

(Screenshot: jconsole showing the metric reporting 0.)

jsuereth commented 2 years ago

This is a good discussion for our metric data model (cc @jmacd).

Effectively, if we don't have a known value, we should report "not a known value" rather than 0 or another magic number. Specifically, we can model this in the latest OTLP: https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto#L325

There's a bit-flag field on all number data points that can report "no value seen", and that's what this should be using. I don't think there's a way to hook this up in the API yet, but we should open a bug against BOTH the Java SDK and the Metrics Specification to allow this (cc @reyang).
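
For reference, the relevant pieces of the linked proto look roughly like this (paraphrased from opentelemetry-proto; see the link above for the exact definition):

message NumberDataPoint {
  // ...
  // Bit field for per-point flags; see DataPointFlags below.
  uint32 flags = 8;
}

// FLAG_NO_RECORDED_VALUE marks a point as "no value seen" instead of
// forcing a 0 or another sentinel value.
enum DataPointFlags {
  FLAG_NONE = 0;
  FLAG_NO_RECORDED_VALUE = 1;
}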

quentinmit commented 2 years ago

Just to follow up on the 0 vs. nil question for missing data: jconsole also reports 0 for this metric during the times that it has no value. I still agree we should figure out the no-data behavior, but 1.9.0 reporting a 0 in that instance is at least consistent with what a user would see in jconsole for this metric.

Does that suggest there's a distinction between "metric doesn't exist" and "metric exists but has no value" in JMX?

In the latter case we could use the metric's existence to create a metric descriptor, but without a value I'm not sure we know what data type to use. In the absence of that I think we would have to treat both cases as "metric doesn't exist".

reyang commented 2 years ago

@jsuereth thanks for adding me here. I don't have the full context, but I guess the question for me is whether we should allow the API to report "no value seen", especially for asynchronous instruments.

My current answer is no. I think if there is no value, the API shouldn't report anything. Here is one example/reason:

An IoT (Internet of Things) device is reporting room temperature (in Fahrenheit) from a house; normally the data would look like {"room1": 68, "room2": 69, "room3": 72, "room4": 86}. Now the sensor in room4 has a hardware failure, so the temperature information is not available anymore.

If we don't report "no value", we end up with {"room1": 68, "room2": 69, "room3": 72}; we can still deduce min, max, and average, and we also have the information that room4 didn't provide data.

If we report "no value", we might end up with {"room1": 68, "room2": 69, "room3": 72, "room4": NoValue/null/nil/NaN/etc.}; the value for room4 will poison all the aggregations and make it harder to define the behavior (and whether there are options to tweak it), e.g. should "max" be 72 or Unknown/NaN?
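
A small Go sketch of that poisoning (a hypothetical illustration; Go's math.Max propagates NaN):

package main

import (
	"fmt"
	"math"
)

// maxOf folds math.Max over the values; math.Max(x, NaN) is NaN, so one
// "no value" sentinel poisons the whole aggregate.
func maxOf(xs []float64) float64 {
	m := xs[0]
	for _, x := range xs[1:] {
		m = math.Max(m, x)
	}
	return m
}

func main() {
	fmt.Println(maxOf([]float64{68, 69, 72}))             // 72: room4 simply absent
	fmt.Println(maxOf([]float64{68, 69, 72, math.NaN()})) // NaN: room4 reported as NaN
}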

cpheps commented 2 years ago

Looks like @dehaansa ran into the issue with the latest release of the JMX jar on a GCE instance. We could not reproduce the issue locally on our Macs. I have a suspicion something is going wrong in the data transfer from the JMX jar to collector-contrib. The issue is that one of the data points has a pdata.MetricValueType of nil when it reaches the metric transform processor.

If I capture the logging output of the JMX collection straight from the jar, the data point seems fine, with a metric value of 0. I think something is going wrong in the IPC handoff of the data. I'm going to continue investigating.

cpheps commented 2 years ago

Upon looking into this issue again, I've found some weird behavior. I followed the metric collection in a debugger from the JMX jar to the gRPC export call, then picked it up on the Go side when the gRPC receive handler is called. Here's what I saw:

  1. Upon collection, the metric value in question reports a numeric value of -1.
  2. The numeric value stays -1 all the way down to where it's serialized into bytes in NumberDataPointMarshaler::writeTo.
  3. On the Go side, the gRPC handler receives the request and the NumberDataPoint.value field is nil.

This causes the metric transform further down the pipeline to conclude that the gauge is not all ints, so it turns the points into doubles.

I could not replicate this behavior with a v1.9.0+ JMX jar. I did notice the numeric value was still -1 in the v1.9.0 jar, but it ended up as 0 when it came out of the gRPC server on the Go side.

I'm not actually sure what's causing this, and I couldn't pinpoint exactly where things were changing. There are a few more Java calls past where I looked; they handle the serialization to proto, but the data is all bytes at that point, so it was hard to figure out what was going on.

sophieyfang commented 2 years ago

Verified that it's fixed in Ops Agent 2.9.0, which uses the v1.9.0+ JMX jar.

sophieyfang commented 2 years ago

Thanks @cpheps for investigating this so deeply!!