GoogleCloudPlatform / opentelemetry-operations-go

Apache License 2.0
131 stars 99 forks

OTEL Exporter maps Prometheus metrics to generic_task monitored resource #434

Closed jsirianni closed 1 year ago

jsirianni commented 2 years ago

Problem

I am scraping Prometheus metrics (Aerospike exporter) using the Prometheus receiver. The Google exporter maps these metrics to the generic_task monitored resource type, when I would expect it to map to generic_node.

This is because the Prometheus receiver sets the following resource labels:

Resource labels (from the logging exporter):

Resource labels:
     -> service.name: STRING(aerospike)
     -> service.instance.id: STRING(127.0.0.1:9145)
     -> net.host.port: STRING(9145)
     -> http.scheme: STRING(http)
     -> host.name: STRING(fedora)
     -> os.type: STRING(linux)

[Screenshot from 2022-06-20 17-36-17]

Expected Behavior

I would expect Prometheus metrics to map to generic_node. The description of generic_task does not match Prometheus metrics:

Description: A generic task identifies an application process for which no more specific resource is applicable, such as a process scheduled by a custom orchestration system. The label values must uniquely identify the task.

The description for generic_node makes more sense in my opinion:

Description: A generic node identifies a machine or other computational resource for which no more specific resource type is applicable. The label values must uniquely identify the node.

I have considered renaming service.name and service.instance.id; however, the OpenTelemetry semantic conventions encourage their use.

Environment

Collector: opentelemetry-collector-contrib.jsirianni main branch and v0.53.0

Configuration:

receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: 'aerospike'
        scrape_interval: 30s
        static_configs:
        - targets:
          - '127.0.0.1:9145'

processors:
  resourcedetection/system:
    detectors: ["system"]
    system:
      hostname_sources: ["os"]

  batch:

exporters: 
  googlecloud:
    metric:
      prefix: workload.googleapis.com/aerospike
      resource_filters:
        - prefix: service.*

  logging:
    loglevel: debug

service:
  pipelines:
    metrics:
      receivers:
      - prometheus
      processors:
      - resourcedetection/system
      - batch
      exporters:
      - googlecloud
      - logging

Mapping code

        genericTask: {
            location: {
                otelKeys: []string{
                    string(semconv.CloudAvailabilityZoneKey),
                    string(semconv.CloudRegionKey),
                },
                fallbackLiteral: "global",
            },
            namespace: {otelKeys: []string{string(semconv.ServiceNamespaceKey)}},
            job:       {otelKeys: []string{string(semconv.ServiceNameKey)}},
            taskID:    {otelKeys: []string{string(semconv.ServiceInstanceIDKey)}},
        },
        genericNode: {
            location: {
                otelKeys: []string{
                    string(semconv.CloudAvailabilityZoneKey),
                    string(semconv.CloudRegionKey),
                },
                fallbackLiteral: "global",
            },
            namespace: {otelKeys: []string{string(semconv.ServiceNamespaceKey)}},
            nodeID:    {otelKeys: []string{string(semconv.HostIDKey), string(semconv.HostNameKey)}},
        },
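The precedence described by the mapping code above can be sketched as a small, self-contained function (a hypothetical simplification, not the exporter's actual implementation): because service.name and service.instance.id satisfy the generic_task requirements, that resource type is chosen before host.name is ever consulted.

```go
package main

import "fmt"

// chooseResource is a hypothetical sketch of the selection order:
// generic_task wins whenever service.name and service.instance.id are
// present, so host.name is never reached for Prometheus-receiver metrics.
func chooseResource(attrs map[string]string) string {
	if attrs["service.name"] != "" && attrs["service.instance.id"] != "" {
		return "generic_task" // job + taskID requirements satisfied
	}
	if attrs["host.id"] != "" || attrs["host.name"] != "" {
		return "generic_node" // nodeID requirement satisfied
	}
	return "generic_node" // fallback
}

func main() {
	// Attributes from the Aerospike scrape in this issue.
	attrs := map[string]string{
		"service.name":        "aerospike",
		"service.instance.id": "127.0.0.1:9145",
		"host.name":           "fedora",
	}
	fmt.Println(chooseResource(attrs))
}
```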
dashpole commented 2 years ago

What platform (GCE (VMs), GKE, cloud run, etc.) are you running on?

I would highly encourage you to use the gcp resource detector. It is designed to detect the resource attributes required to map to the correct monitored resource based on the platform. Neither generic_node nor generic_task is ideal.
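For reference, enabling the gcp detector looks roughly like this (a sketch; the exact attributes it produces depend on the platform and collector version):

```yaml
processors:
  resourcedetection/gcp:
    detectors: ["gcp"]
    timeout: 10s
```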

jsirianni commented 2 years ago

Outside of GCP. We support customers who are migrating to GCP and have workloads running on-premises and in other clouds. The Google team has always instructed us to send to the workload APIs with generic_node.

dashpole commented 2 years ago

Ah, in that case I would recommend removing the service.instance.id attribute (with the resource processor). It isn't as useful since it is a local target.
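A minimal resource processor configuration for that might look like the following sketch (field names per the contrib resource processor; verify against your collector version):

```yaml
processors:
  resource:
    attributes:
      - key: service.instance.id
        action: delete
```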

jsirianni commented 2 years ago

It would be nice to keep it, considering it is called out in the semantic conventions and expected by users of the Prometheus receiver. https://opentelemetry.io/docs/reference/specification/resource/semantic_conventions/#service.

Would it be possible to map to generic_node when host.name is set? It seems the exporter could check for generic_node first, since host.name (node_id) is not used by generic_task.

dashpole commented 2 years ago

When using a Prometheus server to monitor non-local applications, I'd argue generic_task is the right monitored resource (since we don't know the host name of the application we are scraping). But for local Prometheus targets, you are right that generic_node is preferable.

If we changed the order, that would be a breaking change for existing users, so I'd like to avoid doing that if possible. Another option for you would be to use resource_filters to match net.host.port; that would preserve the important piece of information from the instance ID you are using. Alternatively, we could add configuration to allow influencing monitored resource selection.
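Assuming the exporter's resource_filters match by attribute-key prefix, that suggestion would look roughly like this sketch:

```yaml
exporters:
  googlecloud:
    metric:
      resource_filters:
        - prefix: service.
        - prefix: net.host.port
```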

jsirianni commented 2 years ago

In this case, service.instance.id is the hostname:port of the instance being scraped. In my example I am scraping localhost, but I could scrape one or many remote IPs / hostnames.

Would it be appropriate to rename service.instance.id to host.name?

dashpole commented 2 years ago

Ideally, it would be set to <host IP>:<port> to be closer to what you would get from a Prometheus server scraping it, but <host.name>:<port> or <host.id> are also reasonable choices.

jsuereth commented 2 years ago

I'd like to classify this issue as bug/enhancement/wontfix, etc.

From what I can tell on this discussion:

  1. The behavior today is intentional: in our view, generic_task fits the description of metrics scraped from a Prometheus exporter.
  2. There is a desire to use generic_node for a particular node when scraping localhost metric exporters. I believe this is a feature request.

As such, I'm going to mark this as a feature request for #2. Feel free to update the classification if this is wrong.

dashpole commented 1 year ago

Sorry, I had formatting errors above. I think overriding service.instance.id with host IP + port or host name + port is a better option than sending to generic_node, and it should be possible with the transform processor.
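A sketch of that override with the transform processor's OTTL might look like the following (assumes host.name and net.host.port are present on the resource; statement syntax varies across contrib versions, so verify before use):

```yaml
processors:
  transform:
    metric_statements:
      - context: resource
        statements:
          - set(attributes["service.instance.id"], Concat([attributes["host.name"], attributes["net.host.port"]], ":"))
```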