aws-observability / aws-otel-collector

AWS Distro for OpenTelemetry Collector (see ADOT Roadmap at https://github.com/orgs/aws-observability/projects/4)
https://aws-otel.github.io/

awsprometheusremotewrite exporter logs - exemplar missing labels #959

Closed gautam-nutalapati closed 2 years ago

gautam-nutalapati commented 2 years ago

Describe the question: Why does the awsprometheusremotewrite exporter in aws-otel-collector throw the error below?

"error": "Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: exemplar missing labels, timestamp: 1643818281979 series: {__name__=\"http_client_duration_bucket\", http_flavor=\"1.1\", http_method=\"GET\", http_status_code=\"200\", http_url=\"http://169.254.170.2/v2/credentials/5f993586-e2c0-4a1d-91d0-e48ba719e22a\", le=\"5\"} la"

Steps to reproduce if your question is related to an action

Environment: N/A

Additional context: It seems this error is thrown by Prometheus when trace information is not tied to metrics, e.g. a metric with exemplar information from the link: my_histogram_bucket{le="0.5"} 205 # {TraceID="b94cc547624c3062e17d743db422210e"} 0.175XXX 1.6XXX

Can this error be ignored? Or am I missing some configuration that is causing this error? I cannot find much information about this online. I don't need traces to be tied to metrics.
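For illustration, the difference appears to be in the exemplar's label set: accepted series carry exemplars with at least one label (e.g. a trace ID), while the rejected ones seem to carry an exemplar with an empty label set (the "labels: {}" visible in the full error further below). A rough exposition-style comparison, with illustrative exemplar values:

api_latency_bucket{le="2500"} 43 # {TraceID="b94cc547624c3062e17d743db422210e"} 1830 1644267527.143
api_latency_bucket{le="2500"} 43 # {} 1830 1644267527.143

The first line has a labeled exemplar; the second has an exemplar with no labels, which the remote-write endpoint appears to reject.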

OTEL-Collector configuration:

extensions:
  health_check:
  pprof:
    endpoint: :1777
  zpages:
    endpoint: :55679

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # https://github.com/open-telemetry/opentelemetry-collector/blob/main/processor/memorylimiterprocessor/README.md
  memory_limiter:
    check_interval: 1s
    limit_percentage: 50
    spike_limit_percentage: 30
  batch/traces:
    timeout: 10s
    send_batch_size: 50
  batch/metrics:
    timeout: 10s

exporters:
  awsxray:
    region: "${AWS_REGION}"
  awsprometheusremotewrite:
    endpoint: "${PROMETHEUS_WRITE_ENDPOINT}"
    aws_auth:
      service: "aps"
      region: "${AWS_REGION}"
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch/traces]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [batch/metrics]
      exporters: [awsprometheusremotewrite]
    # Pipeline to send metrics to local prometheus workspace
    metrics/2:
      receivers: [otlp]
      processors: [batch/metrics]
      exporters: [ prometheus ]

Update: I have kept the metrics forwarding running despite this error to test it out more. The error seems to affect exporting just one bucket out of all the buckets. I configured the AWS OTel Collector to forward metrics to both the prometheus and awsprometheusremotewrite exporters. On the Prometheus endpoint exposed by aws-otel-collector, I see the data below for the histogram:

api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="5"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="10"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="25"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="50"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="75"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="100"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="250"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="500"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="750"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="1000"} 0
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="2500"} 43
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="5000"} 44
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="7500"} 44
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="10000"} 44
api_latency_bucket{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc",le="+Inf"} 44
api_latency_sum{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc"} 70168
api_latency_count{api_method="GET",api_name="/users/v1/profiles/me",env="dev-local",status_code="500",svc="user-profile-svc"} 44

But in Grafana, the histogram I plot looks as below:

[screenshot: grafana-missing-bucket]

As we can see, AMP is missing data for one bucket. The related error shows data being dropped for this bucket:

2022-02-07T21:33:26.114Z    error   exporterhelper/queued_retry.go:183  Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "name": "awsprometheusremotewrite", "error": "Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: exemplar missing labels, timestamp: 1644267527143 series: {__name__=\"api_latency_bucket\", api_method=\"GET\", api_name=\"/users/v1/profiles/me\", env=\"dev-local\", le=\"2500\", status_code=\"500\", svc=\"user-profile-svc\", test=\"gautam\"} labels: {}\n", "dropped_items": 39}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/metrics.go:134
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/queued_retry_inmemory.go:105
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/internal/bounded_memory_queue.go:99
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/internal/bounded_memory_queue.go:78

In addition, the metrics below are published to Prometheus but are dropped when writing to AMP. These are default HTTP metrics generated by the AWS OTel Java agent: http_client_duration_bucket and http_server_duration_bucket.

2022-02-07T21:38:25.801Z    error   exporterhelper/queued_retry.go:183  Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "name": "awsprometheusremotewrite", "error": "Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: exemplar missing labels, timestamp: 1644267527142 series: {__name__=\"http_client_duration_bucket\", env=\"gautam-dev\", http_flavor=\"1.1\", http_method=\"GET\", http_url=\"http://localhost:9900/stux/v1/users/97378103842048256\", le=\"5\", svc=\"user-profile-svc\", tes", "dropped_items": 39}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/metrics.go:134
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/queued_retry_inmemory.go:105
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/internal/bounded_memory_queue.go:99
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/internal/bounded_memory_queue.go:78
2022-02-07T21:39:25.868Z    error   exporterhelper/queued_retry.go:183  Exporting failed. The error is not retryable. Dropping data.    {"kind": "exporter", "name": "awsprometheusremotewrite", "error": "Permanent error: remote write returned HTTP status 400 Bad Request; err = <nil>: exemplar missing labels, timestamp: 1644265641009 series: {__name__=\"http_server_duration_bucket\", env=\"gautam-dev\", http_flavor=\"1.1\", http_host=\"localhost:8060\", http_method=\"GET\", http_scheme=\"http\", http_status_code=\"403\", le=\"750\", svc=\"user-profile-s", "dropped_items": 39}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/queued_retry.go:183
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/metrics.go:134
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/queued_retry_inmemory.go:105
go.opentelemetry.io/collector/exporter/exporterhelper/internal.consumerFunc.consume
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/internal/bounded_memory_queue.go:99
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func2
    go.opentelemetry.io/collector@v0.43.1/exporter/exporterhelper/internal/bounded_memory_queue.go:78
sethAmazon commented 2 years ago

Does your Cortex endpoint end with /api/v1/remote_write?

gautam-nutalapati commented 2 years ago

Yes it does. I pass the following to docker run: -e PROMETHEUS_WRITE_ENDPOINT=https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-xxx/api/v1/remote_write

gautam-nutalapati commented 2 years ago

I am trying to reproduce this via prometheusremotewriteexporter. This looks related to opentelemetry-collector-contrib#5578

sethAmazon commented 2 years ago

If this issue also occurs in prometheusremotewriteexporter, we should create an issue in contrib (https://github.com/open-telemetry/opentelemetry-collector-contrib) and link this one.

sethAmazon commented 2 years ago

Can you turn on debug logs and post them, please?
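For reference, debug logging can usually be turned on through the collector's service telemetry settings; a minimal sketch:

service:
  telemetry:
    logs:
      level: debug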

gautam-nutalapati commented 2 years ago

Logs of aws-otel-collector: I created an opentelemetry-demo-app to reproduce the issue locally. Please let me know if this issue should be moved to opentelemetry-collector-contrib; I am not able to figure out where the root cause is.

sethAmazon commented 2 years ago

If the issue is in both the AWS and non-AWS writers, it should go to contrib, as it would no longer be an AWS-specific issue, IMO.

gautam-nutalapati commented 2 years ago

@sethAmazon Is it possible to write to Amazon Managed Prometheus using prometheusremotewriteexporter? If not, I cannot reproduce this issue with prometheusremotewriteexporter.
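For reference, later opentelemetry-collector-contrib releases added a sigv4auth extension, which should let the plain prometheusremotewrite exporter sign requests to Amazon Managed Prometheus. A rough sketch, assuming a collector build that bundles both components (endpoint and region variables as in the config above):

extensions:
  sigv4auth:
    region: "${AWS_REGION}"
    service: "aps"

exporters:
  prometheusremotewrite:
    endpoint: "${PROMETHEUS_WRITE_ENDPOINT}"
    auth:
      authenticator: sigv4auth

service:
  extensions: [sigv4auth]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]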

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

github-actions[bot] commented 2 years ago

This issue was closed because it has been marked as stale for 30 days with no activity.