aws-observability / aws-otel-collector

AWS Distro for OpenTelemetry Collector (see ADOT Roadmap at https://github.com/orgs/aws-observability/projects/4)
https://aws-otel.github.io/

No metrics pushed from the sidecar collector on ECS Fargate to Amazon Managed Prometheus #2493

Closed: mmorellareply closed this issue 11 months ago

mmorellareply commented 1 year ago

Describe the question
Our objective is to instrument our Python code, running as a task in an ECS Fargate cluster, and send custom metrics to Grafana through Prometheus. We've set up an ECS task with an aws-otel-collector sidecar. There are no error logs in CloudWatch, yet no metrics are being pushed from the collector to the Amazon Managed Prometheus instance we have. We are looking for a solution, or hints as to whether a mistake was made in the task definition, the config file, or the Python code. Thank you!

Steps to reproduce if your question is related to an action
Create an ECS Fargate cluster, a task, and an AMP instance, then run the container on the cluster. The task runs code that defines custom metrics and exposes them to be sent to Prometheus.

What did you expect to see?
I expected the aws-otel-collector to scrape the metrics exposed by the Python code and send them to Amazon Managed Prometheus.

Environment
The following is the task definition:

{
    "taskDefinitionArn": "**XXX**",
    "containerDefinitions": [
        {
            "name": "task_template-container",
            "image": "XXX",
            "cpu": 0,
            "portMappings": [
                {
                    "name": "metrics-source-8080-tcp",
                    "containerPort": 8080,
                    "hostPort": 8080,
                    "protocol": "udp",
                    "appProtocol": "http"
                }
            ],
            "essential": true,
            "mountPoints": [],
            "volumesFrom": [],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/TaskTemplateADOT",
                    "awslogs-region": "eu-west-2",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        },
        {
            "name": "aws-otel-collector",
            "image": "public.ecr.aws/aws-observability/aws-otel-collector:v0.35.0",
            "cpu": 0,
            "portMappings": [],
            "essential": true,
            "command": [
                "--config=/etc/ecs/ecs-amp-prometheus.yaml"
            ],
            "environment": [
                {
                    "name": "AWS_PROMETHEUS_SCRAPING_ENDPOINT",
                    "value": "127.0.0.1:8080"
                },
                {
                    "name": "AWS_PROMETHEUS_ENDPOINT",
                    "value": "https://aps-workspaces.eu-west-2.amazonaws.com/workspaces/XXX/api/v1/remote_write"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-create-group": "true",
                    "awslogs-group": "/ecs/ecs-aws-otel-sidecar-collector",
                    "awslogs-region": "eu-west-2",
                    "awslogs-stream-prefix": "ecs"
                }
            }
        }
    ],
    "family": "TaskTemplateADOT",
    "executionRoleArn": "arn:aws:iam::XXX:role/XXX",
    "networkMode": "awsvpc",
    "revision": 32,
    "volumes": [],
    "status": "ACTIVE",
    "requiresAttributes": [
        {
            "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
        },
        {
            "name": "ecs.capability.execution-role-awslogs"
        },
        {
            "name": "com.amazonaws.ecs.capability.ecr-auth"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
            "name": "ecs.capability.execution-role-ecr-pull"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
            "name": "ecs.capability.task-eni"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
        }
    ],
    "placementConstraints": [],
    "compatibilities": [
        "EC2",
        "FARGATE"
    ],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "1024",
    "memory": "3072",
    "runtimePlatform": {
        "cpuArchitecture": "X86_64",
        "operatingSystemFamily": "LINUX"
    },
    "registeredAt": "2023-11-20T15:33:38.071Z",
    "registeredBy": "XXX",
    "tags": []
}
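
A note on the wiring: in an awsvpc-mode task all containers share one network namespace, so the sidecar's scrape of 127.0.0.1:8080 reaches the application container directly and the port mapping is not involved. To reproduce from inside the task what the collector's prometheus receiver is expected to do, a hypothetical debug helper (not part of ADOT) could look like this:

import os
import urllib.request

# Fetch the scrape target named in AWS_PROMETHEUS_SCRAPING_ENDPOINT,
# falling back to the value used in the task definition above.
target = os.environ.get("AWS_PROMETHEUS_SCRAPING_ENDPOINT", "127.0.0.1:8080")
with urllib.request.urlopen(f"http://{target}/metrics", timeout=5) as resp:
    body = resp.read().decode()
print(body[:500])  # first lines of the Prometheus exposition output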

The main.py is structured as follows:


import random
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

from prometheus_client import start_http_server
from opentelemetry.exporter.prometheus import PrometheusMetricReader

resource = Resource(attributes={
    SERVICE_NAME: "Task-test-prometheus"
})

# Expose a Prometheus scrape endpoint on port 8080.
start_http_server(port=8080, addr="0.0.0.0")

# Bridge OpenTelemetry metrics to the Prometheus endpoint above.
reader = PrometheusMetricReader()

provider = MeterProvider(resource=resource, metric_readers=[reader])
metrics.set_meter_provider(provider)

meter = metrics.get_meter(__name__)

extracted_stuff = meter.create_counter(
    name="ExctractedStuff",
    description="Number of exctracted stuff",
    unit="1",
)
common_attributes = {
    "signal": "metric",
    "language": "python-manual-instrumentation",
    "metricType": "request",
}

# Record a handful of counter increments, pausing between them so a
# scraper has a chance to observe intermediate values.
extracted_stuff.add(random.randint(6, 70), attributes=common_attributes)
extracted_stuff.add(random.randint(6, 70), attributes=common_attributes)
time.sleep(10)
extracted_stuff.add(random.randint(6, 70), attributes=common_attributes)
time.sleep(0.5)
extracted_stuff.add(random.randint(6, 70), attributes=common_attributes)
time.sleep(0.5)
extracted_stuff.add(random.randint(6, 70), attributes=common_attributes)
time.sleep(0.5)
extracted_stuff.add(random.randint(6, 70), attributes=common_attributes)
time.sleep(5)

Running the code locally, I can curl the endpoint localhost:8080 and retrieve the metrics:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 935.0
python_gc_objects_collected_total{generation="1"} 345.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 196.0
python_gc_collections_total{generation="1"} 17.0
python_gc_collections_total{generation="2"} 1.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="11",patchlevel="4",version="3.11.4"} 1.0
# HELP exctractedstuff_1_total Number of exctracted stuff
# TYPE exctractedstuff_1_total counter
exctractedstuff_1_total{language="python-manual-instrumentation",metricType="request",signal="metric"} 129.0

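One thing worth flagging: the curl above succeeds while the script is running, but as written main.py exits after the final sleep, roughly 17 seconds in, and the prometheus_client HTTP server (a daemon thread) dies with the process. A collector scraping every 20 seconds (the interval used in the custom config further below) may therefore never observe a live endpoint. A minimal long-running sketch, keeping the original names and assuming the intent is to serve metrics indefinitely:

import random
import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from prometheus_client import start_http_server
from opentelemetry.exporter.prometheus import PrometheusMetricReader

# Same setup as above: expose a Prometheus endpoint and wire the
# OpenTelemetry SDK to it.
start_http_server(port=8080, addr="0.0.0.0")
provider = MeterProvider(
    resource=Resource(attributes={SERVICE_NAME: "Task-test-prometheus"}),
    metric_readers=[PrometheusMetricReader()],
)
metrics.set_meter_provider(provider)
meter = metrics.get_meter(__name__)

extracted_stuff = meter.create_counter(
    name="ExctractedStuff", description="Number of exctracted stuff", unit="1"
)
common_attributes = {
    "signal": "metric",
    "language": "python-manual-instrumentation",
    "metricType": "request",
}

# Keep the process, and with it the scrape endpoint, alive indefinitely.
while True:
    extracted_stuff.add(random.randint(6, 70), attributes=common_attributes)
    time.sleep(5)
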
Additional context
We previously tried to push metrics to CloudWatch through the EMF exporter, to no avail.

We also tried a custom configuration file, supplied through SSM Parameter Store:

extensions:
  health_check:
  sigv4auth:
    region: "eu-west-2"
    service: aps
    assume_role:
      arn: "arn:aws:iam::XXX:role/XXX"
      sts_region: "eu-west-2"

receivers:
  awsecscontainermetrics:
  prometheus:
    config:
      global:
        scrape_interval: 20s
        scrape_timeout: 10s
      scrape_configs:
        - job_name: "otel-collector"
          static_configs:
            - targets: ["127.0.0.1:8080"]

processors:
  batch/metrics:
    timeout: 60s
  resourcedetection:
    detectors:
      - env
      - system
      - ecs
      - ec2
  filter:
    metrics:

exporters:
  prometheusremotewrite:
    endpoint: "https://aps-workspaces.eu-west-2.amazonaws.com/workspaces/XXX/api/v1/remote_write"
    auth:
      authenticator: sigv4auth
    resource_to_telemetry_conversion:
      enabled: true
  logging:
    loglevel: info

service:
  pipelines:
    metrics/application:
      receivers: [prometheus]
      processors: [resourcedetection, batch/metrics]
      exporters: [logging,prometheusremotewrite]
    metrics:
      receivers: [awsecscontainermetrics]
      processors: [filter]
      exporters: [logging,prometheusremotewrite]
  extensions: [health_check, sigv4auth]
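
To check whether any samples actually reached the workspace, AMP's query API can be called directly with SigV4 auth. Below is a sketch using botocore, assuming credentials with aps:QueryMetrics permission; note that the counter is queried under its transformed name exctractedstuff_1_total, as shown in the curl output above:

import json
import urllib.request

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# Query the AMP workspace for the counter; the Prometheus exporter
# lower-cases the metric name and appends the unit and _total.
url = ("https://aps-workspaces.eu-west-2.amazonaws.com/workspaces/XXX"
       "/api/v1/query?query=exctractedstuff_1_total")
creds = boto3.Session().get_credentials().get_frozen_credentials()
req = AWSRequest(method="GET", url=url)
SigV4Auth(creds, "aps", "eu-west-2").add_auth(req)  # sign the request
with urllib.request.urlopen(
        urllib.request.Request(url, headers=dict(req.headers))) as resp:
    print(json.dumps(json.loads(resp.read()), indent=2))

If this returns an empty result while the logging exporter shows batches being exported, permissions would be the next thing to check, e.g. aps:RemoteWrite on the role assumed by the sigv4auth extension.
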
vsakaram commented 1 year ago

If you have AWS Enterprise Support, could you please cut a ticket to us through that channel regarding this issue?

Coombszy commented 8 months ago

@mmorellareply You marked the issue as closed. Did you figure out what the cause was?