aws-observability / aws-otel-community

Welcome to the AWS Distro for OpenTelemetry project. If you're using monitoring and observability tools for AWS products and services, this is a great place to ask questions, request features and network with other community members.
https://aws-otel.github.io/
Apache License 2.0

ADOT Collector/instrumentation not creating X-Ray spans on ECS Fargate, NodeJS app #946

Open AA-morganh opened 7 months ago

AA-morganh commented 7 months ago

Hi, I'm having an issue with ADOT on ECS Fargate. I'm seeing CloudWatch logs, metrics, and Container Insights metrics, as well as some start-up fs spans in X-Ray, but I'm not getting any application spans in X-Ray. My auto-instrumentation code is as follows:

/*instrumentation.ts*/
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-proto';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-proto';
import { diag, DiagConsoleLogger, DiagLogLevel } from '@opentelemetry/api';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { AWSXRayPropagator } from '@opentelemetry/propagator-aws-xray';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { AWSXRayIdGenerator } from '@opentelemetry/id-generator-aws-xray';

if (!process.env.DISABLE_TELEMETRY) {
  // For troubleshooting, set the log level to DiagLogLevel.DEBUG
  diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.INFO);

  const traceExporter = process.env.OTLP_COLLECTOR_TRACE_URL
    ? new OTLPTraceExporter({
        url: process.env.OTLP_COLLECTOR_TRACE_URL,
      })
    : new OTLPTraceExporter({ url: 'http://127.0.0.1:4318/v1/traces' });

  const metricReader = new PeriodicExportingMetricReader({
    exporter: process.env.OTLP_COLLECTOR_METRICS_URL
      ? new OTLPMetricExporter({
          url: process.env.OTLP_COLLECTOR_METRICS_URL,
        })
      : new OTLPMetricExporter({ url: 'http://127.0.0.1:4318/v1/metrics' }),
  });

  // NOTE: NodeSDK normally builds a BatchSpanProcessor from `traceExporter` itself;
  // when an explicit `spanProcessor` is also passed (as below), this one is used.
  const spanProcessor = new BatchSpanProcessor(traceExporter);

  const sdk = new NodeSDK({
    textMapPropagator: new AWSXRayPropagator(),
    traceExporter: traceExporter,
    metricReader: metricReader,
    spanProcessor: spanProcessor,
    idGenerator: new AWSXRayIdGenerator(),
    instrumentations: [getNodeAutoInstrumentations()],
    resource: new Resource({
      [SemanticResourceAttributes.SERVICE_NAME]: 'MyService',
      [SemanticResourceAttributes.SERVICE_VERSION]: '1.0',
    }),
  });

  sdk.start();

  process.on('SIGTERM', () => {
    sdk
      .shutdown()
      .then(() => console.log('Tracing and Metrics terminated'))
      .catch((error) => console.log('Error terminating tracing and metrics', error))
      .finally(() => process.exit(0));
  });
}

export default {};
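
For reference, a file like this only takes effect if it is evaluated before the application's own modules, either by preloading it (e.g. node --require ./dist/instrumentation.js dist/index.js) or by making it the very first import of the entrypoint. A minimal sketch, with a hypothetical entrypoint and startServer helper:

/* index.ts (hypothetical entrypoint) */
// The instrumentation module must run before anything else so the
// auto-instrumentations can patch modules as they are loaded.
import './instrumentation';

// Application code is imported only after the SDK has been started.
import { startServer } from './server'; // hypothetical helper
startServer();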

My TaskDef looks like this (my CI replaces a bunch of tokens in here):

{
  "family": "myService",
  "containerDefinitions": [
    {
      "name": "myService",
      "image": "REPLACE_REPOSITORY_URI:REPLACE_IMAGE_TAG",
      "healthCheck": {
            "command": ["CMD-SHELL", "wget -q -S -O - localhost:8080/healthcheck"],
            "interval": 5,
            "retries": 10,
            "timeout": 3
      },
      "portMappings": [
        {
            "containerPort": 8080,
            "hostPort": 8080,
            "protocol": "tcp"
        }
      ],
      "essential": true,
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "/ecs/REPLACE_STAGE-myService",
            "awslogs-region": "REPLACE_AWS_REGION",
            "awslogs-stream-prefix": "ecs"
        },
        "secretOptions": []
      },
      "dependsOn": [{
        "containerName": "aws-otel-collector",
        "condition": "HEALTHY"
      }],
      "environment": [
                {
                  "name": "ACCOUNT_ID",
                  "value": "REPLACE_AWS_ACCOUNT_ID"
                },
                {
                  "name": "REGION",
                  "value": "REPLACE_AWS_REGION"
                },
                {
                  "name": "STAGE",
                  "value": "REPLACE_STAGE"
                },
                {
                  "name": "NO_COLOR",
                  "value": "NO_COLOR"
                },
                {
                  "name": "LatestSchema",
                  "value": "REPLACE_LATEST_SCHEMA"
                },
                {
                  "name": "JWT_SECRET",
                  "value": "REPLACE_SECRET_ARN"
                }
        ]
    },
    {
      "name": "aws-otel-collector",
      "image": "REPLACE_AWS_ACCOUNT_ID.dkr.ecr.REPLACE_AWS_REGION.amazonaws.com/ecr-public/aws-observability/aws-otel-collector:latest",
      "essential": true,
      "command": [
                "--set=service.telemetry.logs.level=DEBUG", "--config=/etc/ecs/container-insights/otel-task-metrics-config.yaml"
      ],
      "user": "0:0",
      "healthCheck": {
            "command": ["/healthcheck"],
            "interval": 5,
            "retries": 10,
            "timeout": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "/ecs/REPLACE_STAGE-aws-otel-sidecar-collector",
            "awslogs-region": "REPLACE_AWS_REGION",
            "awslogs-stream-prefix": "ecs"
        },
        "secretOptions": []
      }
    },
    {
      "name": "aws-otel-emitter",
      "image": "REPLACE_AWS_ACCOUNT_ID.dkr.ecr.REPLACE_AWS_REGION.amazonaws.com/ecr-public/aws-otel-test/aws-otel-goxray-sample-app:latest",
      "essential": false,
      "healthCheck": {
            "command": ["CMD-SHELL", "curl -f http://localhost:5000 || exit 1"],
            "interval": 5,
            "retries": 10,
            "timeout": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": "/ecs/REPLACE_STAGE-aws-otel-sidecar-emitter",
            "awslogs-region": "REPLACE_AWS_REGION",
            "awslogs-stream-prefix": "ecs"
        },
        "secretOptions": []
      }
    }
  ],
  "taskRoleArn": "REPLACE_TASK_ROLE_ARN",
  "executionRoleArn": "REPLACE_EXECUTION_ROLE_ARN",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512"
}

I have verbose logging enabled on both the SDK and the collector, and I'm not seeing anything that looks suspicious to me, other than that my expected automatic and manual spans never show up.

On a local docker-compose setup with a plain upstream OpenTelemetry Collector I do see my spans making it to a grafana/tempo instance, so I think the instrumentation itself is largely set up correctly. Any guidance would be a huge help.
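
For comparison, the kind of traces pipeline the sidecar collector needs for spans to reach X-Ray looks roughly like the sketch below (illustrative only; I have not verified that this matches the built-in otel-task-metrics-config.yaml referenced in the task definition):

# collector-config.yaml (sketch, not the contents of the built-in file)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  awsxray:
    region: REPLACE_AWS_REGION

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]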

bmxpiku commented 5 months ago

I think we're facing the same issue right now; something changed in some minor Node version update and broke the collector for us.

EDIT: This was due to us migrating the whole project to ESM; downgrading to Node 18.16 and using the experimental loader flag fixed it:

# https://gajus.com/blog/how-to-add-sentry-tracing-to-your-node-js-app#nodejs-esm-modules
# https://github.com/open-telemetry/opentelemetry-js/issues/4392
# https://github.com/open-telemetry/opentelemetry-js/issues/4547
# https://github.com/open-telemetry/opentelemetry-js/issues/4553
CMD ["node", "--experimental-loader=@opentelemetry/instrumentation/hook.mjs", "dist/index.js"]

Though I talked with my team and we won't do it for the production build, so I think we'll have to live without X-Ray for a while.