aws-observability / aws-otel-lambda

AWS Distro for OpenTelemetry - AWS Lambda
https://aws-otel.github.io/
Apache License 2.0

ADOT Collector Dropping Exports in Lambda Environment #886

Open arun-annamalai opened 5 months ago

arun-annamalai commented 5 months ago



Describe the bug
I have a manually instrumented Java Lambda using the ADOT OTel Lambda layer, with the following setup:

Java SDK -> ADOT Collector -> OpenSearch exporter -> OpenSearch Ingestion Pipeline
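
For reference, a minimal sketch of how the SDK side of that pipeline could be wired (the class and init() helper names are illustrative, not the reporter's actual code; the endpoint matches the OTLP gRPC receiver in the collector config further down):

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class OtelSetup {
    private OtelSetup() {}

    // Build an SDK whose spans are batched and sent to the collector layer's
    // OTLP/gRPC receiver on localhost:4317 (see the collector config below).
    public static OpenTelemetry init() {
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://localhost:4317")
                .build();
        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();
        return OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .buildAndRegisterGlobal();
    }
}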

I expect all of my spans to be exported by the collector to the OpenSearch Ingestion pipeline, but the collector appears to be shut down right after the function ends, and in roughly 30% of invocations the last span is not exported.

I get the following error:

{"level":"error","ts":1711232220.1311545,"caller":"exporterhelper/common.go:49","msg":"Exporting failed. Dropping data. Try enabling sending_queue to survive temporary failures.","kind":"exporter","data_type":"traces","name":"otlphttp","dropped_items":1,"error":"request is cancelled or timed out failed to make an HTTP request: Post \"https://opensearch-pipeline-ozauvc3fcrr3dz6we2uophr43u.us-west-2.osis.amazonaws.com/entry-pipeline/v1/traces\": context canceled","stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.(*errorLoggingRequestSender).send \tgo.opentelemetry.io/collector/exporter@v0.90.1/exporterhelper/common.go:49 go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send \tgo.opentelemetry.io/collector/exporter@v0.90.1/exporterhelper/common.go:193 go.opentelemetry.io/collector/exporter/exporterhelper.NewTracesExporter.func1 \tgo.opentelemetry.io/collector/exporter@v0.90.1/exporterhelper/traces.go:98 go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces \tgo.opentelemetry.io/collector/consumer@v0.90.1/traces.go:25 go.opentelemetry.io/collector/internal/fanoutconsumer.(*tracesConsumer).ConsumeTraces \tgo.opentelemetry.io/collector@v0.90.1/internal/fanoutconsumer/traces.go:73 go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces \tgo.opentelemetry.io/collector/consumer@v0.90.1/traces.go:25 go.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace.(*Receiver).Export \tgo.opentelemetry.io/collector/receiver/otlpreceiver@v0.90.1/internal/trace/otlp.go:41 go.opentelemetry.io/collector/pdata/ptrace/ptraceotlp.rawTracesServer.Export \tgo.opentelemetry.io/collector/pdata@v1.0.0/ptrace/ptraceotlp/grpc.go:89 go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler.func1 \tgo.opentelemetry.io/collector/pdata@v1.0.0/internal/data/protogen/collector/trace/v1/trace_service.pb.go:310 go.opentelemetry.io/collector/config/configgrpc.(*GRPCServerSettings).toServerOption.enhanceWithClientInformation.func9 \tgo.opentelemetry.io/collector/config/configgrpc@v0.90.1/configgrpc.go:396 go.opentelemetry.io/collector/pdata/internal/data/protogen/collector/trace/v1._TraceService_Export_Handler \tgo.opentelemetry.io/collector/pdata@v1.0.0/internal/data/protogen/collector/trace/v1/trace_service.pb.go:312 google.golang.org/grpc.(*Server).processUnaryRPC \tgoogle.golang.org/grpc@v1.59.0/server.go:1343 google.golang.org/grpc.(*Server).handleStream \tgoogle.golang.org/grpc@v1.59.0/server.go:1737 google.golang.org/grpc.(*Server).serveStreams.func1.1 \tgoogle.golang.org/grpc@v1.59.0/server.go:986"}

Steps to reproduce

  1. Use the collector config attached below.
  2. Create an ARM Java 11 Lambda with the latest OTel collector Lambda layer, ARN: arn:aws:lambda:us-west-2:901920570463:layer:aws-otel-collector-arm64-ver-0-90-1:1
  3. Create one trace with one span inside it; the last line of the Java function should be span.end() (see the sketch below).
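
A minimal sketch of the handler described in step 3 (class and method names are illustrative, not the reporter's actual code; it assumes a globally registered SDK such as the one sketched above):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

public class ReproHandler implements RequestHandler<Object, String> {
    private static final Tracer tracer = GlobalOpenTelemetry.getTracer("repro");

    @Override
    public String handleRequest(Object input, Context context) {
        // One trace with one span per invocation.
        Span span = tracer.spanBuilder("handler").startSpan();
        String result = "ok"; // business logic would go here
        // Last statement before returning, as in step 3.
        span.end();
        return result;
    }
}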

What did you expect to see?
I expect to see all spans exported to the otlphttp endpoint of the OpenSearch Ingestion pipeline.

What did you see instead?
The last span was dropped.

What version of collector/language SDK did you use?
Collector Lambda layer: arn:aws:lambda:us-west-2:901920570463:layer:aws-otel-collector-arm64-ver-0-90-1:1

What language layer did you use?
Java

CloudWatch Logs: log-events-viewer-result.csv

Collector Config

extensions:
  sigv4auth:
    region: "us-west-2"
    service: "osis"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "localhost:4317"
      http:
        endpoint: "localhost:4318"

exporters:
  logging:
  awsxray:
  otlphttp:
    traces_endpoint: "https://opensearch-pipeline-ozauvc3fcrr3dz6we2uophr43u.us-west-2.osis.amazonaws.com/entry-pipeline/v1/traces"
    auth:
      authenticator: sigv4auth
    compression: none

service:
  extensions: [sigv4auth]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray, otlphttp]
    metrics:
      receivers: [otlp]
      exporters: [logging]
  telemetry:
    metrics:
      address: localhost:8888

Additional context
Adding a 1-second sleep before my Lambda returns works around the problem, but shouldn't the Lambda environment be designed to flush all spans from the collector before the collector shuts down?
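
A sketch of that sleep workaround (the helper name is made up; the 1-second figure is the one from the report and is a workaround, not a guaranteed fix):

import io.opentelemetry.api.trace.Span;

public final class SleepWorkaround {
    private SleepWorkaround() {}

    // End the span, then pause briefly so the collector layer has a chance
    // to export it before the execution environment is frozen or shut down.
    public static void endAndWait(Span span) {
        span.end();
        try {
            Thread.sleep(1000L); // 1-second delay, per the report
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}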

Dr-Emann commented 4 months ago

Related to #787, I believe

meijeran commented 4 months ago

Instead of adding a sleep, you could act on the SIGTERM signal; an example of how to do that can be found here: https://github.com/aws-samples/graceful-shutdown-with-aws-lambda/tree/main/java-demo
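
A rough sketch of that idea on the application side (the helper is hypothetical and only flushes the SDK, not the collector; the linked demo shows the general SIGTERM-handling pattern):

import java.util.concurrent.TimeUnit;
import io.opentelemetry.sdk.trace.SdkTracerProvider;

public final class GracefulShutdown {
    private GracefulShutdown() {}

    // Register once at startup. When the Lambda runtime sends SIGTERM
    // (which it does when an external extension such as the collector layer
    // is registered), the JVM runs shutdown hooks, giving us a chance to
    // flush any in-flight spans before the process exits.
    public static void register(SdkTracerProvider tracerProvider) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            tracerProvider.forceFlush().join(2, TimeUnit.SECONDS);
            tracerProvider.shutdown().join(2, TimeUnit.SECONDS);
        }, "otel-sigterm-hook"));
    }
}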

arun-annamalai commented 4 months ago

Ah, I see. Is there a way to force-flush all spans from the collector when it receives the SIGTERM signal? From the application's side, it looks like all spans are already exported to the collector.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled