elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0

APM UI doesn't show connected traces sent from AWS Distro for OTel (Lambda) #458

Open michaelhyatt opened 3 years ago

michaelhyatt commented 3 years ago

APM Server version (apm-server version): 7.13.1

Description of the problem including expected versus actual behavior: I managed to get distributed tracing working with AWS X-Ray, with the OTel collector picking the traces up. The flow is curl => API-GW => producer-lambda => API-GW => consumer-lambda. The code (serverless): https://github.com/michaelhyatt/serverless-aws-python-api-worker-otel

The interesting thing is that the 2 transactions and 1 span are recorded correctly with the same trace.id; however, the APM UI doesn't show a proper trace.

AWS X-Ray showing the trace with 2 components: [screenshot]

Discover shows 3 records, 2 transactions and 1 span with the same trace.id: [screenshot]

The APM UI is showing 2 services, but no traces: [screenshot]

Steps to reproduce: As per README, requires the serverless framework.

JSON documents

### Consumer transaction

```json
{ "_index": "apm-7.13.1-transaction-000001", "_type": "_doc", "_id": "TdX7OHoBmWidblu2OYlN", "_version": 1, "_score": null, "fields": { "transaction.name.text": ["handler.consumer"], "service.framework.version": ["0.16.dev0"], "observer.name": ["instance-0000000008"], "labels.faas_name": ["aws-python-api-worker-project-dev-consumer"], "service.language.name": ["python"], "labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer"], "transaction.sampled": [true], "transaction.id": ["1b5693f6e07de9c7"], "trace.id": ["60d331f41b8d383766d158e55359aaca"], "processor.event": ["transaction"], "agent.name": ["opentelemetry/python"], "labels.faas_execution": ["c7ab9723-9e43-4488-89d0-e1759190b984"], "event.outcome": ["unknown"], "cloud.region": ["ap-southeast-2"], "service.name": ["aws-python-api-worker-project-dev-consumer"], "service.framework.name": ["opentelemetry.instrumentation.aws_lambda"], "processor.name": ["transaction"], "transaction.duration.us": [698], "labels.faas_version": ["$LATEST"], "observer.version_major": [7], "observer.hostname": ["60e5fd4e9651"], "transaction.type": ["custom"], "cloud.provider": ["aws"], "event.ingested": ["2021-06-23T13:07:06.188Z"], "timestamp.us": [1624453624947661], "@timestamp": ["2021-06-23T13:07:04.947Z"], "ecs.version": ["1.8.0"], "observer.type": ["apm-server"], "observer.version": ["7.13.1"], "agent.version": ["1.2.0"], "parent.id": ["8ae9f83b97976607"], "transaction.name": ["handler.consumer"] }, "highlight": { "trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"] }, "sort": [1624453624947] }
```

### Producer HTTP span

```json
{ "_index": "apm-7.13.1-span-000001", "_type": "_doc", "_id": "T9X7OHoBmWidblu2OYlN", "_version": 1, "_score": null, "fields": { "span.destination.service.type": ["external"], "span.name": ["HTTP GET"], "destination.port": [443], "observer.name": ["instance-0000000008"], "labels.faas_name": ["aws-python-api-worker-project-dev-producer"], "trace.id": ["60d331f41b8d383766d158e55359aaca"], "span.duration.us": [2256175], "processor.event": ["span"], "agent.name": ["opentelemetry/python"], "destination.address": ["3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com"], "event.outcome": ["success"], "cloud.region": ["ap-southeast-2"], "service.name": ["aws-python-api-worker-project-dev-producer"], "processor.name": ["transaction"], "labels.faas_version": ["$LATEST"], "span.id": ["898690921573f428"], "span.subtype": ["http"], "observer.version_major": [7], "observer.hostname": ["60e5fd4e9651"], "span.type": ["external"], "cloud.provider": ["aws"], "timestamp.us": [1624453622779541], "@timestamp": ["2021-06-23T13:07:02.779Z"], "ecs.version": ["1.8.0"], "observer.type": ["apm-server"], "observer.version": ["7.13.1"], "agent.version": ["1.2.0"], "parent.id": ["8b8e0c642d96e5f2"], "span.destination.service.name": ["https://3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com"], "span.destination.service.resource": ["3y00jobl0c.execute-api.ap-southeast-2.amazonaws.com:443"] }, "highlight": { "trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"] }, "sort": [1624453622779] }
```

### Producer transaction

```json
{ "_index": "apm-7.13.1-transaction-000001", "_type": "_doc", "_id": "TtX7OHoBmWidblu2OYlN", "_version": 1, "_score": null, "fields": { "transaction.name.text": ["handler.producer"], "service.framework.version": ["0.21b0"], "observer.name": ["instance-0000000008"], "labels.faas_name": ["aws-python-api-worker-project-dev-producer"], "service.language.name": ["python"], "labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-producer"], "transaction.sampled": [true], "transaction.id": ["8b8e0c642d96e5f2"], "trace.id": ["60d331f41b8d383766d158e55359aaca"], "processor.event": ["transaction"], "agent.name": ["opentelemetry/python"], "labels.faas_execution": ["fdcc0a5f-ce57-4201-a631-6097957f4a78"], "event.outcome": ["unknown"], "cloud.region": ["ap-southeast-2"], "service.name": ["aws-python-api-worker-project-dev-producer"], "service.framework.name": ["opentelemetry.instrumentation.requests"], "processor.name": ["transaction"], "transaction.duration.us": [2259030], "labels.faas_version": ["$LATEST"], "observer.version_major": [7], "observer.hostname": ["60e5fd4e9651"], "transaction.type": ["custom"], "cloud.provider": ["aws"], "event.ingested": ["2021-06-23T13:07:06.189Z"], "timestamp.us": [1624453622778742], "@timestamp": ["2021-06-23T13:07:02.778Z"], "ecs.version": ["1.8.0"], "observer.type": ["apm-server"], "observer.version": ["7.13.1"], "agent.version": ["1.2.0"], "parent.id": ["918243f3ba6e3f71"], "transaction.name": ["handler.producer"] }, "highlight": { "trace.id": ["@kibana-highlighted-field@60d331f41b8d383766d158e55359aaca@/kibana-highlighted-field@"] }, "sort": [1624453622778] }
```
axw commented 3 years ago

@michaelhyatt would you be able to run the opentelemetry-collector with the logging exporter, and capture the logged spans? Then we can compare to the docs we're recording.
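For reference, enabling the logging exporter in an OpenTelemetry Collector of that era looked roughly like this (a sketch, not the exact config used here; the `loglevel: debug` option existed in Collector versions of that time and was later replaced by `verbosity`):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging]
```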

One thing I can see is that the consumer transaction's parent.id does not refer to either of the other events. Also, the producer transaction has a parent.id - so neither of the transactions is considered the root, hence why they don't show up under Traces. I'm guessing that the API Gateway is updating traceparent, but there doesn't appear to be any spans for API Gateway. I'm wildly guessing that API Gateway is sending XRay events but not OTel spans.
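That reasoning can be sketched against the three 7.13.1 documents above (IDs copied from the JSON; treating "root" as a parentless event is a simplification of what the APM UI actually does):

```python
# Events copied from the Elasticsearch documents above (trace 60d331f4...).
events = [
    {"id": "8b8e0c642d96e5f2", "parent": "918243f3ba6e3f71", "name": "handler.producer"},
    {"id": "898690921573f428", "parent": "8b8e0c642d96e5f2", "name": "HTTP GET"},
    {"id": "1b5693f6e07de9c7", "parent": "8ae9f83b97976607", "name": "handler.consumer"},
]

known_ids = {e["id"] for e in events}

# A trace root is an event with no parent at all; an "orphan" is one whose
# parent.id points at an event that was never recorded.
roots = [e["name"] for e in events if e["parent"] is None]
orphans = [e["name"] for e in events if e["parent"] not in known_ids]

print(roots)    # [] -- no event is parentless, so the UI finds no trace root
print(orphans)  # ['handler.producer', 'handler.consumer']
```

Both transactions claim parents that never arrived, so nothing qualifies as the root and the trace list stays empty.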

michaelhyatt commented 3 years ago

@axw I have the debug logs from the embedded OTel collector recording the messages it sends. The 2 logs are attached. Here is an example of the debug output for the span that the consumer reports:


```
2021-06-23T23:00:34.750Z    DEBUG   loggingexporter/logging_exporter.go:48  ResourceSpans #0
Resource labels:
-> telemetry.sdk.language: STRING(python)
-> telemetry.sdk.name: STRING(opentelemetry)
-> telemetry.sdk.version: STRING(1.2.0)
-> cloud.region: STRING(ap-southeast-2)
-> cloud.provider: STRING(aws)
-> faas.name: STRING(aws-python-api-worker-project-dev-consumer)
-> faas.version: STRING($LATEST)
-> service.name: STRING(aws-python-api-worker-project-dev-consumer)
InstrumentationLibrarySpans #0
InstrumentationLibrary opentelemetry.instrumentation.aws_lambda 0.16.dev0
Span #0
Trace ID       : 60d3bd0d1c71895e0786483748de7acf
Parent ID      : 84ecb3d0f1da98b3
ID             : ebd0715a456a282b
Name           : handler.consumer
Kind           : SPAN_KIND_SERVER
Start time     : 2021-06-23 23:00:34.733421785 +0000 UTC
End time       : 2021-06-23 23:00:34.734093944 +0000 UTC
Status code    : STATUS_CODE_UNSET
Status message :
Attributes:
-> faas.execution: STRING(6c00c38a-b1fe-4aac-955e-345a91831150)
-> faas.id: STRING(arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer)
-> faas.name: STRING(aws-python-api-worker-project-dev-consumer)
-> faas.version: STRING($LATEST)

END RequestId: 6c00c38a-b1fe-4aac-955e-345a91831150
```

consumer.txt producer.txt

AlexanderWert commented 3 years ago

@michaelhyatt Can you try to disable X-Ray tracing in the API Gateway (https://docs.aws.amazon.com/xray/latest/devguide/xray-services-apigateway.html) and see whether it solves the observed problem?

michaelhyatt commented 3 years ago

@AlexanderWert I disabled the API gateway tracing and now it doesn't appear on the X-Ray service map or waterfall diagram: [screenshots]

Kibana APM UI still doesn't show the trace.

These are the documents recorded in Elasticsearch.

## Consumer transaction

```json
{ "_index": "apm-7.13.2-transaction-000001", "_type": "_doc", "_id": "D-ToRXoB6AbCXZTTT6Dt", "_version": 1, "_score": null, "fields": { "transaction.name.text": ["handler.consumer"], "service.framework.version": ["0.16.dev0"], "observer.name": ["instance-0000000009"], "labels.faas_name": ["aws-python-api-worker-project-dev-consumer"], "service.language.name": ["python"], "labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-consumer"], "transaction.sampled": [true], "transaction.id": ["8d38b2d9bee1c20f"], "trace.id": ["60d681143464a3ff1c61880768907b0a"], "processor.event": ["transaction"], "agent.name": ["opentelemetry/python"], "labels.faas_execution": ["4bc4f6f3-71e3-4380-91d7-5277a79be05e"], "event.outcome": ["unknown"], "cloud.region": ["ap-southeast-2"], "service.name": ["aws-python-api-worker-project-dev-consumer"], "service.framework.name": ["opentelemetry.instrumentation.aws_lambda"], "processor.name": ["transaction"], "transaction.duration.us": [605], "labels.faas_version": ["$LATEST"], "observer.version_major": [7], "observer.hostname": ["1e0706f92f11"], "transaction.type": ["custom"], "cloud.provider": ["aws"], "event.ingested": ["2021-06-26T01:21:30.604Z"], "timestamp.us": [1624670489577889], "@timestamp": ["2021-06-26T01:21:29.577Z"], "ecs.version": ["1.8.0"], "observer.type": ["apm-server"], "observer.version": ["7.13.2"], "agent.version": ["1.2.0"], "parent.id": ["e53f6b191e1d0281"], "transaction.name": ["handler.consumer"] }, "highlight": { "trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"] }, "sort": [1624670489577] }
```

## Producer span

```json
{ "_index": "apm-7.13.2-span-000001", "_type": "_doc", "_id": "EeToRXoB6AbCXZTTT6Dt", "_version": 1, "_score": null, "fields": { "span.destination.service.type": ["external"], "span.name": ["HTTP GET"], "destination.port": [443], "observer.name": ["instance-0000000009"], "labels.faas_name": ["aws-python-api-worker-project-dev-producer"], "trace.id": ["60d681143464a3ff1c61880768907b0a"], "span.duration.us": [2386427], "processor.event": ["span"], "agent.name": ["opentelemetry/python"], "destination.address": ["bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com"], "event.outcome": ["success"], "cloud.region": ["ap-southeast-2"], "service.name": ["aws-python-api-worker-project-dev-producer"], "processor.name": ["transaction"], "labels.faas_version": ["$LATEST"], "span.id": ["e5edc4fb3045615a"], "span.subtype": ["http"], "observer.version_major": [7], "observer.hostname": ["1e0706f92f11"], "span.type": ["external"], "cloud.provider": ["aws"], "timestamp.us": [1624670487255663], "@timestamp": ["2021-06-26T01:21:27.255Z"], "ecs.version": ["1.8.0"], "observer.type": ["apm-server"], "observer.version": ["7.13.2"], "agent.version": ["1.2.0"], "parent.id": ["a64ec14bc1536181"], "span.destination.service.name": ["https://bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com"], "span.destination.service.resource": ["bydwx5ss24.execute-api.ap-southeast-2.amazonaws.com:443"] }, "highlight": { "trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"] }, "sort": [1624670487255] }
```

## Producer transaction

```json
{ "_index": "apm-7.13.2-transaction-000001", "_type": "_doc", "_id": "EOToRXoB6AbCXZTTT6Dt", "_version": 1, "_score": null, "fields": { "transaction.name.text": ["handler.producer"], "service.framework.version": ["0.21b0"], "observer.name": ["instance-0000000009"], "labels.faas_name": ["aws-python-api-worker-project-dev-producer"], "service.language.name": ["python"], "labels.faas_id": ["arn:aws:lambda:ap-southeast-2:401722391821:function:aws-python-api-worker-project-dev-producer"], "transaction.sampled": [true], "transaction.id": ["a64ec14bc1536181"], "trace.id": ["60d681143464a3ff1c61880768907b0a"], "processor.event": ["transaction"], "agent.name": ["opentelemetry/python"], "labels.faas_execution": ["4580f63e-2761-408e-99f3-046cb774d7e3"], "event.outcome": ["unknown"], "cloud.region": ["ap-southeast-2"], "service.name": ["aws-python-api-worker-project-dev-producer"], "service.framework.name": ["opentelemetry.instrumentation.requests"], "processor.name": ["transaction"], "transaction.duration.us": [2389360], "labels.faas_version": ["$LATEST"], "observer.version_major": [7], "observer.hostname": ["1e0706f92f11"], "transaction.type": ["custom"], "cloud.provider": ["aws"], "event.ingested": ["2021-06-26T01:21:30.605Z"], "timestamp.us": [1624670487254602], "@timestamp": ["2021-06-26T01:21:27.254Z"], "ecs.version": ["1.8.0"], "observer.type": ["apm-server"], "observer.version": ["7.13.2"], "agent.version": ["1.2.0"], "parent.id": ["7ade6ef5e6220237"], "transaction.name": ["handler.producer"] }, "highlight": { "trace.id": ["@kibana-highlighted-field@60d681143464a3ff1c61880768907b0a@/kibana-highlighted-field@"] }, "sort": [1624670487254] }
```
axw commented 3 years ago

@michaelhyatt I've transferred this to elastic/apm as (AFAICS) there's no bug with the translation of OpenTelemetry spans -- the server is translating them faithfully. It appears that API GW is injecting/mutating traceparent, but is not sending any OTel spans. When we receive an OTel span for Lambda, it has a parent span ID which points to nowhere.

That missing API GW span would theoretically be the trace root, and we base certain features on the trace roots, such as the list of traces. If API GW spans were sent to APM Server, then I expect it would all just work. Seems like missing/broken functionality in API GW to me.

michaelhyatt commented 3 years ago

@axw The latest test was done without API-GW tracing turned on. Are you saying the API-GW is still changing the trace header even though it is not part of X-Ray traces anymore?

axw commented 3 years ago

Yes, it looks like API GW is still adding a traceparent header before both Lambdas. The producer transaction has a field parent.id:7ade6ef5e6220237, and the consumer transaction has parent.id:e53f6b191e1d0281. There are no transactions or spans with these IDs.

AlexanderWert commented 3 years ago

Yes, it looks like API GW is still adding a traceparent header before both Lambdas.

Could it be the Lambda extension that is adding the traceparent (not sending OTel spans, but rather collecting X-Ray-specific spans)? For instance, for the Lambda initialization phase as shown in the screenshot above.

axw commented 3 years ago

Possibly, I don't know. I should have said X-Ray header anyway, since there's no W3C traceparent headers involved here. The bottom line is that we're being told the producer span has a parent which does not exist, and so none of the spans is considered to be the trace root.
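For context, the two propagation formats look like this (the header values below are illustrative, not taken from this issue; the parsing helpers are a sketch, not library code):

```python
# W3C Trace Context header: version-traceid-parentid-flags
traceparent = "00-60d681143464a3ff1c61880768907b0a-a64ec14bc1536181-01"

# AWS X-Ray tracing header, as injected by API Gateway / X-Ray
xray = "Root=1-60d68114-3464a3ff1c61880768907b0a;Parent=a64ec14bc1536181;Sampled=1"

def parse_traceparent(value):
    version, trace_id, parent_id, flags = value.split("-")
    return trace_id, parent_id

def parse_xray(value):
    fields = dict(part.split("=", 1) for part in value.split(";"))
    # X-Ray's Root is "1-<epoch hex>-<random hex>"; dropping the version
    # prefix and dashes yields the 32-hex-char trace id.
    trace_id = "".join(fields["Root"].split("-")[1:])
    return trace_id, fields.get("Parent")

print(parse_traceparent(traceparent))  # ('60d681143464a3ff1c61880768907b0a', 'a64ec14bc1536181')
print(parse_xray(xray))                # ('60d681143464a3ff1c61880768907b0a', 'a64ec14bc1536181')
```

Either way, whoever injects the header determines the parent span ID the Lambda sees; if that injector never emits its own span, the parent points at nothing.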

cyrille-leclerc commented 3 years ago

I come to the same conclusion

| Type (Transaction / Span) | id | parent.id | timestamp.us | name | service.name |
|---|---|---|---|---|---|
| T | a64ec14bc1536181 | 7ade6ef5e6220237 | 1624670487254602 | handler.producer | aws-python-api-worker-project-dev-producer |
| S | e5edc4fb3045615a | a64ec14bc1536181 | 1624670487255663 | HTTP GET | aws-python-api-worker-project-dev-producer |
| T | 8d38b2d9bee1c20f | e53f6b191e1d0281 | 1624670489577889 | handler.consumer | aws-python-api-worker-project-dev-consumer |

The parent.id of the consumer/downstream transaction is not the span.id of the parent HTTP GET span as expected. It's e53f6b191e1d0281 instead of e5edc4fb3045615a.
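A quick check of that mismatch, with the two IDs copied from the documents above:

```python
# IDs copied from the recorded Elasticsearch documents.
producer_http_span_id = "e5edc4fb3045615a"   # span.id of the HTTP GET span (S)
consumer_parent_id = "e53f6b191e1d0281"      # parent.id of handler.consumer (T)

# With intact context propagation the consumer transaction would name the
# HTTP GET span as its parent; instead it points at an ID that was never
# recorded, so the chain is broken between the two Lambdas.
print(consumer_parent_id == producer_http_span_id)  # False
```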

michaelhyatt commented 3 years ago

I tested this setup with Jaeger and it is able to handle the case quite well, although it displayed warnings about the missing parents.

[screenshots]