gchq / sleeper

A cloud-native, serverless, scalable, cheap key-value store
Apache License 2.0

Instrument for AWS X-Ray with OpenTelemetry #1958

Open patchwork01 opened 4 months ago

patchwork01 commented 4 months ago

Background

Split from:

Description

We'd like to instrument our lambdas and Fargate tasks for AWS X-Ray, so that we can trace execution in a few cases:

Analysis

OpenTelemetry instrumentation

We tried instrumenting with the ADOT Lambda layer:

https://aws-otel.github.io/docs/getting-started/lambda

This didn't fit into the code size limit for our lambdas.

We have a couple of alternatives to make it fit:

Alternative libraries

We could use the AWS X-Ray SDK or OpenTelemetry:

https://docs.aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html

Given that all of Sleeper is deployed in AWS, and all the entrypoints into Sleeper are packaged specifically for AWS, if all we need is default auto-instrumentation, it seems reasonable to use AWS' own SDK.

We can see how much information we get from adding the AWS X-Ray SDK in all deployed artifacts, and enabling tracing on lambdas attached to CloudWatch rules.

See the AWS documentation for AWS X-Ray instrumentation:

https://docs.aws.amazon.com/lambda/latest/dg/services-xray.html

https://docs.aws.amazon.com/lambda/latest/dg/java-tracing.html#java-xray-sdk

https://docs.aws.amazon.com/xray/latest/devguide/xray-sdk-java.html

We split out a separate issue for the option to instrument with the AWS X-Ray SDK:

Modules to instrument

We'll need to make sure that every module we instrument is not depended on by other modules, otherwise we'd add X-Ray instrumentation there unintentionally. It's probably not a problem if X-Ray gets added to the system test drivers module, which depends on several modules we'll need to instrument.

The bulk import runner will run inside EMR, which seems like it might not work well with X-Ray. We can handle that separately if we want it later.

Ingest tasks are built from the ingest-runner module, but there are other modules that also depend on it, including the Trino plugin and the bulk import runner. We'll want to avoid adding X-Ray to those, so we'll need to split a separate module out of ingest-runner for the code that will run in ECS for the ingest task. We can make that a separate issue.
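To make the constraint above concrete, here is a minimal sketch of a check that lists modules which would pick up instrumentation unintentionally because they depend, directly or transitively, on an instrumented module. The dependency graph is a simplified illustration rather than Sleeper's real build, and the `ingest-task` module name, the `core` module, and the `InstrumentationLeakCheck` class are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class InstrumentationLeakCheck {

    /**
     * Returns modules that depend (directly or transitively) on an instrumented
     * module, but are neither instrumented themselves nor explicitly allowed.
     */
    static Set<String> unintendedModules(
            Map<String, List<String>> dependenciesByModule,
            Set<String> instrumented,
            Set<String> allowed) {
        Set<String> unintended = new TreeSet<>();
        for (String module : dependenciesByModule.keySet()) {
            if (instrumented.contains(module) || allowed.contains(module)) {
                continue;
            }
            if (dependsOnAny(module, instrumented, dependenciesByModule)) {
                unintended.add(module);
            }
        }
        return unintended;
    }

    // Depth-first walk over the dependencies of the starting module.
    private static boolean dependsOnAny(
            String start, Set<String> targets, Map<String, List<String>> dependenciesByModule) {
        Deque<String> toVisit = new ArrayDeque<>(dependenciesByModule.getOrDefault(start, List.of()));
        Set<String> seen = new HashSet<>();
        while (!toVisit.isEmpty()) {
            String module = toVisit.pop();
            if (!seen.add(module)) {
                continue; // already visited
            }
            if (targets.contains(module)) {
                return true;
            }
            toVisit.addAll(dependenciesByModule.getOrDefault(module, List.of()));
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative graph only: X-Ray added directly to ingest-runner.
        Map<String, List<String>> before = Map.of(
                "core", List.of(),
                "ingest-runner", List.of("core"),
                "trino", List.of("ingest-runner"),
                "bulk-import-runner", List.of("ingest-runner"),
                "system-test-drivers", List.of("ingest-runner", "trino"));
        System.out.println(unintendedModules(
                before, Set.of("ingest-runner"), Set.of("system-test-drivers")));
        // prints [bulk-import-runner, trino]

        // Same graph with a hypothetical ingest-task module split out and instrumented instead.
        Map<String, List<String>> after = Map.of(
                "core", List.of(),
                "ingest-runner", List.of("core"),
                "ingest-task", List.of("ingest-runner"),
                "trino", List.of("ingest-runner"),
                "bulk-import-runner", List.of("ingest-runner"),
                "system-test-drivers", List.of("ingest-task", "trino"));
        System.out.println(unintendedModules(
                after, Set.of("ingest-task"), Set.of("system-test-drivers")));
        // prints []
    }
}
```

In the first graph, instrumenting `ingest-runner` directly leaks into the Trino plugin and the bulk import runner. Splitting out a leaf module for the ECS task leaves only the system test drivers affected, which we've said is acceptable.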

We could instrument the custom CDK resources but that seems unnecessary.

Modules to instrument:

Instrumentation libraries detail

We started by enabling auto-instrumentation with the AWS X-Ray SDK. This gives useful information, but to get more granular traces, e.g. of state store methods, we would need to add the SDK as a dependency of other modules. To report on individual method calls, the AWS X-Ray SDK requires at minimum a dependency on aws-xray-recorder-sdk-core. This brings in aws-java-sdk-xray and aws-java-sdk-core, which seems a little excessive.

The AWS X-Ray SDK for Java also requires use of the X-Ray daemon, which we deployed as a sidecar to our Fargate task. This is a bit fiddly, as the sidecar's memory and CPU requirements can be configured independently, and must be for EC2.

The AWS X-Ray SDK also isn't easy to disable once you've added it as a dependency. The library for setting up auto-instrumentation is a Maven dependency, and it's relatively heavyweight.

The AWS Distro for OpenTelemetry lets you report to AWS X-Ray using the OpenTelemetry libraries instead, which include a much more minimal API. It doesn't require the X-Ray daemon, and the auto-instrumentation is in agent code which doesn't need to be added as a dependency. We can try using that instead:

https://docs.aws.amazon.com/xray/latest/devguide/xray-instrumenting-your-app.html#xray-instrumenting-opentel

https://docs.aws.amazon.com/lambda/latest/dg/java-tracing.html#java-adot

patchwork01 commented 3 months ago

On hold because we might want to leave this until later if we don't need detailed tracing to test the transaction log state store.