aws / aws-sdk-js

AWS SDK for JavaScript in the browser and Node.js
https://aws.amazon.com/developer/language/javascript/
Apache License 2.0
7.59k stars 1.55k forks source link

Header signing error affecting opentelemetry instrumentation #4472

Closed Ankcorn closed 6 months ago

Ankcorn commented 1 year ago

Describe the bug

Hey, we ran into this weird issue with aws-sdk request retries failing when instrumented with open telemetry.

The error we spotted was

The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

https://github.com/open-telemetry/opentelemetry-js/issues/3922

I recreated the error in our public sandbox so its clear what went wrong https://sandbox.baselime.io/baselime/sandbox/datasets/otel/traces/dc4e6324537d09513348ab781fb318ff

The traceparent header is updated for the retried http request but the x-amz-security-token is not. This causes requests to fail.

Expected Behavior

Would expect the the request to not error

Current Behavior

The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

Reproduction Steps

When using @opentelemetry/instrumentation-http version 0.35.1 or higher, whenever aws-sdk v2.x.x retries a request (e.g.: rate limit or timeout happens), the request fails, because aws-sdk adds the traceparent to the SignedHeaders, and as a result AWS throws a InvalidSignatureException 400 back.

Possible Solution

Ignore traceparent header like the x-amz-trace-id is

https://github.com/aws/aws-sdk-js/blob/7d80de6414a10a349b912db23d1fd0de0ad60a3e/lib/signers/v4.js#L198

Additional Information/Context

No response

SDK version used

2.1374.0

Environment details (OS name and version, etc.)

AWS Lambda, node 16

benkehoe commented 1 year ago

Interesting; the list of headers to exclude from signing in JSv2: https://github.com/aws/aws-sdk-js/blob/7d80de6414a10a349b912db23d1fd0de0ad60a3e/lib/signers/v4.js#L191-L199 is significantly longer than in botocore: https://github.com/boto/botocore/blob/a9397847a3150eb619f3457e14557afd12ff13ab/botocore/auth.py#L61-L65 open telemetry headers aren't in there either.

Maybe this issue belongs in the cross-SDK repo?

Ankcorn commented 1 year ago

Thanks @benkehoe

It is surprising that they differ significantly. I think for now this issue only crops up with the js client because the javascript https://github.com/open-telemetry/opentelemetry-js/tree/main/experimental/packages/opentelemetry-instrumentation-http adds the traceparent header to every request.

Adding an AWS specific code path to the HTTP instrumentation felt weird so I raised this here

benkehoe commented 1 year ago

I'm confused. Is the OpenTelemetry JavaScript implementation doing something that is not standard for OpenTelemetry? If so, it seems like the OpenTelemetry implementation should be changed. If the OpenTelemetry JavaScript implementation is following the OpenTelemetry spec, we should expect that, for example, a Python implementation of OpenTelemetry would also cause signing issues for the Python SDK (boto3)

Ankcorn commented 1 year ago

I'm always confused :laughing:

The reason I opened this here is that I think the unexpected behaviour is coming from the aws-sdk rather than the instrumentation-http package.

The instrumentation-http package passes the traceparent to all downstream HTTP endpoints. There are conversations about being able to allow/deny propagation for specific endpoints https://github.com/open-telemetry/opentelemetry-js/issues/1698 but I don't think anyone is actively working on this. This would still lead to friction for new open telemetry users wanting to instrument aws-sdk requests.

When the request retries at the moment the signature is not recalculated for subsequent HTTP requests even though some of the headers are now different. I think this is because of the signature caching logic. Does boto cache the signature? I don't see this in the aws-sdk-js v3 which might be why we don't see the problem there

edit:

@benkehoe I think boto3 doesn't cache the token so I would guess when the request is retried a new signature is created https://github.com/boto/botocore/blob/1fad151309322a88641571ac188af8e68ba56a59/botocore/auth.py#L444C23-L444C23

benkehoe commented 1 year ago

ah, a signature caching issue makes sense

RanVaknin commented 1 year ago

Hi @Ankcorn,

Thanks for raising the issue. Aside from botocore did you notice this issue with other SDKs?
Newer SDKs have a more uniform list of unsignable headers as they all should follow the same type of internal RFC like paper. Older SDKs implementations might differ.

While we look into adding the traceparent header into the the list, you can workaround it by doing:

AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent');

Thanks, Ran~

Ankcorn commented 1 year ago

Hey @RanVaknin

Thanks for the workaround, that helps a lot :) I have confirmed that it works.

I have taken a good look at botocore and the aws-sdk-js v3.

It helped me realise here that the root cause for the js v2 issue was not the headers but the caching of the signature, which we only noticed because of the traceheader changinf. The js v2 SDK seems to be the only one that does this.

Hope that update helps :)

mugli commented 7 months ago

This PR seems to fix the issue: https://github.com/open-telemetry/opentelemetry-js/pull/4346 which is released in @opentelemetry/instrumentation-http v0.46.0: https://github.com/open-telemetry/opentelemetry-js/pull/4358