aws / aws-xray-sdk-java

The official AWS X-Ray Recorder SDK for Java.
Apache License 2.0

S3 TransferManager incompatible with AWS X-Ray #313

Open Zhenye-Na opened 2 years ago

Zhenye-Na commented 2 years ago

Hello

We currently use X-Ray for the services we own. One of our APIs involves file transfers, so I added a dependency on the S3 TransferManager. However, this throws an X-Ray "SegmentNotFoundException".

I spent a little time looking for the root cause. It turns out that TransferManager creates a thread pool, and X-Ray is not able to gather context for the threads that TransferManager creates.

I am wondering whether a solution for this already exists. I have checked the following resources, but had no luck.

resources:

  1. https://docs.aws.amazon.com/xray/latest/devguide/xray-sdk-java-multithreading.html
    1. https://github.com/aws-samples/eb-java-scorekeep/blob/xray/src/main/java/scorekeep/MoveFactory.java#L70-L79
  2. https://stackoverflow.com/questions/53841672/aws-xray-sdk-issue-failed-to-begin-subsegment-named-amazon-s3-segment-cannot
  3. https://github.com/aws/aws-sdk-java/issues/1572

willarmiros commented 2 years ago

Hi @Zhenye-Na,

Thank you for raising this. You're on the right track. The X-Ray SDK stores segment context in a ThreadLocal and uses that context to capture outgoing AWS SDK requests and generate a subsegment for each of them. If no context is available, the SDK throws a SegmentNotFoundException. If TransferManager creates a thread pool and uses new threads to send requests, the X-Ray SDK will attempt to capture those requests and fail because the ThreadLocal is empty on those threads, causing this exception.
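To illustrate the ThreadLocal behaviour described above, here is a minimal sketch that uses only the JDK. `CURRENT_SEGMENT` is a hypothetical stand-in for the SDK's per-thread segment storage, not the real X-Ray class; the point is only that a value set on the calling thread is not visible on a thread owned by a pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadLocalContextDemo {
    // Hypothetical stand-in for the SDK's thread-local segment context.
    static final ThreadLocal<String> CURRENT_SEGMENT = new ThreadLocal<>();

    // Sets a "segment" on the calling thread, then asks a pool thread what it sees.
    static String segmentSeenByPoolThread() {
        CURRENT_SEGMENT.set("request-segment");
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = () -> CURRENT_SEGMENT.get();
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
            CURRENT_SEGMENT.remove();
        }
    }

    public static void main(String[] args) {
        // The pool thread never had a value set, so it reads null.
        System.out.println("pool thread sees: " + segmentSeenByPoolThread());
    }
}
```

In this sketch the pool thread reads `null`, which corresponds to the "no context available" situation that leads to the SegmentNotFoundException.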

To ignore this error, you can set the environment variable AWS_XRAY_CONTEXT_MISSING=IGNORE_ERROR, though of course this will cause some requests to not be instrumented. I'm not sure whether the AWS SDK exposes enough of its implementation for us to hook into the new thread pool and capture these requests, nor do I think we'd have the bandwidth to extend our instrumentation to support this case. However, I would recommend opening this feature request in the OpenTelemetry Java repo as well, since they have an AWS SDK instrumentation that could be extended to support this.

Zhenye-Na commented 2 years ago

Hello @willarmiros

Thank you so much for your reply and for confirming the results of my experiments. After that, we decided to temporarily bypass the SegmentNotFoundException by using the low-level API that the S3 team provides for multipart uploads, and X-Ray has worked well with it so far.

I will open a feature request in the repo you mentioned above. However, I am not very familiar with the terminology or the detailed process for solving this problem. Do you mind if I cc you later in the new issue I raise for the OpenTelemetry team?

Thank you so much!

Merry Xmas 🎅

Zhenye-Na commented 2 years ago

Adding some details from my own experiments for anyone who comes across this issue:

  1. Instead of setting the env var, I did AWSXRay.withContextMissingStrategy(IgnoreErrorXXXStrategy). This does not throw any exceptions, which is nice, but the request timed out.
  2. In the code where TransferManager creates its thread pool, I tried to retrieve the traceEntity of the GlobalRecorder and call beginSubsegment() in each thread that TransferManager created. The result was either a timeout or an exception.
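For readers unfamiliar with the idea behind experiment 2, here is a generic sketch of carrying a caller's thread-local context into a pool thread. It uses only the JDK: `CURRENT_SEGMENT` and `withCallerContext` are hypothetical names for illustration, not X-Ray SDK APIs. With the real SDK, the multithreading guide linked earlier in this thread describes a similar pattern based on the recorder's trace entity. This is not a confirmed fix for TransferManager; as noted above, the attempt against its internal pool timed out or threw.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ContextPropagationSketch {
    // Hypothetical stand-in for the SDK's thread-local trace entity.
    static final ThreadLocal<String> CURRENT_SEGMENT = new ThreadLocal<>();

    // Captures the caller's context at wrap time and restores it around the
    // task on whichever thread eventually runs it.
    static <T> Callable<T> withCallerContext(Callable<T> task) {
        final String captured = CURRENT_SEGMENT.get();
        return () -> {
            String previous = CURRENT_SEGMENT.get();
            CURRENT_SEGMENT.set(captured);
            try {
                return task.call();
            } finally {
                // Put the worker thread back the way it was before the task ran.
                if (previous == null) {
                    CURRENT_SEGMENT.remove();
                } else {
                    CURRENT_SEGMENT.set(previous);
                }
            }
        };
    }

    // Runs a wrapped task on a pool thread and reports the context it observed.
    static String segmentSeenByWrappedTask() {
        CURRENT_SEGMENT.set("request-segment");
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = () -> CURRENT_SEGMENT.get();
            return pool.submit(withCallerContext(task)).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
            CURRENT_SEGMENT.remove();
        }
    }

    public static void main(String[] args) {
        System.out.println("wrapped task sees: " + segmentSeenByWrappedTask());
    }
}
```

This pattern only works when you control how tasks are handed to the pool, which is the open question for TransferManager in this issue.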

willarmiros commented 2 years ago

> Do you mind if I cc you later in the new issue I raise for the OpenTelemetry team?

No problem

> but the request timed out.

Hmm so just adding X-Ray instrumentation and the ignore error strategy caused the request to time out? That's strange. It might have something to do with how transferManager works. Feel free to post some reproduction code, but glad you have a workaround for now!

Zhenye-Na commented 2 years ago

https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/6104

Issue created in OpenTelemetry Java; let's see how this goes.

Zhenye-Na commented 2 years ago

I also raised a ticket in AWS SDK v2 to see whether there is a chance to get this fixed:

https://github.com/aws/aws-sdk-java-v2/issues/3217

Zhenye-Na commented 11 months ago

I am wondering whether this issue will be included in the roadmap?

Or are there any workarounds if we would like to continue using X-Ray in a multi-threaded environment?