aws / aws-xray-sdk-python

AWS X-Ray SDK for the Python programming language
Apache License 2.0
327 stars 143 forks source link

Support boto3 S3 transfer manager #52

Open haotianw465 opened 6 years ago

haotianw465 commented 6 years ago

boto3 has a few high-level S3 API calls like upload_file, download_file which depends on s3transfer to perform multi-threaded object puts/gets to increase through put. This will result in a SegmentNotFoundException as the X-Ray recorder tries to capture the "real" S3 API call but it loses the context because the actual http call is in a worker thread from the thread pool.

The S3 transfer manager under the hood uses futures.ThreadPoolExecutor per https://github.com/boto/s3transfer/blob/develop/s3transfer/futures.py#L370 but there is no proper API on boto3 client level to propagate context from user code. And requiring user code changes is not a good customer experience.

The SDK should somehow monkey-patch S3 transfer manager so that it automatically propagate context to all worker threads so each http outbound call is captured and attached to its parent segment or subsegment properly.

Library django-storages or any storage library that supports S3 as back-end and uses boto3 certain APIs might face the same issue.

More detailed technical deep dive could be found here: https://github.com/aws/aws-xray-sdk-python/issues/4.

rbu commented 5 years ago

I just spent over a day digging deep into why the segment could not be found (suspecting other components at first). The current situation has a bad failure mode, the "additive" nature of "patch_all" actually breaks the basic AWS API component, boto3 with an error that points you into a completely wrong direction (namely, that no segment is defined, when clearly, there is).

My minimal reproducer:

from aws_xray_sdk.core import patch_all, xray_recorder
patch_all()
xray_recorder.begin_segment("pytest")

import boto3
bucket = boto3.resource('s3').create_bucket(Bucket="test23243534534634")
obj = bucket.Object('file.txt')
obj.put(Body="body")
assert b"body" == obj.get()['Body'].read()

import tempfile
with tempfile.NamedTemporaryFile(mode="r") as f:
    obj.download_file(f.name)
    assert "body" == open(f.name).read()

raises:

cannot find the current segment/subsegment, please make sure you have a segment open
---------------------------------------------------------------------------
SegmentNotFoundException                  Traceback (most recent call last)
<ipython-input-7-677aa688eccd> in <module>
      1 import tempfile
      2 with tempfile.NamedTemporaryFile(mode="r") as f:
----> 3     obj.download_file(f.name)
      4     assert "body" == open(f.name).read()

~/.virtualenvs/project/lib/python3.7/site-packages/boto3/s3/inject.py in object_download_file(self, Filename, ExtraArgs, Callback, Config)
    312     return self.meta.client.download_file(
    313         Bucket=self.bucket_name, Key=self.key, Filename=Filename,
--> 314         ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
    315 
    316 

~/.virtualenvs/project/lib/python3.7/site-packages/boto3/s3/inject.py in download_file(self, Bucket, Key, Filename, ExtraArgs, Callback, Config)
    170         return transfer.download_file(
    171             bucket=Bucket, key=Key, filename=Filename,
--> 172             extra_args=ExtraArgs, callback=Callback)
    173 
    174 

~/.virtualenvs/project/lib/python3.7/site-packages/boto3/s3/transfer.py in download_file(self, bucket, key, filename, extra_args, callback)
    305             bucket, key, filename, extra_args, subscribers)
    306         try:
--> 307             future.result()
    308         # This is for backwards compatibility where when retries are
    309         # exceeded we need to throw the same error from boto3 instead of

~/.virtualenvs/project/lib/python3.7/site-packages/s3transfer/futures.py in result(self)
    104             # however if a KeyboardInterrupt is raised we want want to exit
    105             # out of this and propogate the exception.
--> 106             return self._coordinator.result()
    107         except KeyboardInterrupt as e:
    108             self.cancel()

~/.virtualenvs/project/lib/python3.7/site-packages/s3transfer/futures.py in result(self)
    263         # final result.
    264         if self._exception:
--> 265             raise self._exception
    266         return self._result
    267 

~/.virtualenvs/project/lib/python3.7/site-packages/s3transfer/tasks.py in _main(self, transfer_future, **kwargs)
    253             # Call the submit method to start submitting tasks to execute the
    254             # transfer.
--> 255             self._submit(transfer_future=transfer_future, **kwargs)
    256         except BaseException as e:
    257             # If there was an exception raised during the submission of task

~/.virtualenvs/project/lib/python3.7/site-packages/s3transfer/download.py in _submit(self, client, config, osutil, request_executor, io_executor, transfer_future, bandwidth_limiter)
    343                 Bucket=transfer_future.meta.call_args.bucket,
    344                 Key=transfer_future.meta.call_args.key,
--> 345                 **transfer_future.meta.call_args.extra_args
    346             )
    347             transfer_future.meta.provide_transfer_size(

~/.virtualenvs/project/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    355                     "%s() only accepts keyword arguments." % py_operation_name)
    356             # The "self" in this scope is referring to the BaseClient.
--> 357             return self._make_api_call(operation_name, kwargs)
    358 
    359         _api_call.__name__ = str(py_operation_name)

~/.virtualenvs/project/lib/python3.7/site-packages/aws_xray_sdk/ext/botocore/patch.py in _xray_traced_botocore(wrapped, instance, args, kwargs)
     38         name=service,
     39         namespace='aws',
---> 40         meta_processor=aws_meta_processor,
     41     )

~/.virtualenvs/project/lib/python3.7/site-packages/aws_xray_sdk/core/recorder.py in record_subsegment(self, wrapped, instance, args, kwargs, name, namespace, meta_processor)
    415                 self.context.put_segment(segment)
    416 
--> 417         subsegment = self.begin_subsegment(name, namespace)
    418 
    419         exception = None

~/.virtualenvs/project/lib/python3.7/site-packages/aws_xray_sdk/core/recorder.py in begin_subsegment(self, name, namespace)
    282         """
    283 
--> 284         segment = self.current_segment()
    285         if not segment:
    286             log.warning("No segment found, cannot begin subsegment %s." % name)

~/.virtualenvs/project/lib/python3.7/site-packages/aws_xray_sdk/core/recorder.py in current_segment(self)
    265         same thread.
    266         """
--> 267         entity = self.get_trace_entity()
    268         if self._is_subsegment(entity):
    269             return entity.parent_segment

~/.virtualenvs/project/lib/python3.7/site-packages/aws_xray_sdk/core/recorder.py in get_trace_entity(self)
    367         A pass through method to ``context.get_trace_entity()``.
    368         """
--> 369         return self.context.get_trace_entity()
    370 
    371     def set_trace_entity(self, trace_entity):

~/.virtualenvs/project/lib/python3.7/site-packages/aws_xray_sdk/core/context.py in get_trace_entity(self)
     91         """
     92         if not getattr(self._local, 'entities', None):
---> 93             return self.handle_context_missing()
     94 
     95         return self._local.entities[-1]

~/.virtualenvs/project/lib/python3.7/site-packages/aws_xray_sdk/core/context.py in handle_context_missing(self)
    116         if self.context_missing == 'RUNTIME_ERROR':
    117             log.error(MISSING_SEGMENT_MSG)
--> 118             raise SegmentNotFoundException(MISSING_SEGMENT_MSG)
    119         else:
    120             log.error(MISSING_SEGMENT_MSG)

SegmentNotFoundException: cannot find the current segment/subsegment, please make sure you have a segment open
kut commented 4 years ago

Hey @haotianw465 thanks for the detail, been grappling with this issue for a long time without knowing. Any suggestion on what we should do to allow tracing with S3 TransferManager in the meantime? Or is it such a headache that you'd recommend we skip tracing when using S3 TransferManager for the time being?

SpoonOfDoom commented 3 years ago

Is there any news on this, or an easy workaround? I'd like to keep tracing enabled if possible.

goudhabuddha commented 2 years ago

I was recently burned by this. Any update on a fix?

wangzlei commented 2 years ago

Unfortunately, xray sdk and other popular sdks such as OpenTelemetry don't support this case yet. S3 APIs download_file and upload_file will split large file to small files to run in parallel, the trace context are not propagated to these sub threads then the instrument failed. Need a feature to support these APIs.

Instead of supporting new feature in XRay SDKs, we want to support it in OpenTelemetry first. Will cut a request in OTel community.

eercanayar commented 2 years ago

I've spent ~3 hours to troubleshoot this. I would vote for giving a meaningful error message at least, to not waste anyone's hours.

radimsuckr commented 2 years ago

Hi @eercanayar, did you find any workaround for this?

I'd like to disable X-Ray tracing only for the S3 client but I'm not entirely sure that's currently possible.

SpoonOfDoom commented 2 years ago

@radimsuckr

I have done something like this to temporarily disable X-Ray, and at least in my case it was enough to get around the issue. Maybe it helps?

from aws_xray_sdk import global_sdk_config

def disable_xray():
    os.environ['AWS_XRAY_SDK_ENABLED'] = "0"
    global_sdk_config.set_sdk_enabled(False)

def enable_xray():
    os.environ['AWS_XRAY_SDK_ENABLED'] = "1"
    global_sdk_config.set_sdk_enabled(True)

And then call where needed.

disable_xray()
do_s3_stuff()
enable_xray()

It's a bit fiddly, but at the time I didn't find a better workaround.

eercanayar commented 2 years ago

Hi @radimsuckr,

I've migrated to put-object API since my goal was to upload a single file. I'll check this issue internally to see if our SDK team can resolve this, at least not triggering an Exception for non-supported APIs.

radimsuckr commented 2 years ago

@SpoonOfDoom thank you, I'm trying something similar but I'm afraid I won't get anywhere because of multiple threads overwriting each other's value of the env variable. :disappointed:

@eercanayar sadly that's not possible for me, we need the multipart download/upload. Of course the best outcome would be a fix for this issue, but acceptable workaround for now is not triggering the error as you mentioned. Thank you for checking.

We set AWS_CONTEXT_MISSING to LOG_ERROR so it doesn't kill the app, but the amount of logs even from simple Django collectstatic command is absolutely ridiculous. One other solution that comes to my mind is ignoring all logs from aws-xray-sdk because frankly they're quite useless...

radimsuckr commented 2 years ago

Hi @eercanayar, did you have by any chance time for the check you mentioned? Thanks!

NathanielRN commented 2 years ago

Hi all,

Sorry for the frustration with this issue. I've filed #334 to suggest the short term fix that was mentioned. We hope to get to it soon.

In the meantime, even though as @wangzlei mentioned above this is a hard problem that OpenTelemetry (OTel) Python has not even solved with their AWS instrumentation, we still recommend onboarding with them because OTel Python is getting a lot of support and it's compatibility with AWS X-Ray has been GA since September of 2021. It should be easy to on-board, should give you more features and easier support 🙂

eercanayar commented 2 years ago

Hi @radimsuckr, I was away for several weeks but we were discussing the most optimal short term fix, internally. Now, we're discussing on #334 to have a third option as IGNORE to avoid flood of logs. In this case, you can enable xray-sdk or its logs for your apps happily. We hope to get to it soon.

Thanks @NathanielRN for your bias for action! 🚀

bitshop commented 8 months ago

the patch_all method wrongly patches some of the aws sdk. In order to use xray with threads you have to pass the thread context to the new threads, let's say I have a function:

def example_function(test, trace_entity=None):
        if trace_entity is not None:
              xray_recorder.set_trace_entity(trace_entity)
        return test

# When you invoke the function:
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        executor.submit(example_function, test, trace_entity=xray_recorder.get_trace_entity())

If you don't do this you end up with errors similar to this.

This issue may not seem urgent but my challenge is this is causing a hard time finding where our code is breaking traces, since I can't troubleshoot OUR exceptions for missing context - I can't even get to them even with several code changes already trying to work around this. This is a pretty significant bug when basic AWS SDK functions don't work after calling patch_all.

For now my work around will be to remove patch_all, but I also need AWS SDK timing to know if S3 one zone express solves some performance problems I'm working on :-(

Hopefully someone familiar with patch_all can test my above ideas and maybe get a PR made.