aws / aws-xray-sdk-python

AWS X-Ray SDK for the Python programming language
Apache License 2.0

Custom emitter based on boto3 creates an infinite loop in the SDK #379

Open nspsngh opened 1 year ago

nspsngh commented 1 year ago

Hello,

I need to instrument a Python application (specifically, an AWS Glue ETL application) with the AWS X-Ray SDK, but I cannot have the X-Ray daemon running. Consequently, I must provide a custom emitter that stands in for the default UDPEmitter. I can go into more detail on the implementation if required; the gist is that it works as intended until the patch_all method is called. The patcher patches botocore and other lower-level libraries, most pertinently httplib.
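For readers who want a concrete picture, a boto3-based emitter of the kind described might look roughly like the sketch below. This is not the author's implementation (which is not shown in the thread): the class body is an assumption, the client is injected so that the sketch does not import boto3, and it assumes the recorder calls send_entity() with an entity that exposes serialize(), as the default UDPEmitter expects.

```python
class HttpEmitter:
    """Hypothetical emitter that sends segments through an X-Ray API client
    instead of over UDP to the daemon."""

    def __init__(self, client):
        # Assumption: `client` behaves like boto3.client("xray").
        self._client = client

    def send_entity(self, entity):
        # Assumption: the entity exposes serialize() returning a JSON
        # document, which is what the default UDPEmitter relies on.
        document = entity.serialize()
        self._client.put_trace_segments(TraceSegmentDocuments=[document])

    def set_daemon_address(self, address):
        # No daemon is involved, so the address is deliberately ignored.
        pass
```

Such an emitter would presumably be passed to the recorder through the emitter argument of xray_recorder.configure(). Note that put_trace_segments is itself a botocore call, which is what makes the interaction with patch_all relevant.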

After the patcher has run, any attempted PutTraceSegments API call sends the application into an infinite loop when SSO is enabled. More specifically, the loop appears right after a call to the GetRoleCredentials API. Although I have not tested it, I suspect the same condition will occur when SSO is not enabled and other types of credentials are used.

The call chain looks roughly like this (my emitter is called HttpEmitter):

AWSXRayRecorder.capture() -> AWSXRayRecorder.record_subsegment() -> HttpEmitter.send_entity() -> GetRoleCredentials (API) -> AWSXRayRecorder.capture() -> AWSXRayRecorder.record_subsegment() -> HttpEmitter.send_entity() ...
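The chain above can be reproduced in miniature without the SDK. The toy model below is only an illustration: ToyRecorder and ToyEmitter are invented names, the depth cap exists purely so the demo terminates, and the real SDK's control flow is considerably more involved.

```python
class ToyRecorder:
    """Toy stand-in for the recorder: every traced call emits one entity."""

    def __init__(self, emitter, excluded=()):
        self.emitter = emitter
        self.excluded = set(excluded)

    def traced_api_call(self, operation, depth=0):
        if operation in self.excluded:
            return  # the call goes through untraced, so nothing is emitted
        self.emitter.send_entity(operation, self, depth)


class ToyEmitter:
    MAX_DEPTH = 50  # artificial cap so the demonstration terminates

    def __init__(self):
        self.sent = 0

    def send_entity(self, operation, recorder, depth):
        if depth >= self.MAX_DEPTH:
            raise RecursionError("emitter re-entered itself")
        # Sending needs credentials, and fetching them is itself a patched call.
        recorder.traced_api_call("GetRoleCredentials", depth + 1)
        self.sent += 1


def run(excluded=()):
    emitter = ToyEmitter()
    recorder = ToyRecorder(emitter, excluded)
    try:
        recorder.traced_api_call("GetObject")
    except RecursionError:
        return "loop"
    return f"sent {emitter.sent}"


print(run())                        # loop
print(run({"GetRoleCredentials"}))  # sent 1
```

With no exclusions the emitter keeps re-entering itself; once the credential call is excluded from tracing, the single entity is sent and the chain ends.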

In the botocore patcher module at aws_xray_sdk/ext/botocore/patch.py, excluding GetRoleCredentials from tracing resolves the issue. In other words, if GetRoleCredentials is excluded from patching, the custom emitter based on boto3 works as expected.

def _xray_traced_botocore(wrapped, instance, args, kwargs):
    service = instance._service_model.metadata["endpointPrefix"]
    if service == 'xray':
        # skip tracing for SDK built-in sampling pollers
        if ('GetSamplingRules' in args or
            'GetSamplingTargets' in args or
                'PutTraceSegments' in args):
            return wrapped(*args, **kwargs)

    if 'GetRoleCredentials' in args:
        return wrapped(*args, **kwargs)

    return xray_recorder.record_subsegment(
        wrapped, instance, args, kwargs,
        name=service,
        namespace='aws',
        meta_processor=aws_meta_processor,
    )

Specifically, this check is added:

    if 'GetRoleCredentials' in args:
        return wrapped(*args, **kwargs)

However, I am not certain whether this is in fact the correct solution or whether the loop is symptomatic of a more fundamental problem elsewhere, hence the ticket. I say this because the infinite loop can be reproduced by patching httplib alone, although that may well be for the same reason as above. To move forward, I have for the moment replaced the botocore patcher with a custom implementation that includes the exclusion shown.
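If the underlying problem is re-entrancy (the emitter triggering traced calls while it is sending), one alternative to excluding individual operations would be a guard inside the emitter itself. The sketch below is an assumption and has not been tested against the SDK: ReentrancyGuardedEmitter is a hypothetical wrapper that drops any entity produced while a send is already in flight on the same thread.

```python
import threading


class ReentrancyGuardedEmitter:
    """Hypothetical wrapper: forwards entities to an inner emitter, but drops
    entities that are produced while a send is already in progress."""

    def __init__(self, inner):
        self._inner = inner
        self._state = threading.local()  # one "sending" flag per thread

    def send_entity(self, entity):
        if getattr(self._state, "sending", False):
            return  # nested call caused by the send itself: drop it
        self._state.sending = True
        try:
            self._inner.send_entity(entity)
        finally:
            # Always clear the flag, even if the inner send raises.
            self._state.sending = False
```

The trade-off is that subsegments describing the emitter's own API traffic are silently lost, which may or may not be acceptable. It also does not prevent the recorder from creating those nested subsegments in the first place, so excluding the operations at patch time may still be the cleaner fix.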

Further guidance is appreciated.

srprash commented 1 year ago

Hi @nsp-aws, I think your solution of excluding GetRoleCredentials is correct in this case. From the SDK side, it may be difficult to identify every AWS operation that a custom emitter might call. One solution that could work is to let users provide a set of operations to ignore when botocore is patched, similar to what has been done for the httplib patch.
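A user-configurable ignore set along these lines could be sketched as follows. Everything here is hypothetical: add_ignored_operations, reset_ignored_operations and should_trace are invented names for illustration, not existing SDK functions, and the sketch assumes the operation name is the first positional argument of the wrapped botocore call.

```python
# Operations the existing wrapper already skips for the SDK's own X-Ray traffic.
_BUILTIN_XRAY_SKIPS = {"GetSamplingRules", "GetSamplingTargets", "PutTraceSegments"}

# Hypothetical user-configurable registry of operations to leave untraced.
_ignored_operations = set()


def add_ignored_operations(*operation_names):
    """Register operation names that the botocore wrapper should not trace."""
    _ignored_operations.update(operation_names)


def reset_ignored_operations():
    """Clear all user-registered exclusions."""
    _ignored_operations.clear()


def should_trace(service, args):
    """Return False when the wrapped call should bypass record_subsegment."""
    # Assumption: args[0] is the operation name passed to the wrapped call.
    operation = args[0] if args else None
    if service == "xray" and operation in _BUILTIN_XRAY_SKIPS:
        return False
    return operation not in _ignored_operations
```

In this sketch the wrapper would call should_trace(service, args) and return wrapped(*args, **kwargs) directly when it is False. A user with an SSO-backed emitter would then call add_ignored_operations("GetRoleCredentials") once before patch_all.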

Also, not sure how feasible it would be for you to use OpenTelemetry, but you can try writing your own SpanExporter (refer to the ConsoleSpanExporter) if that works for you.

nspsngh commented 1 year ago

Appreciate the response. Yes, the ability to configure which operations are excluded would be adequate. OpenTelemetry is not used on the current project, so writing a custom exporter is not quite feasible. The core idea here was to not have any external compute hosting either the collector or the X-Ray daemon. We needed to run the entire tracing system in-process.

Are you willing to accept a PR for configuring the operations for exclusion when patching boto?

srprash commented 1 year ago

> Are you willing to accept a PR for configuring the operations for exclusion when patching boto?

Absolutely! I will be happy to review such a PR.