aws / aws-xray-sdk-dotnet

The official AWS X-Ray SDK for .NET.
Apache License 2.0
110 stars 64 forks source link

Lambda called by SQS Event Trigger - supported? #110

Open carpnick opened 4 years ago

carpnick commented 4 years ago

Why does this line exist? I am attempting to do a single trace from API GW --> Lambda Function --> SQS --> Lambda (SQS Event Trigger lambda poller) --> Lambda function

In my lambda function I want to create a segment that is part of the trace pipeline. I have done all the unwrapping and setting of correct environment variables from the SQS message, but then I see this line. Can someone explain in detail why the lambda cant create the segment?

I do realize typically Lambda adds its own segment automatically, but in this use case it does not.

srprash commented 4 years ago

Hi @carpnick , To answer your question as to why creating a segment within a lambda function is unsupported is because typically a segment represents a microservice. Since lambda is a managed service, it creates a segment of its own and users can attach only subsegments to it.

For your use case above, currently you should see a segment for your event triggered lambda function and a disconnected service map between the sqs and lambda. If you could provide your service map and the raw trace data, I can take a look into the issue.

Thanks, Prashant.

carpnick commented 4 years ago

Yes I can make the disconnected service map show up if I enable active tracing on the lambda function on the other side of the SQS queue. What i was going for was a single trace that went all the way through the SQS Event trigger without enabling active tracing on the lambda function.

carpnick commented 4 years ago

Probably not worth giving you the trace data, since I can make what you suggest happen by enabling active tracing. If I dont enable active tracing, once it gets to the SQS queue, all tracing stops, even if the SQS client SDK adds the trace id into the sqs message attributes.

kosmakoff commented 4 years ago

Hi @carpnick Were you able to solve your problem? I am having the same exact problem as you and I have already spent too much time trying to solve it.

Could you share your code for trace context recovery?

Mine is as follows, translated from Java example here, and it doesn't seem to work, the whole process is being traced under the original Trace ID of that Lambda:

// traceHeader is calculated above and is a trace header of previous lambda, passed via SQS
var entity = AWSXRayRecorder.Instance.TraceContext.GetEntity();
entity.TraceId = traceHeader.RootTraceId;
entity.ParentId = traceHeader.ParentId;
entity.Sampled = traceHeader.Sampled;
carpnick commented 4 years ago

@kosmakoff , Yeah I gave up on the recovery piece. The problem with doing the recovery is the code as it stands cannot create a new segment within Lambda, and the Lambda Service/Function do not show up correctly because the Lambda Polling service does not propagate the trace id correctly.

The way I solved this problem is as follows:

In this way I have end 2 end tracing even if they are disconnected traces. Basically I can do a trace filter in XRAY and filter based on my custom annotation with my own generated trace id. This will then list the traces that are for that specific request. So while I might not have one trace from E2E through a queue, I get the equivalent of it in multiple traces. So if you have an architecture that hops through a couple Queues, you can just keep appending your own generated ID as far as you want it to go. Then you get an active trace session that is separate for each one. It is still providing you the traces of how long each process takes, just not as a nice neat single trace from E2E.

My 2cents.

srprash commented 4 years ago

Hi @carpnick and @kosmakoff , This issue is something that we have in our backlog and will be prioritized. Please stay tuned for updates.

Thanks.

jstallm commented 4 years ago

@srprash , Is this still in the backlog? Its been close to 1 year and I just wanted to make sure. I am running into this same feature need and would like to know what the thoughts are at this point. I would imagine a lot of customers really wanting to trace end to end on an sqs workflow when lambdas are triggered from sqs messages. Here is what things look like today, with SQS messages being the last node in the trace.

image

srprash commented 3 years ago

Hi @jstallm Apologies for such a long wait for this support and I really appreciate the patience. I can assure you that we are actively working on this issue and we will provide an update as soon as possible. Thanks, Prashant

nimatt commented 3 years ago

I really needed this to be able to see proper traces, so I created a workaround until we get an official alternative.

Warning: The code is using reflection to modify the SDK behaviour and with that allow creation of segments. This is of course very fragile and could break with an update. So use on your own peril...

Also note, that this does of course not give you Initialization and Overhead segments since they are per lambda and this will be per record in the event (since different records might have different traces).

I created a couple of extension methods to simplify usage of this in all SQS lambdas

public static IDisposable PickUpTrace(this SQSEvent.SQSMessage record, ILambdaContext context)
{
    var traceHeaderStr = record.Attributes.GetValueOrDefault("AWSTraceHeader", null);
    if (string.IsNullOrWhiteSpace(traceHeaderStr))
    {
        return Disposable.Create(() => { });
    }

    return TraceHeader.FromString(traceHeaderStr).PickUpTrace(context);
}

public static IDisposable PickUpTrace(this TraceHeader traceHeader, ILambdaContext context)
{
    var originalEntity = AWSXRayRecorder.Instance.GetEntity();
    EnsureNotInLambda(context);

    var newRoot = new Segment(context.FunctionName)
    {
        Origin = "AWS::Lambda::Function",
        StartTime = ((decimal)DateTimeOffset.Now.ToUnixTimeMilliseconds()) / 1000,
        TraceId = traceHeader.RootTraceId,
        ParentId = traceHeader.ParentId,
        Sampled = traceHeader.Sampled
    };
    newRoot.Aws.Add("function_arn", context.InvokedFunctionArn);
    newRoot.Aws.Add("resource_names", new[] { context.FunctionName });

    AWSXRayRecorder.Instance.SetEntity(newRoot);

    return new DisposableAction(() =>
    {
        AWSXRayRecorder.Instance.EndSegment();
        AWSXRayRecorder.Instance.SetEntity(originalEntity);
    });
}

private static void EnsureNotInLambda(ILambdaContext context)
{
    if (AWSXRayRecorder.IsLambda())
    {
        MethodInfo? isLambdaMethod = typeof(AWSXRayRecorder).GetMethod("IsLambda", BindingFlags.Static | BindingFlags.Public);
        MethodInfo? isFalseMethod = typeof(TraceHeaderExtensions).GetMethod("IsFalse", BindingFlags.Static | BindingFlags.NonPublic);
        if (isFalseMethod == null || isLambdaMethod == null)
        {
            context.Logger.Log("Unable to replace is in lambda check method on AWSXRayRecorder");
            return;
        }
        // Note: This won't work if a debugger is attached since it will inject code that changes method location
        // But we won't be in a lambda with a debugger attached.
        unsafe
        {
            if (IntPtr.Size == 4)
            {
                int* isLambdaPtr = (int*)isLambdaMethod.MethodHandle.Value.ToPointer() + 2;
                int* isFalsePtr = (int*)isFalseMethod.MethodHandle.Value.ToPointer() + 2;
                *isLambdaPtr = *isFalsePtr;
            }
            else
            {
                long* isLambdaPtr = (long*)isLambdaMethod.MethodHandle.Value.ToPointer() + 1;
                long* isFalsePtr = (long*)isFalseMethod.MethodHandle.Value.ToPointer() + 1;
                *isLambdaPtr = *isFalsePtr;
            }
        }
    }
}

private static bool IsFalse() => false;

private class DisposableAction: Disposable
{
    private readonly Action onDispose;

    public DisposableAction(Action onDispose)
    {
        this.onDispose = onDispose;
    }

    protected override void OnDispose()
    {
        base.OnDispose();
        onDispose?.Invoke();
    }
}

The usage of this becomes rather simple

public async Task<string> HandlerAsync(SQSEvent request)
{
    AWSSDKHandler.RegisterXRayForAllServices();

    foreach (var record in request.Records)
    {
        using (record.PickUpTrace(Context))
        {
            await ProcessRecord(record);
        }
    }

  return $"Processed {request.Records.Count} records";
}
williamdenton commented 3 years ago

@nimatt do you think it would be safer to change the behaviour of IsLambda() by momentarily removing then putting back this env var rather than replacing with reflection when calling Begin|EndSegment?

https://github.com/aws/aws-xray-sdk-dotnet/blob/d2b64125ecdd734af220795d5b102bd96b037981/sdk/src/Core/AWSXRayRecorder.netcore.cs#L36

https://github.com/aws/aws-xray-sdk-dotnet/blob/d2b64125ecdd734af220795d5b102bd96b037981/sdk/src/Core/AWSXRayRecorder.netcore.cs#L429-L439

sdiakovskyi-gd commented 3 years ago

@nimatt @williamdenton Yep, I've used this approach - for me, it seems a bit better than unsafe code.

Anyway, I believe this should be implemented in the X-Ray itself because this is still 2 separate traces, just a hack to see both of them in the service map.

    public static class XRayExtensions
    {
        public static IDisposable PickUpTrace(this SQSEvent.SQSMessage record, ILambdaContext context)
        {
            var traceHeaderStr = record.Attributes.GetValueOrDefault("AWSTraceHeader", null);
            Console.WriteLine($"Trace header is {traceHeaderStr}");

            if (string.IsNullOrWhiteSpace(traceHeaderStr))
            {
                return null;
            }

            return TraceHeader.FromString(traceHeaderStr).PickUpTrace(context);
        }

        public static IDisposable PickUpTrace(this TraceHeader traceHeader, ILambdaContext context)
        {
            var originalEntity = AWSXRayRecorder.Instance.GetEntity();

            using (DisableXRayIsLambdaCheck())
            {
                AWSXRayRecorder.Instance.BeginSegment(context.FunctionName,
                    traceId: traceHeader.RootTraceId,
                    parentId: traceHeader.ParentId,
                    samplingResponse: new SamplingResponse(traceHeader.Sampled));
            }

            var entity = (Segment)AWSXRayRecorder.Instance.GetEntity();
            entity.Origin = "AWS::Lambda::Function";
            entity.Aws.Add("function_arn", context.InvokedFunctionArn);
            entity.Aws.Add("resource_names", new[] { context.FunctionName });

            return new DisposableAction(() =>
            {
                using (DisableXRayIsLambdaCheck())
                {
                    AWSXRayRecorder.Instance.EndSegment();
                }

                AWSXRayRecorder.Instance.SetEntity(originalEntity);
            });
        }

        // we're trying to trick XRay recorder IsLambda checks to not get UnsupportedOperationException from AWS XRay SDK
        private static IDisposable DisableXRayIsLambdaCheck()
        {
            const string LambdaTaskRootKey = "LAMBDA_TASK_ROOT";

            var realValue = Environment.GetEnvironmentVariable(LambdaTaskRootKey);
            Environment.SetEnvironmentVariable(LambdaTaskRootKey, null);

            return new DisposableAction(() => Environment.SetEnvironmentVariable(LambdaTaskRootKey, realValue));
        }

        private class DisposableAction : IDisposable
        {
            private readonly Action _onDispose;

            public DisposableAction(Action onDispose)
            {
                _onDispose = onDispose;
            }

            public void Dispose()
            {
                _onDispose.Invoke();
            }
        }
    }
nimatt commented 3 years ago

@williamdenton @goacoustic That is what I did to begin with. But I stepped away from it due to that in my tests it didn't work if I didn't keep the environment variable changed during the entire subsection. That in it self is not an issue, but I do not know if there are any other parts of the SDK that are using that variable for something else and I didn't want to risk having something else misbehave due to this.

williamdenton commented 3 years ago

@srprash its been a few months since you last posted, any progress on an official response to this issue?

bharathshridhara commented 2 years ago

@srprash A friendly nudge to see where AWS are with this issue, please?

willarmiros commented 2 years ago

Hi @bharathshridhara - this issue is the primary tracking issue for this use case, and my latest comment is still the current status on the issue: https://github.com/aws/aws-xray-sdk-node/issues/208#issuecomment-1009335700