aws / aws-xray-sdk-go

AWS X-Ray SDK for the Go programming language.
Apache License 2.0
278 stars 117 forks source link

Intermittent ExceptionFromError Failures #255

Closed mousedownmike closed 4 years ago

mousedownmike commented 4 years ago

I've been hesitating opening an issue for this because it is so intermittent but it has appeared in both unit test runs and in production runs. The fault occurs at various places in my code but it always stems from calling xray.AddError(ctx, e) in a manner similar to:

func Example() {
    ctx, xs := xray.BeginSubsegment(ctx, "Example")
    workerCount := 100

    rs := my.Records()

    var wg sync.WaitGroup
    rc := make(chan *my.Record)
    for i := 0; i < workerCount; i++ {
        wg.Add(1)
        go func() {
            for {
                r, more := <-rc
                if more {
                    e := my.Queue(ctx, Message(r))
                    if e != nil {
                        log.Error().
                            Err(e).
                            Str("record_number", r.Number).
                            Msg("error queueing record")
                        _ = xray.AddError(ctx, e)
                    }
                } else {
                    wg.Done()
                    break
                }
            }
        }()
    }

    for _, s := range rs {
        rc <- s
    }
    close(rc)
    wg.Wait()
    xs.Close(nil)
}

That is being called inside of a go func() that is part of a WaitGroup reading records off a channel (I've updated the example to more clearly articulate this). Is it possible there is a concurrency problem with my call to AddError? Should I be doing that in a different way? Note that this code is being run in a Lambda function so it runs inside the Facade segment.

This is consistently the trace of the exception.

github.com/aws/aws-xray-sdk-go/strategy/exception.(*DefaultFormattingStrategy).ExceptionFromError(0xc0000a4d00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /home/mike/go/pkg/mod/github.com/aws/aws-xray-sdk-go@v1.1.0/strategy/exception/default_exception_formatting_strategy.go:132 +0x186
github.com/aws/aws-xray-sdk-go/xray.(*Segment).addError(0xc00043b980, 0x0, 0x0)
    /home/mike/go/pkg/mod/github.com/aws/aws-xray-sdk-go@v1.1.0/xray/segment.go:593 +0x112
github.com/aws/aws-xray-sdk-go/xray.(*Segment).AddError(0xc00043b980, 0x0, 0x0, 0x0, 0x0)
    /home/mike/go/pkg/mod/github.com/aws/aws-xray-sdk-go@v1.1.0/xray/segment.go:585 +0x7e
github.com/aws/aws-xray-sdk-go/xray.AddError(0xa58b60, 0xc000407410, 0x0, 0x0, 0x3, 0xc00029ea80)
    /home/mike/go/pkg/mod/github.com/aws/aws-xray-sdk-go@v1.1.0/xray/context.go:104 +0x87
wangzlei commented 4 years ago

Hi, Thanks for diving xray sdk. The code seems not safe because there might be race condition that more than one goroutines call xray.AddError(ctx, e) at the same time. Could you try to replace line _ = xray.AddError(ctx, e) to add e into another channel, then raise a new goroutine to poll Err from this channel and call _ = xray.AddError(ctx, e) ? by this way it makes sure no race condition in xray.AddError

Thanks

mousedownmike commented 4 years ago

@wangzlei thanks for the reply. Ok, that makes sense. I don't think I get enough value from the additional error metadata in my X-Ray trace to go through that effort, I can easily extract that data from my logs.

Here's a quick question before I close this issue. Are any of the SDK functions considered safe for concurrent use in Go? I don't see anything in the docs one way or the other.

wangzlei commented 4 years ago

Sorry that in doc we don't mention too much about sdk's concurrency safety. In goLang world by default we suppose the sdk is not concurrency safety and have to use Channel or lock to handle multi-threads case, unless explicitly say it is concurrency safety.

mousedownmike commented 4 years ago

Thanks @wangzlei