aws / aws-xray-sdk-go

AWS X-Ray SDK for the Go programming language.
Apache License 2.0
276 stars 117 forks source link

Memory leak with BeginSubsegment #364

Closed thinhlvv closed 1 year ago

thinhlvv commented 2 years ago

I'm getting memory leak issue when executing BeginSubSegment. Here is the profiling graph we got when the issue was happening.

AWS X-Ray SDK for Go: v1.7.0 Go version: go1.16.2 darwin/amd64

Screenshot 2022-05-18 at 3 44 31 PM Screenshot 2022-05-18 at 3 47 08 PM

nikolaianohyn commented 2 years ago

For sampled segments Close() function doesn't cancel context (!)

// check whether segment is dummy or not based on sampling decision
    if !seg.ParentSegment.Sampled {
        seg.Dummy = true
    }

    // create a new context to close it on segment close;
    // this makes sure the closing of the segment won't affect the client's code, that uses the returned context
    ctx1, cancelCtx := context.WithCancel(ctx)
    seg.cancelCtx = cancelCtx
    go func() {
        <-ctx1.Done()
        seg.handleContextDone()
    }()

Fragment from close function:

    if seg.Dummy {
        seg.Unlock()
        return
    }

    cancelSegCtx := seg.cancelCtx

    seg.Unlock()
    seg.send()

    // makes sure the goroutine, that waits for possible top-level context cancellation gets closed on segment close
    if cancelSegCtx != nil {
        cancelSegCtx()
    }

So if segment "Dummy" we have leak routine

stijndehaes commented 2 years ago

We are seeing memory leaks in production for one of our applications. Could these dummy segments be the culprit? I see most memory usage is in Xray for the moment, still need to investigate it further. If so could we perhaps have a release with the PR with the fix?

willarmiros commented 2 years ago

Hi @stijndehaes, thanks for following up. We will be cutting a release as soon as able to get this fix out.

stijndehaes commented 2 years ago

Hi @willarmiros, thank you very much! I yesterday just upgraded to the commit with the fix on main and saw immediate results, memory pressure was way down, so I can confirm it helps!

byrnedo commented 2 years ago

I'm seeing this too in production in v1.7.0. I create a cancellable context in the top level of my main and pass it into a message handler callback for messages received from a queue.

func main() {
    ctx, cncl := context.WithCancel(context.Background())
    defer cncl()
    ....

func createMessageCallback(ctx context.Context) func(msg *stan.Msg) {
        // called for every received message
    return func(msg *stan.Msg) {

        var seg *xray.Segment
        ctx, seg = xray.BeginSegment(ctx, "foo")
        defer seg.Close(nil)
        ...

Looking at pprof, it's leaking a goroutine for every message I handle:

goroutine profile: total 1088
10968 @ 0x43a456 0x406d1b 0x406818 0xb01f12 0x46b621
#   0xb01f11    github.com/aws/aws-xray-sdk-go/xray.BeginSegmentWithSampling.func1+0x31 /server/vendor/github.com/aws/aws-xray-sdk-go/xray/segment.go:144
stijndehaes commented 2 years ago

@byrnedo

As a temporary workaround in your go.mod file you can point to this commit by replacing:

require (
        ....
    github.com/aws/aws-xray-sdk-go v1.7.0
        ...
)

with:

require (
        ....
    github.com/aws/aws-xray-sdk-go v1.7.1-0.20220520211606-a735b941af42
        ...
)

That way you can already run the fix without having to wait for a new release

byrnedo commented 2 years ago

Thanks, I'll do that!

stijndehaes commented 1 year ago

Release 1.7.1 includes the fix for the memory leak. I believe this issue can be closed