aws / aws-xray-sdk-go

AWS X-Ray SDK for the Go programming language.
Apache License 2.0
276 stars 118 forks source link

Segment silently fails to flush when a subsegment is created with a closed parent #456

Open danielwhite opened 4 months ago

danielwhite commented 4 months ago

If a subsegment is created where the parent segment has been closed, then the entire trace segment will fail to be flushed.

The underlying reason seems to be that Segment.openSegments becomes negative resulting in Segment.flush ignoring the segment because it will only handle a segment with exactly zero openSegments.

https://github.com/aws/aws-xray-sdk-go/blob/a0604c52d192cc836b6bfbdd7bdde6a1ba579746/xray/segment.go#L506-L507

The current silent loss of the entire trace segment is very surprising and difficult to diagnose.

Example

A more complete test case demonstrates this:

func TestSegment_Close_out_of_order(t *testing.T) {
    ctx, td := NewTestDaemon()
    defer td.Close()
    ctx, seg := BeginSegment(ctx, "test")

    ctxSubSeg1, subSeg1 := BeginSubsegment(ctx, "TestSubsegment1")
    subSeg1.Close(nil) // closed before it is used as a parent
    _, subSeg2 := BeginSubsegment(ctxSubSeg1, "TestSubsegment2")
    subSeg2.Close(nil)

    seg.Close(nil)

    assert.Zero(t, seg.openSegments)
    assert.True(t, seg.Emitted)
}

Currently, this will fail with the following:

=== RUN   TestSegment_Close_out_of_order
2024-03-08T01:31:26+11:00 [INFO] Emitter using address: 127.0.0.1:32794
    segment_test.go:170: 
            Error Trace:    segment_test.go:170
            Error:          Should be zero, but was -1
            Test:           TestSegment_Close_out_of_order
    segment_test.go:171: 
            Error Trace:    segment_test.go:171
            Error:          Should be true
            Test:           TestSegment_Close_out_of_order
--- FAIL: TestSegment_Close_out_of_order (0.00s)
danielwhite commented 4 months ago

I'd be happy to open a PR, but I'm not entirely sure the right course of action.

The Java SDK seems to handle a similar situation with a "NoOp" segment.

Something like the following seems to fix both the test case in the description and the issue I'm seeing in practice:

diff --git a/xray/segment.go b/xray/segment.go
index 4e96b34..f9e1507 100644
--- a/xray/segment.go
+++ b/xray/segment.go
@@ -314,6 +314,9 @@ func BeginSubsegment(ctx context.Context, name string) (context.Context, *Segmen
        if !seg.ParentSegment.Sampled {
                seg.Dummy = true
        }
+       if !seg.parent.InProgress {
+               seg.Dummy = true
+       }

I'm not sure if this is the best way to express this or if it risks hiding the problem.