elastic / apm-queue

Abstraction layer over Kafka / GCP PubSub Lite to produce and consume records
Apache License 2.0
6 stars 15 forks source link

Flaky pubsublite test #562

Closed rubvs closed 1 week ago

rubvs commented 2 weeks ago

While investigation https://github.com/elastic/apm-queue/issues/464 I came across a more subtle flaky test in the pubsublite package:

apm-queue [main] > make test
go test -race  -count=10 -timeout=60s ./...
?       github.com/elastic/apm-queue/v2 [no test files]
?       github.com/elastic/apm-queue/v2/cmd/queuebench  [no test files]
?       github.com/elastic/apm-queue/v2/metrictest  [no test files]
ok      github.com/elastic/apm-queue/v2/kafka   18.965s
ok      github.com/elastic/apm-queue/v2/pubsublite  3.949s
ok      github.com/elastic/apm-queue/v2/pubsublite/internal/pubsubabs   3.205s
--- FAIL: TestPublisher (0.02s)
    --- FAIL: TestPublisher/success_with_otel_attributes#01 (0.01s)
        publisher_test.go:266:
                Error Trace:    /Users/rubenvanstaden/github/elastic/apm-queue/pubsublite/internal/telemetry/publisher_test.go:266
                Error:          Not equal:
                                expected: tracetest.SpanStubs{tracetest.SpanStub{Name:"pubsublite.Publish", SpanContext:trace.SpanContext{traceID:trace.TraceID{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, spanID:trace.SpanID{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, traceFlags:0x0, traceState:trace.TraceState{list:[]trace.member(nil)}, remote:false}, Parent:trace.SpanContext{traceID:trace.TraceID{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, spanID:trace.SpanID{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, traceFlags:0x0, traceState:trace.TraceState{list:[]trace.member(nil)}, remote:false}, SpanKind:4, StartTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), EndTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Attributes:[]attribute.KeyValue{attribute.KeyValue{Key:"project", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"project_name", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"pubsublite", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.destination.kind", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"topic", slice:interface {}(nil)}}, attribute.KeyValue{Key:"key", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"value", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.message.id", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"msg-id", slice:interface {}(nil)}}}, Events:[]trace.Event(nil), Links:[]trace.Link(nil), Status:trace.Status{Code:0x2, Description:""}, DroppedAttributes:0, DroppedEvents:0, DroppedLinks:0, ChildSpanCount:0, Resource:(*resource.Resource)(nil), InstrumentationScope:instrumentation.Scope{Name:"test", Version:"", SchemaURL:""}, InstrumentationLibrary:instrumentation.Scope{Name:"", Version:"", SchemaURL:""}}}
                                actual  : tracetest.SpanStubs{}

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,110 +1,2 @@
                                -(tracetest.SpanStubs) (len=1) {
                                - (tracetest.SpanStub) {
                                -  Name: (string) (len=18) "pubsublite.Publish",
                                -  SpanContext: (trace.SpanContext) {
                                -   traceID: (trace.TraceID) (len=16) {
                                -    00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
                                -   },
                                -   spanID: (trace.SpanID) (len=8) {
                                -    00000000  00 00 00 00 00 00 00 00                           |........|
                                -   },
                                -   traceFlags: (trace.TraceFlags) 0,
                                -   traceState: (trace.TraceState) {
                                -    list: ([]trace.member) <nil>
                                -   },
                                -   remote: (bool) false
                                -  },
                                -  Parent: (trace.SpanContext) {
                                -   traceID: (trace.TraceID) (len=16) {
                                -    00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
                                -   },
                                -   spanID: (trace.SpanID) (len=8) {
                                -    00000000  00 00 00 00 00 00 00 00                           |........|
                                -   },
                                -   traceFlags: (trace.TraceFlags) 0,
                                -   traceState: (trace.TraceState) {
                                -    list: ([]trace.member) <nil>
                                -   },
                                -   remote: (bool) false
                                -  },
                                -  SpanKind: (trace.SpanKind) 4,
                                -  StartTime: (time.Time) {
                                -   wall: (uint64) 0,
                                -   ext: (int64) 0,
                                -   loc: (*time.Location)(<nil>)
                                -  },
                                -  EndTime: (time.Time) {
                                -   wall: (uint64) 0,
                                -   ext: (int64) 0,
                                -   loc: (*time.Location)(<nil>)
                                -  },
                                -  Attributes: ([]attribute.KeyValue) (len=5) {
                                -   (attribute.KeyValue) {
                                -    Key: (attribute.Key) (len=7) "project",
                                -    Value: (attribute.Value) {
                                -     vtype: (attribute.Type) 4,
                                -     numeric: (uint64) 0,
                                -     stringly: (string) (len=12) "project_name",
                                -     slice: (interface {}) <nil>
                                -    }
                                -   },
                                -   (attribute.KeyValue) {
                                -    Key: (attribute.Key) (len=16) "messaging.system",
                                -    Value: (attribute.Value) {
                                -     vtype: (attribute.Type) 4,
                                -     numeric: (uint64) 0,
                                -     stringly: (string) (len=10) "pubsublite",
                                -     slice: (interface {}) <nil>
                                -    }
                                -   },
                                -   (attribute.KeyValue) {
                                -    Key: (attribute.Key) (len=26) "messaging.destination.kind",
                                -    Value: (attribute.Value) {
                                -     vtype: (attribute.Type) 4,
                                -     numeric: (uint64) 0,
                                -     stringly: (string) (len=5) "topic",
                                -     slice: (interface {}) <nil>
                                -    }
                                -   },
                                -   (attribute.KeyValue) {
                                -    Key: (attribute.Key) (len=3) "key",
                                -    Value: (attribute.Value) {
                                -     vtype: (attribute.Type) 4,
                                -     numeric: (uint64) 0,
                                -     stringly: (string) (len=5) "value",
                                -     slice: (interface {}) <nil>
                                -    }
                                -   },
                                -   (attribute.KeyValue) {
                                -    Key: (attribute.Key) (len=20) "messaging.message.id",
                                -    Value: (attribute.Value) {
                                -     vtype: (attribute.Type) 4,
                                -     numeric: (uint64) 0,
                                -     stringly: (string) (len=6) "msg-id",
                                -     slice: (interface {}) <nil>
                                -    }
                                -   }
                                -  },
                                -  Events: ([]trace.Event) <nil>,
                                -  Links: ([]trace.Link) <nil>,
                                -  Status: (trace.Status) {
                                -   Code: (codes.Code) 2,
                                -   Description: (string) ""
                                -  },
                                -  DroppedAttributes: (int) 0,
                                -  DroppedEvents: (int) 0,
                                -  DroppedLinks: (int) 0,
                                -  ChildSpanCount: (int) 0,
                                -  Resource: (*resource.Resource)(<nil>),
                                -  InstrumentationScope: (instrumentation.Scope) {
                                -   Name: (string) (len=4) "test",
                                -   Version: (string) "",
                                -   SchemaURL: (string) ""
                                -  },
                                -  InstrumentationLibrary: (instrumentation.Scope) {
                                -   Name: (string) "",
                                -   Version: (string) "",
                                -   SchemaURL: (string) ""
                                -  }
                                - }
                                +(tracetest.SpanStubs) {
                                 }
                Test:           TestPublisher/success_with_otel_attributes#01
FAIL
FAIL    github.com/elastic/apm-queue/v2/pubsublite/internal/telemetry   2.016s
ok      github.com/elastic/apm-queue/v2/queuecontext    3.056s
ok      github.com/elastic/apm-queue/v2/systemtest  4.429s
FAIL
make: *** [test] Error 1

After some investigation, the problem seems to be related to the child goroutine in https://github.com/elastic/apm-queue/blob/6675642e184c322eadac10a3c6155ed267fc8bdb/pubsublite/internal/telemetry/publisher.go#L114 not being synchronized with it's parent.

This test is really flaky. It's seems to only fail 1/10 runs at best.