Closed rubvs closed 1 week ago
While investigation https://github.com/elastic/apm-queue/issues/464 I came across a more subtle flaky test in the pubsublite package:
pubsublite
apm-queue [main] > make test go test -race -count=10 -timeout=60s ./... ? github.com/elastic/apm-queue/v2 [no test files] ? github.com/elastic/apm-queue/v2/cmd/queuebench [no test files] ? github.com/elastic/apm-queue/v2/metrictest [no test files] ok github.com/elastic/apm-queue/v2/kafka 18.965s ok github.com/elastic/apm-queue/v2/pubsublite 3.949s ok github.com/elastic/apm-queue/v2/pubsublite/internal/pubsubabs 3.205s --- FAIL: TestPublisher (0.02s) --- FAIL: TestPublisher/success_with_otel_attributes#01 (0.01s) publisher_test.go:266: Error Trace: /Users/rubenvanstaden/github/elastic/apm-queue/pubsublite/internal/telemetry/publisher_test.go:266 Error: Not equal: expected: tracetest.SpanStubs{tracetest.SpanStub{Name:"pubsublite.Publish", SpanContext:trace.SpanContext{traceID:trace.TraceID{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, spanID:trace.SpanID{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, traceFlags:0x0, traceState:trace.TraceState{list:[]trace.member(nil)}, remote:false}, Parent:trace.SpanContext{traceID:trace.TraceID{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, spanID:trace.SpanID{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, traceFlags:0x0, traceState:trace.TraceState{list:[]trace.member(nil)}, remote:false}, SpanKind:4, StartTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), EndTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Attributes:[]attribute.KeyValue{attribute.KeyValue{Key:"project", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"project_name", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"pubsublite", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.destination.kind", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"topic", slice:interface {}(nil)}}, attribute.KeyValue{Key:"key", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"value", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.message.id", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"msg-id", slice:interface {}(nil)}}}, Events:[]trace.Event(nil), Links:[]trace.Link(nil), Status:trace.Status{Code:0x2, Description:""}, DroppedAttributes:0, DroppedEvents:0, DroppedLinks:0, ChildSpanCount:0, Resource:(*resource.Resource)(nil), InstrumentationScope:instrumentation.Scope{Name:"test", Version:"", SchemaURL:""}, InstrumentationLibrary:instrumentation.Scope{Name:"", Version:"", SchemaURL:""}}} actual : tracetest.SpanStubs{} Diff: --- Expected +++ Actual @@ -1,110 +1,2 @@ -(tracetest.SpanStubs) (len=1) { - (tracetest.SpanStub) { - Name: (string) (len=18) "pubsublite.Publish", - SpanContext: (trace.SpanContext) { - traceID: (trace.TraceID) (len=16) { - 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| - }, - spanID: (trace.SpanID) (len=8) { - 00000000 00 00 00 00 00 00 00 00 |........| - }, - traceFlags: (trace.TraceFlags) 0, - traceState: (trace.TraceState) { - list: ([]trace.member) <nil> - }, - remote: (bool) false - }, - Parent: (trace.SpanContext) { - traceID: (trace.TraceID) (len=16) { - 00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| - }, - spanID: (trace.SpanID) (len=8) { - 00000000 00 00 00 00 00 00 00 00 |........| - }, - traceFlags: (trace.TraceFlags) 0, - traceState: (trace.TraceState) { - list: ([]trace.member) <nil> - }, - remote: (bool) false - }, - SpanKind: (trace.SpanKind) 4, - StartTime: (time.Time) { - wall: (uint64) 0, - ext: (int64) 0, - loc: (*time.Location)(<nil>) - }, - EndTime: (time.Time) { - wall: (uint64) 0, - ext: (int64) 0, - loc: (*time.Location)(<nil>) - }, - Attributes: ([]attribute.KeyValue) (len=5) { - (attribute.KeyValue) { - Key: (attribute.Key) (len=7) "project", - Value: (attribute.Value) { - vtype: (attribute.Type) 4, - numeric: (uint64) 0, - stringly: (string) (len=12) "project_name", - slice: (interface {}) <nil> - } - }, - (attribute.KeyValue) { - Key: (attribute.Key) (len=16) "messaging.system", - Value: (attribute.Value) { - vtype: (attribute.Type) 4, - numeric: (uint64) 0, - stringly: (string) (len=10) "pubsublite", - slice: (interface {}) <nil> - } - }, - (attribute.KeyValue) { - Key: (attribute.Key) (len=26) "messaging.destination.kind", - Value: (attribute.Value) { - vtype: (attribute.Type) 4, - numeric: (uint64) 0, - stringly: (string) (len=5) "topic", - slice: (interface {}) <nil> - } - }, - (attribute.KeyValue) { - Key: (attribute.Key) (len=3) "key", - Value: (attribute.Value) { - vtype: (attribute.Type) 4, - numeric: (uint64) 0, - stringly: (string) (len=5) "value", - slice: (interface {}) <nil> - } - }, - (attribute.KeyValue) { - Key: (attribute.Key) (len=20) "messaging.message.id", - Value: (attribute.Value) { - vtype: (attribute.Type) 4, - numeric: (uint64) 0, - stringly: (string) (len=6) "msg-id", - slice: (interface {}) <nil> - } - } - }, - Events: ([]trace.Event) <nil>, - Links: ([]trace.Link) <nil>, - Status: (trace.Status) { - Code: (codes.Code) 2, - Description: (string) "" - }, - DroppedAttributes: (int) 0, - DroppedEvents: (int) 0, - DroppedLinks: (int) 0, - ChildSpanCount: (int) 0, - Resource: (*resource.Resource)(<nil>), - InstrumentationScope: (instrumentation.Scope) { - Name: (string) (len=4) "test", - Version: (string) "", - SchemaURL: (string) "" - }, - InstrumentationLibrary: (instrumentation.Scope) { - Name: (string) "", - Version: (string) "", - SchemaURL: (string) "" - } - } +(tracetest.SpanStubs) { } Test: TestPublisher/success_with_otel_attributes#01 FAIL FAIL github.com/elastic/apm-queue/v2/pubsublite/internal/telemetry 2.016s ok github.com/elastic/apm-queue/v2/queuecontext 3.056s ok github.com/elastic/apm-queue/v2/systemtest 4.429s FAIL make: *** [test] Error 1
After some investigation, the problem seems to be related to the child goroutine in https://github.com/elastic/apm-queue/blob/6675642e184c322eadac10a3c6155ed267fc8bdb/pubsublite/internal/telemetry/publisher.go#L114 not being synchronized with it's parent.
This test is really flaky. It's seems to only fail 1/10 runs at best.
While investigation https://github.com/elastic/apm-queue/issues/464 I came across a more subtle flaky test in the
pubsublite
package:After some investigation, the problem seems to be related to the child goroutine in https://github.com/elastic/apm-queue/blob/6675642e184c322eadac10a3c6155ed267fc8bdb/pubsublite/internal/telemetry/publisher.go#L114 not being synchronized with it's parent.
This test is really flaky. It's seems to only fail 1/10 runs at best.