elastic / apm-queue

Abstraction layer over Kafka / GCP PubSub Lite to produce and consume records
Apache License 2.0
6 stars 15 forks source link

Tests: flaky failures #464

Open inge4pres opened 4 months ago

inge4pres commented 4 months ago

Describe the current state

During the development of #447 I noticed some flaky tests, passing locally most of the times but not in CI:

Describe the desired state

The tests are consistently reporting either a failure or a success.

inge4pres commented 4 months ago

For TestProducerMetrics: running it with

 go test -v -count=10 -run="TestProducerMetrics/DeadlineExceeded" ./kafka/...

most certainly produces a panic

=== RUN   TestProducerMetrics/DeadlineExceeded
    metrics_test.go:65: 
                Error Trace:    /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:65
                                                        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123
                Error:          Not equal: 
                                expected: 2
                                actual  : 1
                Test:           TestProducerMetrics/DeadlineExceeded
    metrics_test.go:67: producer.messages.count
    metrics_test.go:71: [Name not equal:
        expected: messaging.kafka.connect_errors.count
        actual: producer.messages.count Description not equal:
        expected: Total number of connection errors, by broker
        actual: The number of messages produced Metrics Data not equal: Sum[int64] not equal: Sum DataPoints not equal:
        missing expected values:
        metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[1]attribute.KeyValue{attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}}}}, StartTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Time:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Value:1, Exemplars:[]metricdata.Exemplar[int64](nil)}
        unexpected additional values:
        metricdata.DataPoint[int64]{Attributes:attribute.Set{equivalent:attribute.Distinct{iface:[7]attribute.KeyValue{attribute.KeyValue{Key:"error_reason", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"timeout", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.destination.name", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"default-topic", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.kafka.destination.partition", Value:attribute.Value{vtype:2, numeric:0x0, stringly:"", slice:interface {}(nil)}}, attribute.KeyValue{Key:"messaging.system", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"kafka", slice:interface {}(nil)}}, attribute.KeyValue{Key:"namespace", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space", slice:interface {}(nil)}}, attribute.KeyValue{Key:"outcome", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"failure", slice:interface {}(nil)}}, attribute.KeyValue{Key:"topic", Value:attribute.Value{vtype:4, numeric:0x0, stringly:"name_space-default-topic", slice:interface {}(nil)}}}}}, StartTime:time.Date(2024, time.May, 13, 17, 24, 7, 997820327, time.Local), Time:time.Date(2024, time.May, 13, 17, 24, 7, 998078794, time.Local), Value:3, Exemplars:[]metricdata.Exemplar[int64](nil)}
        ]
--- FAIL: TestProducerMetrics (0.00s)
    --- FAIL: TestProducerMetrics/DeadlineExceeded (0.00s)
panic: runtime error: index out of range [1] with length 1 [recovered]
        panic: runtime error: index out of range [1] with length 1

goroutine 7 [running]:
testing.tRunner.func1.2({0xdab780, 0xc000338078})
        /usr/local/go/src/testing/testing.go:1545 +0x238
testing.tRunner.func1()
        /usr/local/go/src/testing/testing.go:1548 +0x397
panic({0xdab780?, 0xc000338078?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func1({0xf62858, 0xc00024a5b0}, 0xc0001ff6c0, {0xf604c8, 0xc000180c60}, {0xf62c80, 0xc0000bad20}, {0xc000055a80, 0x2, 0x2}, ...)
        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:70 +0x6be
github.com/elastic/apm-queue/v2/kafka.TestProducerMetrics.func2(0xc0001ff6c0?)
        /home/francesco/Projects/github.com/elastic/apm-queue/kafka/metrics_test.go:123 +0xd68
testing.tRunner(0xc0001ff6c0, 0xc000053590)
        /usr/local/go/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 6
        /usr/local/go/src/testing/testing.go:1648 +0x3ad
FAIL    github.com/elastic/apm-queue/v2/kafka   0.008s
FAIL

From what I could see, it appears that the commit 49cfc505a0dcf1491b04854bae1d8b3801158184 introduced even more flakyness.

The offending line i if !slices.Contains(ignore, metrics[i].Name) { because when the returned metrics are less than want, we can still get to the loop (require would halt the test, but assert is used so we keep executing).

kyungeunni commented 3 months ago

One more flaky test I've discovered, TestTopicCreatorCreateTopics fails occasionally, with logs of data race detected. example: https://github.com/elastic/apm-queue/actions/runs/9282217459/job/25539755922#step:4:1059

simitt commented 1 month ago

@rubvs can you confirm that this can be closed now that https://github.com/elastic/apm-queue/pull/537 got merged?

rubvs commented 1 month ago

TestProducerMetrics are still failing with:

go test -v -count=10 -failfast ./kafka/...

However, TestConsumerGracefulShutdown seems to be fixed.