IBM / sarama

Sarama is a Go library for Apache Kafka.
MIT License
11.57k stars 1.76k forks source link

Cluster admin retry on error implementation is confusing #2515

Closed hindessm closed 1 year ago

hindessm commented 1 year ago
Versions
Sarama Kafka Go
12c24a8 n/a 1.20.6
Problem Description

I was considering fixing some of the ClusterAdmin functions that seem to be missing the retry-on-ErrNoController logic. However, I noticed that the retryOnError implementation was a little surprising/confusing so I thought I'd ask about the intended behaviour before started improving the other methods.

Firstly, there is an off-by-one in the "X retries remaining" logs. So it reports " attempts remaining" after making one attempt and "1 attempts remaining" when it is going to make no further attempts.

Secondly, it sleeps after the final failed retry rather than immediately returning the failure which is surprising to me and should probably be documented if it is intentional or corrected.

I'd be happy to submit a PR to fix both of these issues but I wasn't sure what the preferred fix for the second issue would be?

You can see the odd behaviour with the following test.

func Test_retryOnError(t *testing.T) {
        config := NewTestConfig()
        config.Version = V1_0_0_0
        config.Admin.Retry.Max = 3
        config.Admin.Retry.Backoff = 100 * time.Millisecond

        var buf bytes.Buffer
        Logger = log.New(&buf, "[sarama] ", log.LstdFlags)

        admin := &clusterAdmin{conf: config}
        startTime := time.Now()
        admin.retryOnError(func(error) bool { return true }, func() error {
                return errors.New("mock error")
        })
        if strings.Contains(buf.String(), "3 attempts remaining") {
                t.Fatalf("incorrect number of attempts remaining reported; should be max of 2 since one attempt has already been made")
        }
        if time.Since(startTime) >= 300*time.Millisecond {
                t.Fatalf("attempt+sleep+attempt+sleep+attempt should take less than 3 * backoff time")
        }
}
dnwe commented 1 year ago

Both are undoubtedly just flaws in the implementation rather than intentional behaviour so fixes would be welcomed 😀

hindessm commented 1 year ago

@dnwe Thanks for the clarification. I'll submit a PR soon.

Bonus question: what does this continue achieve?