confluentinc / confluent-kafka-go

Confluent's Apache Kafka Golang client
Apache License 2.0
4.64k stars 658 forks source link

Timeout logs are not consistent between stderr and logs channel #1114

Open vcianciarelli opened 10 months ago

vcianciarelli commented 10 months ago

Description

When enabling logs channel, the timeout (REQTMOUT) logs stopped showing up.

"go.logs.channel.enable": true,

How to reproduce

Without logs channel Using this kafka configuration, we see the timeout logs all the time in our logs:

kpPrimary, err = kafka.NewProducer(&kafka.ConfigMap{
    "bootstrap.servers":            conf.KafkaEndpoint,
    "compression.type":             "snappy",
    "queue.buffering.max.messages": 1000000,
    "linger.ms":                    50,
    "batch.size":                   1000000,
    "queue.buffering.max.kbytes":   10485760,
})
%5|1702394241.177|REQTMOUT|rdkafka#producer-1| [thrd:main/1]: main/1: Timed out ProduceRequest in flight (after 60078ms, timeout #0)
%5|1702394241.177|REQTMOUT|rdkafka#producer-1| [thrd:main/1]: main/1: Timed out ProduceRequest in flight (after 60054ms, timeout #1)
%5|1702394241.177|REQTMOUT|rdkafka#producer-1| [thrd:main/1]: main/1: Timed out ProduceRequest in flight (after 60027ms, timeout #2)
%5|1702394241.177|REQTMOUT|rdkafka#producer-1| [thrd:main/1]: main/1: Timed out ProduceRequest in flight (after 60000ms, timeout #3)
%4|1702394241.178|REQTMOUT|rdkafka#producer-1| [thrd:main/1]: main/1: Timed out 4 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
%3|1702394241.178|FAIL|rdkafka#producer-1| [thrd:main/1]: main/1: 4 request(s) timed out: disconnect (after 295933ms in state UP)

With logs channel After enabling logs channel with go.logs.channel.enable flag, we were no longer seeing the timeout logs:

kpPrimary, err = kafka.NewProducer(&kafka.ConfigMap{
    "bootstrap.servers":            conf.KafkaEndpoint,
    "compression.type":             "snappy",
    "queue.buffering.max.messages": 1000000,
    "linger.ms":                    50,
    "batch.size":                   1000000,
    "queue.buffering.max.kbytes":   10485760,
    "go.logs.channel.enable":       true,
})

Additional details To confirm that we've implemented the logs channel correctly, I added debug flag to our configuration:

kpPrimary, err = kafka.NewProducer(&kafka.ConfigMap{
    "bootstrap.servers":            conf.KafkaEndpoint,
    "compression.type":             "snappy",
    "queue.buffering.max.messages": 1000000,
    "linger.ms":                    50,
    "batch.size":                   1000000,
    "queue.buffering.max.kbytes":   10485760,
    "go.logs.channel.enable":       true,
    "debug":                        "all",
})

And we saw all the logs except the timeout logs. Here are some sample logs:

[2023-12-07T17:02:55Z][rdkafka#producer-1][STATE][7][thrd:main/21]: main/21: Broker changed state APIVERSION_QUERY -> UP
[2023-12-07T17:02:55Z][rdkafka#producer-1][TOPPAR][7][thrd:main/16]: main/16: events.v2 [155] 1 message(s) queued but broker not up
[2023-12-07T17:02:55Z][rdkafka#producer-1][FEATURE][7][thrd:main/16]: main/16: Updated enabled protocol features +ApiVersion to ApiVersion
[2023-12-07T17:02:55Z][rdkafka#producer-1][CONNECTED][7][thrd:main/16]: main/16: Connected (#1)

Checklist

Please provide the following information:

milindl commented 10 months ago

Just a question before I try to reproduce this: In case of the log channel, are the messages still timing out, and just that the log isn't there? Or they stop timing out entirely? The former seems more likely, but just in case I've missed something.

vcianciarelli commented 10 months ago

Hey @milindl

ya the messages are still timing out, and the logs aren't there anymore.

I've enabled logs channel on a canary pod and compared that to other pods not using logs channel. Other pods showed that we still have messages timing out.