confluentinc / confluent-kafka-dotnet

Confluent's Apache Kafka .NET client
https://github.com/confluentinc/confluent-kafka-dotnet/wiki
Apache License 2.0
80 stars 866 forks source link

Timeout Issues with KafkaEO & Transactions #1381

Closed ReeceKindon closed 2 years ago

ReeceKindon commented 4 years ago

Description

Since introducing Kafka Exactly Once semantics and Transactions into our platform we have recently seen a sharp rise in errors when trying to use the Transactional API. The most common errors we see are:

Failed to initialise Producer ID: Local: Timed out' with Exception
Operation not valid in state WaitPID, at Confluent.Kafka.Impl.SafeKafkaHandle.BeginTransaction() at Confluent.Kafka.Producer`2.BeginTransaction()
Operation not valid in state AbortableError, at Confluent.Kafka.Impl.SafeKafkaHandle.CommitTransaction(Int32 millisecondsTimeout) at Confluent.Kafka.Producer`2.CommitTransaction(TimeSpan timeout)

All of which seem to indicate a long delay in the Broker/Transaction Coordinator responding back to the Producer's request to Begin a Transaction.

We had several weeks during testing with little to no occurrences of the above but since going Live (just over 2 weeks ago) we have noticed the numbers starting to rise, with them being present in almost all of our Microservices all of the time now. As part of our EO implementation we have logic to Abort the Transaction and retry if such Exceptions are caught, but in recent examples it seems even that is not working as we would expect. Possibly because of a disconnect between the Producer and Broker?

From looking at our logs we can see in some scenarios it's taking > 60 seconds to just Init & Begin our Transactions, something that was previously a quick and simple step. This has slowly deteriorated our Microservices that are using the EO & Transactional API to the point where almost all are reporting the above errors and seemingly unable to recover.

How to reproduce

We Init & Begin Transaction as normal, generally passing a transactionTimeout of 60,000ms or more. Timeouts are happening when calling Begin & Commit Transaction.

//Init tranactions
_kafkaProducer.InitTransactions(new TimeSpan(0, 0, 0, 0, transactionTimeout));
_kafkaProducer.BeginTransaction();

//Commit Transaction
_kafkaProducer.CommitTransaction(new TimeSpan(0, 0, 0, 0, transactionTimeout));

Checklist

Please provide the following information:

Below are some logs taken from one of our Microservices, which seems to indicate the Broker being down:

%7|1597134906.825|STATE|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state UP -> DOWN %7|1597134906.825|STATE|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Broker changed state UP -> DOWN %7|1597134906.825|STATE|rdkafka#producer-4| [thrd::0/internal]: :0/internal: Broker changed state INIT -> DOWN %7|1597134906.825|BRKTERM|rdkafka#producer-4| [thrd::0/internal]: :0/internal: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s) %7|1597134906.825|BRKTERM|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s) %7|1597134906.825|BRKTERM|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator/5: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s) %7|1597134906.825|TERMINATE|rdkafka#producer-4| [thrd::0/internal]: :0/internal: Handle is terminating in state DOWN: 1 refcnts (0x256c870), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf %7|1597134906.825|BROKERFAIL|rdkafka#producer-4| [thrd::0/internal]: :0/internal: failed: err: Local: Broker handle destroyed: (errno: Success) %7|1597134906.825|TERMINATE|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator/5: Handle is terminating in state DOWN: 1 refcnts (0x7fa598002b70), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf %7|1597134906.825|TERMINATE|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Handle is terminating in state DOWN: 1 refcnts (0x257a600), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf %7|1597134906.825|BROKERFAIL|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator: failed: err: Local: Broker handle destroyed: (errno: Success) %7|1597134906.825|BROKERFAIL|rdkafka#producer-4| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: failed: err: Local: Broker handle destroyed: (errno: Success))

mhowlett commented 4 years ago

what is your throughput? (is it very low?)

ReeceKindon commented 4 years ago

@mhowlett I guess it is. Looking at our throughput for the last 24 hours; when most active it's around 150-300KB/s, but for the most part there's little to no activity.

mhowlett commented 4 years ago

the reason I ask is i'm wondering if it's related to this: https://github.com/edenhill/librdkafka/issues/2845#issuecomment-625433415

ReeceKindon commented 4 years ago

@mhowlett yeah it is interesting you point that out, I guess for the most part our Producers do sit in an idle state. We see a lot of the below messages in the Logs, but have just been ignoring them for now. When the SVC receives a new message it will then call "InitTransactions" again and Producer seems to spring back into life.

%5|1597162799.205|REQTMOUT|rdkafka#producer-2| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Timed out MetadataRequest in flight (after 120689ms, timeout #0) %4|1597162799.205|REQTMOUT|rdkafka#producer-2| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests -- KAFKA PRODUCER ERROR -- {"Code":-185,"IsFatal":false,"Reason":"sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: 1 request(s) timed out: disconnect (after 598554ms in state UP)","IsError":true,"IsLocalError":true,"IsBrokerError":false}

mhowlett commented 4 years ago

I think it's worth upgrading to v1.5 and seeing how that goes. there was a change to help mitigate this (use least idle connection when choosing a random broker connection for e.g. metadata requests), which might help a lot, though it probably won't completely mitigate. it would be interesting to see how that goes as a first step.

chrisaddams commented 4 years ago

Thanks @mhowlett we will try upgrading to 1.5 tomorrow. We're currently on 1.4.4. We are using a standard loadbalancer on AKS. TCP reset was disabled. We enabled that and ran a test as it was quicker than bumping the version, but it seems to have regressed the performance, our logs are now full of timeouts. eg:

%7|1597173460.023|BRKTERM|rdkafka#producer-166| [thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597173460.022|TERM|rdkafka#producer-166| [thrd:sasl_ssl://b9-{key}.westeurope.azure.confluent.cloud:9092/9]: sasl_ssl://b9-{key}.westeurope.azure.confluent.cloud:9092/9: Received TERMINATE op in state INIT: 2 refcnts, 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs
%7|1597173460.023|BROKERFAIL|rdkafka#producer-166| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: failed: err: Local: Broker handle destroyed: (errno: Resource temporarily unavailable)
%7|1597173460.022|TOPBRK|rdkafka#producer-166| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Topic event-case-upsert [1]: leaving broker (0 messages in xmitq, next broker (none), rktp 0x7f4c601952c0)
%7|1597173460.023|TOPBRK|rdkafka#producer-166| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Topic event-case-upsert [1]: no next broker, failing 0 message(s) in partition queue
%7|1597173460.023|TERM|rdkafka#producer-166| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Received TERMINATE op in state UP: 2 refcnts, 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs
%7|1597173460.023|BROKERFAIL|rdkafka#producer-166| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: failed: err: Local: Broker handle destroyed: (errno: Resource temporarily unavailable)
%7|1597173460.023|FAIL|rdkafka#producer-166| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Client is terminating (after 232ms in state UP)
%7|1597173460.023|STATE|rdkafka#producer-166| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Broker changed state UP -> DOWN
%7|1597173460.023|BRKTERM|rdkafka#producer-166| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)

And

%5|1597174395.886|REQTMOUT|rdkafka#producer-2| [thrd:sasl_ssl://b5-{key}.westeurope.azure.confluent.cloud:9092/5]: sasl_ssl://b5-{key}.westeurope.azure.confluent.cloud:9092/5: Timed out MetadataRequest in flight (after 60990ms, timeout #0)
%4|1597174395.886|REQTMOUT|rdkafka#producer-2| [thrd:sasl_ssl://b5-{key}.westeurope.azure.confluent.cloud:9092/5]: sasl_ssl://b5-{key}.westeurope.azure.confluent.cloud:9092/5: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
-- KAFKA PRODUCER ERROR --
{"Code":-185,"IsFatal":false,"Reason":"sasl_ssl://b5-{key}.westeurope.azure.confluent.cloud:9092/5: 1 request(s) timed out: disconnect (after 1197241ms in state UP)","IsError":true,"IsLocalError":true,"IsBrokerError":false}
%5|1597174397.380|REQTMOUT|rdkafka#consumer-3| [thrd:sasl_ssl://b0-{key}.westeurope.azure.confluent.cloud:9092/0]: sasl_ssl://b0-{key}.westeurope.azure.confluent.cloud:9092/0: Timed out MetadataRequest in flight (after 60758ms, timeout #0)
%4|1597174397.380|REQTMOUT|rdkafka#consumer-3| [thrd:sasl_ssl://b0-{key}.westeurope.azure.confluent.cloud:9092/0]: sasl_ssl://b0-{key}.westeurope.azure.confluent.cloud:9092/0: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
-- KAFKA CONSUMER ERROR --
{"Code":-185,"IsFatal":false,"Reason":"sasl_ssl://b0-{key}.westeurope.azure.confluent.cloud:9092/0: 1 request(s) timed out: disconnect (after 366336ms in state UP)","IsError":true,"IsLocalError":true,"IsBrokerError":false}
%5|1597174413.109|REQTMOUT|rdkafka#consumer-3| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Timed out FindCoordinatorRequest in flight (after 60733ms, timeout #0)
%4|1597174413.109|REQTMOUT|rdkafka#consumer-3| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
-- KAFKA CONSUMER ERROR --
{"Code":-185,"IsFatal":false,"Reason":"sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: 1 request(s) timed out: disconnect (after 494464ms in state UP)","IsError":true,"IsLocalError":true,"IsBrokerError":false}
%5|1597174457.742|REQTMOUT|rdkafka#consumer-3| [thrd:sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3]: sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3: Timed out MetadataRequest in flight (after 60361ms, timeout #0)
%4|1597174457.742|REQTMOUT|rdkafka#consumer-3| [thrd:sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3]: sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
-- KAFKA CONSUMER ERROR --
{"Code":-185,"IsFatal":false,"Reason":"sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3: 1 request(s) timed out: disconnect (after 660609ms in state UP)","IsError":true,"IsLocalError":true,"IsBrokerError":false}
%5|1597174473.357|REQTMOUT|rdkafka#consumer-3| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Timed out FindCoordinatorRequest in flight (after 60247ms, timeout #0)
%4|1597174473.357|REQTMOUT|rdkafka#consumer-3| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Timed out 1 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
-- KAFKA CONSUMER ERROR --
mhowlett commented 4 years ago
%7|1597173460.023|BRKTERM|rdkafka#producer-166| [thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)

are you seeing lines like the above with refcnt > 0 corresponding to the same broker many times (here thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4 is broker 4). I assume you have replaced your cluster with {key}.

if so that's a bug.

ReeceKindon commented 4 years ago

Ye we replaced our cluster with {key}.

Stepping back through the logs it seems to cycle through the Brokers, so we will see errors like the above with b4-{key} etc. Then once "Broker handle destroyed" a new Broker will report the same issue, this repeats for some time:

%7|1597216577.956|BRKTERM|rdkafka#producer-4| [thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597216577.956|TERMINATE|rdkafka#producer-4| [thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4: Handle is terminating in state DOWN: 1 refcnts (0x7fdd100b3f50), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597216577.956|BROKERFAIL|rdkafka#producer-4| [thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4: failed: err: Local: Broker handle destroyed: (errno: Success)

Then switches to b6 ...

%7|1597216577.956|FAIL|rdkafka#producer-4| [thrd:sasl_ssl://b6-{key}.westeurope.azure.confluent.cloud:9092/6]: sasl_ssl://b6-{key}.westeurope.azure.confluent.cloud:9092/6: Client is terminating (after 172ms in state UP)

These logs then follow ...

%7|1597216577.954|FAIL|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator: Client is terminating (after 800ms in state UP)
%7|1597216577.956|STATE|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state UP -> DOWN
%7|1597216577.956|BRKTERM|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator/3: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597216577.956|TERMINATE|rdkafka#producer-4| [thrd:TxnCoordinator]: TxnCoordinator/3: Handle is terminating in state DOWN: 1 refcnts (0x7fdd10009bb0), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf

Then back to b6

%7|1597216577.956|BRKTERM|rdkafka#producer-4| [thrd:sasl_ssl://b6-{key}.westeurope.azure.confluent.cloud:9092/6]: sasl_ssl://b6-{key}.westeurope.azure.confluent.cloud:9092/6: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
ReeceKindon commented 4 years ago

@mhowlett we have now upgraded to Confluent.Kafka v1.5 and with TCP reset still enabled, we have had some success in tests ran this morning. We could see all messages delivered as expected and reduced number of timeouts in the logs.

We did still encounter some transaction timeouts (see below), which is a concern. However, the SVC's were able to Abort the Transaction and seemingly recover. Do you think there is still a bug here?

%7|1597225038.122|TOPPAR|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: cmd-case-upsert [0] 7 message(s) queued but broker not up 
%7|1597225038.122|TOPPAR|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: cmd-case-upsert [3] 5 message(s) queued but broker not up 
%7|1597225038.122|SASLMECHS|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: Broker supported SASL mechanisms: PLAIN,OAUTHBEARER 
%7|1597225038.122|AUTH|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: Auth in state AUTH_HANDSHAKE (handshake supported) 
%7|1597225038.122|STATE|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: Broker changed state AUTH_HANDSHAKE -> AUTH_REQ 
%7|1597225038.122|TOPPAR|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: cmd-case-upsert [0] 7 message(s) queued but broker not up 
%7|1597225038.122|TOPPAR|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: cmd-case-upsert [3] 5 message(s) queued but broker not up 
%7|1597225038.138|TOPPAR|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: cmd-case-upsert [3] 5 message(s) queued but broker not up 
%7|1597225038.138|TOPPAR|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: cmd-case-upsert [0] 7 message(s) queued but broker not up 
%7|1597225038.138|STATE|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: Broker changed state AUTH_REQ -> UP 
%4|1597225156.781|MAXPOLL|rdkafka#consumer-2| [thrd:main]: Application maximum poll interval (120000ms) exceeded by 203ms (adjust max.poll.interval.ms for long-running message processing): leaving group 
[09:39:17 ERR] Error in CaseProcessor Transactional operation timed out, at Confluent.Kafka.Impl.SafeKafkaHandle.SendOffsetsToTransaction(IEnumerable`1 offsets, IConsumerGroupMetadata groupMetadata, Int32 millisecondsTimeout) 
 at Confluent.Kafka.Producer`2.SendOffsetsToTransaction(IEnumerable`1 offsets, IConsumerGroupMetadata groupMetadata, TimeSpan timeout) 
 at BristowSutorGroup.Confluent.Producer.CommitOffsetsAndTransaction(List`1 offsets, IConsumerGroupMetadata consumerGroupMetadata, Int32 transactionTimeout) 
 at CaseProcessor.ProducerHelper.CommitOffsetsAndTransaction(List`1 offsets, IConsumerGroupMetadata consumerGroupMetadata, Int32 transactionTimeout) in /src/Helpers/KafkaHelper.cs:line 102 
 at CaseProcessor.Program.MessageReceived(Object sender, MessageReceivedEventArgs message) in /src/Program.cs:line 146
%7|1597225157.413|TXNAPI|rdkafka#producer-3| [thrd:app]: Transactional API called: abort_transaction 
%7|1597225157.441|TXNCOMPLETE|rdkafka#producer-3| [thrd:main]: Transaction successfully aborted 
%7|1597225157.441|TXNSTATE|rdkafka#producer-3| [thrd:main]: Transaction state change AbortingTransaction -> Ready 
[09:39:17 INF] -- DISPOSE PRODUCER 
%7|1597225157.487|DESTROY|rdkafka#producer-3| [thrd:app]: Terminating instance (destroy flags none (0x0)) 
%7|1597225157.487|IDEMPSTATE|rdkafka#producer-3| [thrd:main]: Idempotent producer state change Assigned -> Terminate 
%7|1597225157.487|DESTROY|rdkafka#producer-3| [thrd:main]: Destroy internal 
%7|1597225157.487|DESTROY|rdkafka#producer-3| [thrd:main]: Removing all topics 
%7|1597225157.487|DESTROY|rdkafka#producer-3| [thrd:main]: Sending TERMINATE to sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3 
%7|1597225157.487|DESTROY|rdkafka#producer-3| [thrd:main]: Sending TERMINATE to sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7 
%7|1597225157.487|TOPBRK|rdkafka#producer-3| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Topic cmd-case-upsert [1]: leaving broker (0 messages in xmitq, next broker (none), rktp 0x7fa44c0be670) 
%7|1597225157.487|DESTROY|rdkafka#producer-3| [thrd:main]: Sending TERMINATE to sasl_ssl://b6-{key}.westeurope.azure.confluent.cloud:9092/6 
%7|1597225157.487|TOPBRK|rdkafka#producer-3| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Topic cmd-case-upsert [1]: no next broker, failing 0 message(s) in partition queue 
%7|1597225157.487|TOPBRK|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: Topic cmd-case-upsert [0]: leaving broker (0 messages in xmitq, next broker (none), rktp 0x7fa44c0be110) 
%7|1597225157.487|TOPBRK|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: Topic cmd-case-upsert [0]: no next broker, failing 0 message(s) in partition queue 
%7|1597225157.487|TOPBRK|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: Topic cmd-case-upsert [3]: leaving broker (0 messages in xmitq, next broker (none), rktp 0x7fa44c0bf130) 
%7|1597225157.487|TOPBRK|rdkafka#producer-3| [thrd:sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-{key}.westeurope.azure.confluent.cloud:9092/1: Topic cmd-case-upsert [3]: no next broker, failing 0 message(s) in partition queue 
%7|1597225157.487|TOPBRK|rdkafka#producer-3| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Topic cmd-case-upsert [5]: leaving broker (0 messages in xmitq, next broker (none), rktp 0x7fa44c0bfbf0) 
%7|1597225157.487|TOPBRK|rdkafka#producer-3| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Topic cmd-case-upsert [5]: no next broker, failing 0 message(s) in partition queue 
%7|1597225157.487|TERM|rdkafka#producer-3| [thrd:sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3]: sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3: Received TERMINATE op in state CONNECT: 2 refcnts, 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs 
%7|1597225157.487|TERM|rdkafka#producer-3| [thrd:sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7]: sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7: Received TERMINATE op in state INIT: 2 refcnts, 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs 
%7|1597225157.487|FAIL|rdkafka#producer-3| [thrd:sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3]: sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3: Client is terminating (after 120050ms in state CONNECT) (_DESTROY) 
%7|1597225157.487|STATE|rdkafka#producer-3| [thrd:sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3]: sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3: Broker changed state CONNECT -> DOWN 
%7|1597225157.487|BRKTERM|rdkafka#producer-3| [thrd:sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3]: sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s) 
%7|1597225157.487|TERMINATE|rdkafka#producer-3| [thrd:sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3]: sasl_ssl://b3-{key}.westeurope.azure.confluent.cloud:9092/3: Handle is terminating in state DOWN: 1 refcnts (0x7fa44c0ae350), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf 
%7|1597225157.487|FAIL|rdkafka#producer-3| [thrd:sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7]: sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7: Client is terminating (after 120443ms in state INIT) (_DESTROY) 
%7|1597225157.487|STATE|rdkafka#producer-3| [thrd:sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7]: sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7: Broker changed state INIT -> DOWN 
%7|1597225157.488|BRKTERM|rdkafka#producer-3| [thrd:sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7]: sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s) 
%7|1597225157.488|TERMINATE|rdkafka#producer-3| [thrd:sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7]: sasl_ssl://b7-{key}.westeurope.azure.confluent.cloud:9092/7: Handle is terminating in state DOWN: 1 refcnts (0x7fa44c0a02a0), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
ReeceKindon commented 4 years ago

Also, we have managed to replicate a timeout when calling "BeginTransaction" with the additional debug logs turned on. See below, again seems an issue connecting with the Broker.

We have continued testing with v1.5 and have had continued success, but are now having to add some defensive code to try and better handle these timeouts when occurring. Hopefully this will get us by for now, until we are able to identify the root cause.

%7|1597241285.214|TXNAPI|rdkafka#producer-7| [thrd:app]: Transactional API called: init_transactions
%7|1597241285.214|TXNSTATE|rdkafka#producer-7| [thrd:main]: Transaction state change Init -> WaitPID
%7|1597241285.214|IDEMPSTATE|rdkafka#producer-7| [thrd:main]: Idempotent producer state change Init -> RequestPID
%7|1597241285.214|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer (fire immediately): Starting idempotent producer
%7|1597241285.215|CONNECT|rdkafka#producer-7| [thrd:main]: Not selecting any broker for cluster connection: still suppressed for 48ms: acquire ProducerID
%7|1597241285.215|PIDBROKER|rdkafka#producer-7| [thrd:main]: No brokers available for Transactions (2 broker(s) known)
%7|1597241285.215|TXNCOORD|rdkafka#producer-7| [thrd:main]: Unable to query for transaction coordinator: No brokers available for Transactions (2 broker(s) known)
%7|1597241285.220|CONNECT|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Connecting to ipv4#51.105.121.2:9092 (sasl_ssl) with socket 128
%7|1597241285.237|CONNECT|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Connected to ipv4#51.105.121.2:9092
%7|1597241285.295|CONNECTED|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Connected (#1)
%7|1597241285.295|FEATURE|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Updated enabled protocol features +ApiVersion to ApiVersion
%7|1597241285.295|STATE|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1597241285.309|FEATURE|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Updated enabled protocol features to MsgVer1,ApiVersion,BrokerBalancedConsumer,ThrottleTime,Sasl,SaslHandshake,BrokerGroupCoordinator,LZ4,OffsetTime,MsgVer2,IdempotentProducer,ZSTD,SaslAuthReq
%7|1597241285.309|AUTH|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Auth in state APIVERSION_QUERY (handshake supported)
%7|1597241285.309|STATE|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Broker changed state APIVERSION_QUERY -> AUTH_HANDSHAKE
%7|1597241285.324|SASLMECHS|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Broker supported SASL mechanisms: PLAIN,OAUTHBEARER
%7|1597241285.324|AUTH|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Auth in state AUTH_HANDSHAKE (handshake supported)
%7|1597241285.324|STATE|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Broker changed state AUTH_HANDSHAKE -> AUTH_REQ
%7|1597241285.347|STATE|rdkafka#producer-7| [thrd:sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: Broker changed state AUTH_REQ -> UP
%7|1597241285.364|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b0-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/0: Added new broker with NodeId 0
%7|1597241285.364|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b0-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/0]: sasl_ssl://b0-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/0: Enter main broker thread
%7|1597241285.365|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b5-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/5: Added new broker with NodeId 5
%7|1597241285.365|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b5-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/5]: sasl_ssl://b5-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/5: Enter main broker thread
%7|1597241285.365|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b10-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/10: Added new broker with NodeId 10
%7|1597241285.365|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b10-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/10: Enter main broker thread
%7|1597241285.365|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b2-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/2: Added new broker with NodeId 2
%7|1597241285.365|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b2-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/2]: sasl_ssl://b2-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/2: Enter main broker thread
%7|1597241285.365|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b8-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/8: Added new broker with NodeId 8
%7|1597241285.366|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b9-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/9: Added new broker with NodeId 9
%7|1597241285.366|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b8-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/8]: sasl_ssl://b8-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/8: Enter main broker thread
%7|1597241285.366|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b11-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/11: Added new broker with NodeId 11
%7|1597241285.366|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b11-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/11: Enter main broker thread
%7|1597241285.366|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b4-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/4: Added new broker with NodeId 4
%7|1597241285.366|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b9-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/9]: sasl_ssl://b9-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/9: Enter main broker thread
%7|1597241285.366|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b1-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/1: Added new broker with NodeId 1
%7|1597241285.366|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b4-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/4: Enter main broker thread
%7|1597241285.366|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b6-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/6: Added new broker with NodeId 6
%7|1597241285.366|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b7-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/7: Added new broker with NodeId 7
%7|1597241285.367|BROKER|rdkafka#producer-7| [thrd:main]: sasl_ssl://b3-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/3: Added new broker with NodeId 3
%7|1597241285.367|CLUSTERID|rdkafka#producer-7| [thrd:main]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: ClusterId update "" -> "lkc-l5208"
%7|1597241285.367|CONTROLLERID|rdkafka#producer-7| [thrd:main]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: ControllerId update -1 -> 4
%7|1597241285.367|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b3-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/3]: sasl_ssl://b3-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/3: Enter main broker thread
%7|1597241285.367|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b1-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/1]: sasl_ssl://b1-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/1: Enter main broker thread
%7|1597241285.367|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b7-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/7]: sasl_ssl://b7-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/7: Enter main broker thread
%7|1597241285.367|BRKMAIN|rdkafka#producer-7| [thrd:sasl_ssl://b6-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/6]: sasl_ssl://b6-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/6: Enter main broker thread
%7|1597241285.382|TXNCOORD|rdkafka#producer-7| [thrd:main]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: FindCoordinator response: Transaction coordinator is broker 5 (b5-pkc-lq8gm.westeurope.azure.confluent.cloud:9092)
%7|1597241285.382|TXNCOORD|rdkafka#producer-7| [thrd:main]: Transaction coordinator changed from (none) -> sasl_ssl://b5-pkc-lq8gm.westeurope.azure.confluent.cloud:9092/5: FindCoordinator response
%7|1597241285.382|NODENAME|rdkafka#producer-7| [thrd:main]: TxnCoordinator: Broker nodename changed from "" to "b5-pkc-lq8gm.westeurope.azure.confluent.cloud:9092"
%7|1597241285.382|IDEMPSTATE|rdkafka#producer-7| [thrd:main]: Idempotent producer state change RequestPID -> WaitTransport
%7|1597241285.382|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241285.382|FAIL|rdkafka#producer-7| [thrd:TxnCoordinator]: TxnCoordinator/5: Closing connection due to nodename change (after 168ms in state TRY_CONNECT) (_TRANSPORT)
%7|1597241285.382|STATE|rdkafka#producer-7| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state TRY_CONNECT -> DOWN
%7|1597241285.382|STATE|rdkafka#producer-7| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state DOWN -> INIT
%7|1597241285.382|STATE|rdkafka#producer-7| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state INIT -> TRY_CONNECT
%7|1597241285.382|CONNECT|rdkafka#producer-7| [thrd:TxnCoordinator]: TxnCoordinator/5: broker in state TRY_CONNECT connecting
%7|1597241285.382|STATE|rdkafka#producer-7| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state TRY_CONNECT -> CONNECT
%7|1597241285.392|CONNECT|rdkafka#producer-7| [thrd:TxnCoordinator]: TxnCoordinator/5: Connecting to ipv4#51.105.116.3:9092 (sasl_ssl) with socket 208
%7|1597241285.735|TXNCOORD|rdkafka#producer-7| [thrd:main]: sasl_ssl://pkc-lq8gm.westeurope.azure.confluent.cloud:9092/bootstrap: FindCoordinator response: Transaction coordinator is broker 5 (b5-pkc-lq8gm.westeurope.azure.confluent.cloud:9092)
%7|1597241285.882|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241286.382|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241286.882|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241287.382|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241287.882|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241288.382|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241288.882|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241289.382|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241289.882|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241290.382|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241290.882|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241291.382|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241291.883|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241292.383|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241292.883|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241293.383|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241293.883|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241294.383|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241294.883|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241295.383|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241295.883|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241296.383|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241296.883|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241297.383|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241297.883|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241298.383|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241298.883|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241299.383|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241299.883|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241300.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241300.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241301.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241301.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241302.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241302.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241303.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241303.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241304.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241304.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241305.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241305.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241306.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241306.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241307.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241307.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241308.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241308.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241309.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241309.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241310.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241310.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241311.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241311.884|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241312.384|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241312.885|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241313.385|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241313.885|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241314.385|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241314.885|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241315.385|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241315.885|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241316.385|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241316.885|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241317.385|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241317.885|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241318.385|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241318.885|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241319.385|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241319.885|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241320.385|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241320.885|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241321.385|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241321.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241322.386|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241322.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241323.386|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241323.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241324.386|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241324.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241325.386|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241325.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241326.386|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241326.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241327.386|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241327.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241328.386|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241328.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241329.386|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241329.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241330.386|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241330.886|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241331.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241331.887|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241332.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241332.887|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241333.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241333.887|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241334.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241334.887|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241335.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241335.887|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241336.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241336.887|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241337.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241337.887|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241338.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241338.887|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241339.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241339.887|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241340.387|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241340.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241341.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241341.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241342.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241342.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241343.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241343.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241344.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241344.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241345.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241345.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241346.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241346.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241347.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241347.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241348.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241348.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241349.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241349.888|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241350.388|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241350.889|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241351.389|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241351.889|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241352.389|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241352.889|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241353.389|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241353.889|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241354.389|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241354.889|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241355.389|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241355.889|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241356.389|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241356.889|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241357.389|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241357.890|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241358.390|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241358.890|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241359.390|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241359.890|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241360.390|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241360.890|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241361.390|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241361.890|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241362.390|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241362.890|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241363.390|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241363.890|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241364.390|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241364.890|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241365.390|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241365.891|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241366.391|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241366.891|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241367.391|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241367.891|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241368.391|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241368.891|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241369.391|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241369.891|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241370.391|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241370.891|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241371.391|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241371.891|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241372.391|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241372.891|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241373.391|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241373.891|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241374.392|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241374.892|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241375.392|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241375.892|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241376.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241376.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241377.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241377.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241378.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241378.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241379.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241379.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241380.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241380.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241381.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241381.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241382.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241382.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241383.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241383.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241384.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241384.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241385.393|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241385.893|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241386.394|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241386.894|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241387.394|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241387.894|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241388.394|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241388.894|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241389.394|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241389.894|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241390.394|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241390.894|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241391.394|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241391.894|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241392.394|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241392.894|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241393.394|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241393.894|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241394.394|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241394.895|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241395.395|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241395.895|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241396.395|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241396.895|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241397.395|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241397.895|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241398.395|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241398.895|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241399.395|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241399.895|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241400.395|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241400.895|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241401.395|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241401.895|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241402.395|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241402.896|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241403.396|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241403.896|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241404.396|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
%7|1597241404.896|TXN|rdkafka#producer-7| [thrd:main]: Starting PID FSM timer: No broker available
General exception: Confluent.Kafka.KafkaRetriableException: Failed to initialize Producer ID: Local: Timed out
   at Confluent.Kafka.Impl.SafeKafkaHandle.InitTransactions(Int32 millisecondsTimeout)
   at Confluent.Kafka.Producer`2.InitTransactions(TimeSpan timeout)
mhowlett commented 4 years ago

yeah, connection problem:

%7|1597241285.382|STATE|rdkafka#producer-7| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state TRY_CONNECT -> CONNECT
%7|1597241285.392|CONNECT|rdkafka#producer-7| [thrd:TxnCoordinator]: TxnCoordinator/5: Connecting to ipv4#51.105.116.3:9092 (sasl_ssl) with socket 208

then nothing. does seem consistent with librdkafka #2845 to me. @edenhill might like to comment (note: he's OOTO until next week).

chrisaddams commented 4 years ago

Hi @mhowlett

We also have the hilariously named refcnt issue too:

%7|1597249316.999|STATE|rdkafka#producer-1071| [thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4: Broker changed state INIT -> DOWN
%7|1597249316.999|BRKTERM|rdkafka#producer-1071| [thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597249316.999|TERMINATE|rdkafka#producer-1071| [thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4: Handle is terminating in state DOWN: 1 refcnts (0x7f83cc0d53b0), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597249316.999|FAIL|rdkafka#producer-1071| [thrd:sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4]: sasl_ssl://b4-{key}.westeurope.azure.confluent.cloud:9092/4: Broker handle is terminating (after 0ms in state DOWN) (_DESTROY)
%7|1597249316.998|FAIL|rdkafka#producer-1071| [thrd:sasl_ssl://b0-{key}.westeurope.azure.confluent.cloud:9092/0]: sasl_ssl://b0-{key}.westeurope.azure.confluent.cloud:9092/0: Client is terminating (after 672ms in state INIT) (_DESTROY)

Is there anyone who is ITO that might be able to shed a bit more light on the connection problem? this is causing us quite a few production headaches at the moment. We're working to write more defensive code to recover when we're in this state, as restarting our pods forces a rebalance and everything processes. But there are occasions when we seem to drop messages in the midst of this...

mhowlett commented 4 years ago

to clarify, you see the refcnt log message in an infinite loop?

have you tried configuring SocketTimeoutMs down to a more acceptable timeout (maybe 10000ms is workable?), and does this help? are you able to change your code to allow you to continue without re-creating the producer / consumer instances (and did you needed to do more logic over what we've done in the Transaction example, if so, please let me know, if you can).

also feel free to provide the complete set of producer and consumer config settings (sometimes we might see something relevant, particularly if you're tweaking a lot of things, in general, don't tweak anything unless you're sure you need to).

i can continue to give ideas, magnus just wrote all this so will more likely have better ideas/more context off the top of his head.

mhowlett commented 4 years ago

i'd also be surprised if you're getting dropped messages for transactions confirmed as committed - it would be hard for that to occur... it seems more likely to be an error in logic (but won't rule anything our as a possibility).

chrisaddams commented 4 years ago

The refcnt happens as it cycles through brokers, yes. The above was just a snippet.

I've pulled a full list of config - in fact the methods with the config from our library. We haven't tweaked too much, basically what we've been advised to change before whilst trying to achieve Exactly Once delivery.

Our code only recreates consumers/producers if the connections fail, generally they will just continue to process transactions until it drops a connection and stop working. Then they retry, and if that fails they restart - which mostly fixes things... we're still working on this.

We can't be sure exactly where the dropped messages occur, but we did get rid of one bug in our code today where this was occurring. One service was receiving a message and sending additional messages on to two other topics. It was not waiting for both to complete before committing the transaction. We've fixed this issue, and hadn't seen it again until about an hour ago, but seemingly not in this same service. So, as you say - wouldn't rule out a logic issue...

It does seem that all our services go into loops where they try all brokers in the round, consistently receiving a "broker down" response.

Here's our consumer config:

            ConsumerConfig consumerConfig = new ConsumerConfig(){
                GroupId = _groupId,
                BootstrapServers = _brokerList,
                AutoOffsetReset = _autoOffsetReset,
                SecurityProtocol = SecurityProtocol.SaslSsl,
                SaslMechanism = SaslMechanism.Plain,
                SaslUsername = _username,
                SaslPassword = _password,
                SslCaLocation = _caCertLocation,
                StatisticsIntervalMs = 5000,
                SessionTimeoutMs = 10000,
                EnablePartitionEof = true,
                EnableAutoCommit = autoCommit
            };           

            //Add exactlyOnce config if true
            if(exactlyOnce)
            {
                consumerConfig.IsolationLevel = IsolationLevel.ReadCommitted;

                //Reduced MaxPoll - Catch timeouts/failures and re-balance sooner
                consumerConfig.MaxPollIntervalMs = 120000;
            }

Here's our producer config:

            ProducerConfig producerConfig = new ProducerConfig()
            {
                BootstrapServers = _brokerList,
                SecurityProtocol = SecurityProtocol.SaslSsl,
                SaslMechanism = SaslMechanism.Plain,
                SaslUsername = _username,
                SaslPassword = _password,
                SslCaLocation = _caCertLocation
            };

            //Add exactlyOnce config if true
            if(exactlyOnce)
            {
                producerConfig.EnableIdempotence = true;
                producerConfig.MessageSendMaxRetries = 1000;
                producerConfig.TransactionalId = transactionalId;

                //Handling timeouts
                producerConfig.TransactionTimeoutMs = transactionTimeout;
                producerConfig.StatisticsIntervalMs = 30000; //Statistics used to catch Txn Timeouts
            }
mhowlett commented 4 years ago

configs look fine, yes.

It doesn't seem, from what you're said, that we're hitting a bug in librdkafka "SVC's were able to Abort the Transaction and seemingly recover.", unless something in librdkafka is responsible for the connectivity problems (it may well be). I don't know why these are occurring, but given the improvement moving to v1.5, it seems related to idle connections. i have also seen a similar, though not identical network timeout problem recently (not on azure). nothing to go on in the logs - the connection clearly has an issue, but the client is not aware that it does. we didn't get to the root cause in that case, just a fix (which was to use config properties that made sense).

looking at the server metrics for that cluster, I see it was a bit degraded at the time of these logs: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1381#issuecomment-672913423 (under replicated partitions and associated things going on), but it's resolved by the time of these logs: https://github.com/confluentinc/confluent-kafka-dotnet/issues/1381#issuecomment-672913423 (time-period of under-replicated partitions PST 5am -> 9am). I mention in case you correlate that with anything. Don't see anything else notable over the last week.

I'm still not clear if there is a refcnt bug - if the client isn't hanging indefinitely on shutdown there isn't (otherwise it's legitimately waiting for something that takes a long time). sorry probably not asking the right questions there...

AFAICT you are in a workable state? (except the timeouts are not ideal). The next step I want to take here is chat with magnus.

please let us know if the v1.5 client is getting into an unusable state.

chrisaddams commented 4 years ago

We're just cycling through connection issues like the below. Are you suggesting these are not to do with the 1.5 client? It seems unusable to us at this point, though better than 1.4.4.

[10:47:01 INF] CaseDebtorData-Upsert-BSVC MicroService Started Running
%7|1597229221.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229221.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
[10:47:01 INF] Partitions Assigned: [{"Topic":"cmd-casedebtordata-upsert","Partition":{"Value":3,"IsSpecial":false}}], Pod: 8c6d8fe4-687d-45d7-a9b8-d2980576363d
[10:47:02 INF] Message received: {"Topic": "cmd-casedebtordata-upsert", "Partition": {"Value": 3, "IsSpecial": false, "$type": "Partition"}, "Offset": {"Value": 165313, "IsSpecial": false, "$type": "Offset"}, "TopicPartition": {"Topic": "cmd-casedebtordata-upsert", "Partition": {"Value": 3, "IsSpecial": false, "$type": "Partition"}, "$type": "TopicPartition"}, "TopicPartitionOffset": {"Topic": "cmd-casedebtordata-upsert", "Partition": {"Value": 3, "IsSpecial": false, "$type": "Partition"}, "Offset": {"Value": 165313, "IsSpecial": false, "$type": "Offset"}, "TopicPartition": {"Topic": "cmd-casedebtordata-upsert", "Partition": {"Value": 3, "IsSpecial": false, "$type": "Partition"}, "$type": "TopicPartition"}, "$type": "TopicPartitionOffset"}, "Message": null, "Key": "The property accessor threw an exception: MessageNullException", "Value": "The property accessor threw an exception: MessageNullException", "Timestamp": "The property accessor threw an exception: MessageNullException", "Headers": "The property accessor threw an exception: MessageNullException", "IsPartitionEOF": true, "$type": "ConsumeResult`2"}
[10:47:02 INF] Current partitions: cmd-casedebtordata-upsert-3
Current offsets: 165313
%7|1597229222.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229222.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229223.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229223.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229224.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229224.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229225.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229225.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229226.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229226.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229227.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229227.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229228.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229228.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229229.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229229.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229230.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229230.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229231.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229231.853|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229232.854|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229232.854|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect
%7|1597229233.854|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting

Or this:

%7|1597249314.383|FAIL|rdkafka#producer-115| [thrd:TxnCoordinator]: TxnCoordinator/6: Client is terminating (after 832ms in state UP) (_DESTROY)
%7|1597249314.383|STATE|rdkafka#producer-115| [thrd:TxnCoordinator]: TxnCoordinator: Broker changed state UP -> DOWN
%7|1597249314.383|BRKTERM|rdkafka#producer-115| [thrd:TxnCoordinator]: TxnCoordinator/6: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597249314.383|TERMINATE|rdkafka#producer-115| [thrd:TxnCoordinator]: TxnCoordinator/6: Handle is terminating in state DOWN: 1 refcnts (0x7f65280b9ac0), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597249314.383|FAIL|rdkafka#producer-115| [thrd:TxnCoordinator]: TxnCoordinator/6: Broker handle is terminating (after 0ms in state DOWN) (_DESTROY)
%7|1597249314.378|TOPBRK|rdkafka#producer-115| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Topic event-case-upsert [1]: leaving broker (0 messages in xmitq, next broker (none), rktp 0x7f65280909c0)
%7|1597249314.383|TOPBRK|rdkafka#producer-115| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Topic event-case-upsert [1]: no next broker, failing 0 message(s) in partition queue
%7|1597249314.383|TERM|rdkafka#producer-115| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Received TERMINATE op in state UP: 2 refcnts, 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs
%7|1597249314.383|FAIL|rdkafka#producer-115| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Client is terminating (after 178ms in state UP) (_DESTROY)
%7|1597249314.383|STATE|rdkafka#producer-115| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Broker changed state UP -> DOWN
%7|1597249314.383|BRKTERM|rdkafka#producer-115| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597249314.383|TERMINATE|rdkafka#producer-115| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Handle is terminating in state DOWN: 1 refcnts (0x7f652868c890), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597249314.383|FAIL|rdkafka#producer-115| [thrd:sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b10-{key}.westeurope.azure.confluent.cloud:9092/10: Broker handle is terminating (after 0ms in state DOWN) (_DESTROY)
%7|1597249314.378|TOPBRK|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Topic event-case-upsert [3]: leaving broker (0 messages in xmitq, next broker (none), rktp 0x7f6528020170)
%7|1597249314.384|TOPBRK|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Topic event-case-upsert [3]: no next broker, failing 0 message(s) in partition queue
%7|1597249314.384|TOPBRK|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Topic event-case-upsert [4]: leaving broker (0 messages in xmitq, next broker (none), rktp 0x7f65280008d0)
%7|1597249314.384|TOPBRK|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Topic event-case-upsert [4]: no next broker, failing 0 message(s) in partition queue
%7|1597249314.384|TERM|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Received TERMINATE op in state UP: 2 refcnts, 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs
%7|1597249314.384|FAIL|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Client is terminating (after 151ms in state UP) (_DESTROY)
%7|1597249314.384|STATE|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Broker changed state UP -> DOWN
%7|1597249314.384|BRKTERM|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597249314.384|TERMINATE|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Handle is terminating in state DOWN: 1 refcnts (0x7f65280a4dc0), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597249314.384|FAIL|rdkafka#producer-115| [thrd:sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/]: sasl_ssl://b11-{key}.westeurope.azure.confluent.cloud:9092/11: Broker handle is terminating (after 0ms in state DOWN) (_DESTROY)
%7|1597249314.384|FAIL|rdkafka#producer-115| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Client is terminating (after 1035ms in state UP) (_DESTROY)
%7|1597249314.384|STATE|rdkafka#producer-115| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Broker changed state UP -> DOWN
%7|1597249314.384|BRKTERM|rdkafka#producer-115| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: terminating: broker still has 2 refcnt(s), 0 buffer(s), 0 partition(s)
%7|1597249314.384|TERMINATE|rdkafka#producer-115| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Handle is terminating in state DOWN: 1 refcnts (0x16216a0), 0 toppar(s), 0 active toppar(s), 0 outbufs, 0 waitresps, 0 retrybufs: failed 0 request(s) in retry+outbuf
%7|1597249314.385|FAIL|rdkafka#producer-115| [thrd:sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/boot]: sasl_ssl://{key}.westeurope.azure.confluent.cloud:9092/bootstrap: Broker handle is terminating (after 0ms in state DOWN) (_DESTROY)

The concerning thing is that these issues have appeared for us on all environments without any major code changes on our part. Has Confluent Cloud had a version bump to any of its components?

It's not like it's waiting for something, or hanging, it's trying to connect and not finding any brokers.

If we have 6 partitions and 6 pods running, each listening to a single partition we fall into a state where 5 have read all their messages and 1 will be 100 messages behind. We look into the logs and we get results like the above logs. We restart any of the 6 pods, doesn't have to be the one with the backlog, a rebalance will occur, all services recover and the messages will be consumed.

For the most part we can identify some of these issues programmatically and recover. Unfortunately, we do occasionally lose messages through this, process and ours is a business where not a single message can be lost. Perhaps this is to do with a commit that is missed, or partially made? unsure at the moment.

So right now, we are having to perform manual interventions on our production environment and although our staging environment is better for the TCP reset on the loadbalancer, upgrade to 1.5 and some defensive recovery logic we're still seeing timeouts and various connection error messages across the board.

mhowlett commented 4 years ago

This to me looks like a network problem not related to librdkafka:

%7|1597229222.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker in state TRY_CONNECT connecting
%7|1597229222.852|CONNECT|rdkafka#producer-1| [thrd:TxnCoordinator]: TxnCoordinator: broker has no address yet: postponing connect

It also doesn't seem consistent with the 'azure load balancer' problem we've seen previously.

The second logs look like you're explicitly disposing the producer (so normal).

If this has suddenly started happening, it does seem likely something has changed in confluent cloud. I'm not over what they are doing, or if there have been recent changes - I think at this point can you open a support ticket? (we should probably be doing this via that channel anyway) That way you will get ccloud people in the loop who have that context. Explicitly include that you have been talking to me (Matt Howlett) and a link to this issue.

In order to commit offsets, just confirming you are using Producer.SendOffsetsToTransaction (and you are not using Consumer.Commit). The latter should not be used at all if using transactions. Also check the offset calculations are right (see Transactions example for use of SendOffsetsToTransaction). If all is good, to proceed on the missing messages problem, the next thing I think is probably to start doing some logging of offsets from delivery reports together with something to associate with them in the messages to better pin-point the problem in relation to when offsets are committed. We can also do a quick code review if you can provide code (via support channels). As I said, my base case here is definitely not a bug in librdkafka (but not counting anything out).

chrisaddams commented 4 years ago

Thanks for your help so far.

We actually already have a support ticket open: https://support.confluent.io/hc/en-us/requests/38824 and it does include reference to this github issue also. We haven't seen much movement on this ticket in the last 24hrs.

We'd be happy to go through a code review 👍 every little helps.

Will take a look at those offset commits tomorrow. We have seen 1 or 2 blank message remaining on the queue for each partition after a spike of load? perhaps this could be related? as I'd expect to get back to 0 rather than be left with a blank offset marker?

mhowlett commented 4 years ago

added a note to the support ticket.

you've found the backdoor to the engineers here :-). not 100% reliable though.

mhowlett commented 4 years ago

could this be related to something tricky Kubernetes is doing with the network? is there a way for you to easily count that out?

chrisaddams commented 4 years ago

We have tried all we can to validate any networking issue on the kubernetes side and we're not seeing anything to suggest it's an issue there.

I uploaded some full log files to the support ticket earlier, which are surprising reading. E.g. 1k etc lines of trying to talk to a single broker before it finally connects.

mhowlett commented 2 years ago

going through cleaning up issues and i'm closing this because there have been a lot of transaction related fixes post 1.7.0. And there is another (bug in KIP-360 impl.) coming in v1.10. suspect this relates to something that has now been fixed.