confluentinc / confluent-kafka-dotnet

Confluent's Apache Kafka .NET client
https://github.com/confluentinc/confluent-kafka-dotnet/wiki
Apache License 2.0

MSK connectivity issue during AWS Security Patch Updates #2164

Open DevOnRun opened 6 months ago

DevOnRun commented 6 months ago

Description

The consumer fails to consume events from Kafka on AWS MSK while security patch updates are being applied.

How to reproduce

  1. Launch a consumer application that uses AWS MSK as its Kafka infrastructure.
  2. Wait for MSK to roll out an update or apply a security patch automatically, or apply one manually if possible.

Additional Details

On further debugging, the Error.Code returned is Local_Transport.

Checklist

Program: Basic Consumer application (regularly consumes events)
Confluent.Kafka nuget version: 2.2.0
Apache Kafka version: 2.8.1
Client configuration: EnableAutoCommit = false; EnableAutoOffsetStore = false;

Info Logs:

ssl://b-1.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/1: Connect to ipv4#<ip-address-1>:9094 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
[thrd:ssl://b-1.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaw]: ssl://b-1.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/1: Connect to ipv4#<ip-address-1>:9094 failed: Connection refused (after 0ms in state CONNECT, 1 identical error(s) suppressed)
2/2 brokers are down
ssl://b-1.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/1: Disconnected: verify that security.protocol is correctly configured, broker might require SASL authentication (after -1616061376ms in state UP)
GroupCoordinator: b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094: Connect to ipv4#<ip-address-3>:9094 failed: Connection refused (after 1ms in state CONNECT, 1 identical error(s) suppressed)
[thrd:GroupCoordinator]: GroupCoordinator: b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094: Connect to ipv4#<ip-address-3>:9094 failed: Connection refused (after 1ms in state CONNECT, 1 identical error(s) suppressed)
ssl://b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/2: Connect to ipv4#<ip-address-3>:9094 failed: Connection refused (after 1ms in state CONNECT, 1 identical error(s) suppressed)
[thrd:ssl://b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaw]: ssl://b-2.devmsk.<unique-id-1>.c17.kafka.ap-south-1.amazonaws.com:9094/2: Connect to ipv4#<ip-address-3>:9094 failed: Connection refused (after 1ms in state CONNECT, 1 identical error(s) suppressed)


DevOnRun commented 5 months ago

Similar unanswered Issues:

prashantalhat commented 5 months ago

Any updates on this? I am facing the same issue during security patching of an MSK cluster.

anchitj commented 4 months ago

Was the broker reachable? Can you provide more logs?

DevOnRun commented 4 months ago

After extending the SetLogHandler() and SetErrorHandler() implementations to log the remaining properties, I found:

NOTE: All of the above logs were produced at log level Info, Warning, or Error. Nothing additional is produced even after enabling the Debug log level.

neerajk22 commented 3 months ago

@anchitj I am getting the same error after MSK patching activity, and afterwards the Kafka client (consumer code) is not able to connect again; the only option is to restart the pods (service). Kindly let me know if there is any way to configure the consumer code to re-initiate the connection.

%4|1710193417.943|FAIL|rdkafka#consumer-9886| [thrd:sasl_ssl://brokeraddress.amazonaws.com]: sasl_ssl://b-3.msk-wbrokeraddress.amazonaws.com:9096/3: Connection setup timed out in state APIVERSION_QUERY (after 29924ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
%4|1710193447.946|FAIL|rdkafka#consumer-9886| [thrd:sasl_ssl://b-2.brokeraddress.amazonaws.com]: sasl_ssl://b-2.msk-wbroker.amazonaws.com:9096/2: Connection setup timed out in state APIVERSION_QUERY (after 29912ms in state APIVERSION_QUERY, 1 identical error(s) suppressed)
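
For anyone looking for a stopgap to the pod restart described above, here is a rough sketch of rebuilding the consumer inside the process once all brokers have been reported down for too long. This is not a confirmed fix and not something the maintainers have recommended in this thread. The class name, the `MaxOutage` threshold, and the method parameters are hypothetical, and the message processing is left as a comment.

```csharp
using System;
using System.Threading;
using Confluent.Kafka;

static class RebuildingConsumer
{
    // Hypothetical threshold: how long "all brokers down" is tolerated
    // before the consumer instance is thrown away and rebuilt.
    static readonly TimeSpan MaxOutage = TimeSpan.FromMinutes(5);

    public static void Run(ConsumerConfig config, string topic, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            long downSinceTicks = 0; // 0 means "brokers not known to be down"

            using var consumer = new ConsumerBuilder<Ignore, string>(config)
                .SetErrorHandler((_, e) =>
                {
                    // Remember when the outage was first reported.
                    if (e.Code == ErrorCode.Local_AllBrokersDown)
                        Interlocked.CompareExchange(ref downSinceTicks, DateTime.UtcNow.Ticks, 0);
                })
                .Build();

            consumer.Subscribe(topic);

            while (!ct.IsCancellationRequested)
            {
                try
                {
                    var result = consumer.Consume(TimeSpan.FromSeconds(1));
                    if (result != null)
                    {
                        // Messages are flowing again, so clear the outage marker.
                        Interlocked.Exchange(ref downSinceTicks, 0);
                        // ... process result.Message.Value here ...
                        consumer.Commit(result);
                        continue;
                    }
                }
                catch (KafkaException ex)
                {
                    Console.WriteLine($"Kafka error {ex.Error.Code}: {ex.Error.Reason}");
                }

                long since = Interlocked.Read(ref downSinceTicks);
                if (since != 0 && DateTime.UtcNow.Ticks - since > MaxOutage.Ticks)
                    break; // leave the inner loop; the outer loop builds a new consumer
            }
        }
    }
}
```

The old instance is disposed by `using var` at the end of each outer iteration before a new one is built. Whether a fresh instance actually reconnects where the old one did not is an assumption that would need to be verified against a real patching window.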

anchitj commented 2 months ago

The client should keep retrying on its own, and this error should be transient. Please try to reproduce it once more with Debug="all" and upload the logs here.
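
A minimal sketch of what that could look like, assuming a setup similar to the one in the report. The broker addresses, group id, and topic name are placeholders, and `SecurityProtocol.Ssl` is inferred from the `ssl://` broker URLs in the logs. `Debug` is the config property that maps to librdkafka's `debug` setting, which is separate from the log level passed to the log handler.

```csharp
using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "<broker-1>:9094,<broker-2>:9094", // placeholder
    GroupId = "example-group",                             // placeholder
    SecurityProtocol = SecurityProtocol.Ssl,               // assumed from the ssl:// URLs
    EnableAutoCommit = false,
    EnableAutoOffsetStore = false,
    Debug = "all" // enable all librdkafka debug contexts
};

using var consumer = new ConsumerBuilder<Ignore, string>(config)
    // Receives librdkafka log lines, including debug output once Debug is set.
    .SetLogHandler((_, log) =>
        Console.WriteLine($"{DateTime.UtcNow:o} {log.Level} {log.Facility} {log.Name}: {log.Message}"))
    // Receives client-level errors such as Local_Transport or Local_AllBrokersDown.
    .SetErrorHandler((_, e) =>
        Console.WriteLine($"ERROR {e.Code} fatal={e.IsFatal}: {e.Reason}"))
    .Build();

consumer.Subscribe("example-topic"); // placeholder topic
```

Debug output at this level is verbose, so it is probably best captured to a file for the duration of a patching window rather than left on permanently.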