DataDog / integrations-core

Core integrations of the Datadog Agent
BSD 3-Clause "New" or "Revised" License

[BUG] KafkaError{code=_TRANSPORT,val=-195,str="Failed to get metadata: Local: Broker transport failure"} #17914

Open froque opened 1 week ago

froque commented 1 week ago

Agent Environment

$ sudo datadog-agent version 
Agent 7.54.1 - Commit: 44d1992 - Serialization version: v5.0.114 - Go version: go1.21.9

Describe what happened:

After upgrading to 7.54.0, Kafka consumer lag checks started to fail

Describe what you expected:

Expected Datadog Agent to continue to get Kafka consumer lag offsets from Kafka cluster.

Steps to reproduce the issue:

instances:

Additional environment details (Operating System, Cloud provider, etc):

froque commented 1 week ago

As a workaround, either disabling tls_verify or setting tls_ca_cert explicitly works:

$ tail -n2 /etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
    tls_verify: false
    tls_ca_cert: /opt/datadog-agent/embedded/ssl/certs/cacert.pem
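
For readers comparing this with the client-side settings, the two workaround options roughly correspond to librdkafka client properties. The sketch below is a hypothetical helper (`build_tls_overrides` is not part of the Agent) and assumes that `tls_verify` maps to `enable.ssl.certificate.verification` and `tls_ca_cert` to `ssl.ca.location`; the check's actual mapping may differ.

```python
def build_tls_overrides(tls_verify=True, tls_ca_cert=None):
    """Hypothetical helper: translate the two workaround options into
    librdkafka property names (assumed mapping, not taken from the Agent)."""
    overrides = {
        # librdkafka expects the strings "true"/"false" for boolean properties.
        "enable.ssl.certificate.verification": "true" if tls_verify else "false",
    }
    if tls_ca_cert:
        # An explicit CA bundle means the compiled-in default paths are not needed.
        overrides["ssl.ca.location"] = tls_ca_cert
    return overrides


if __name__ == "__main__":
    ca_bundle = "/opt/datadog-agent/embedded/ssl/certs/cacert.pem"
    for key, value in build_tls_overrides(tls_ca_cert=ca_bundle).items():
        print(f"{key}={value}")
```

Setting the CA bundle explicitly is the safer of the two options, since it keeps certificate verification enabled.
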
FlorentClarret commented 1 week ago

Hello @froque! Thanks for opening this issue and for the workaround. I'm going to transfer the issue to integrations-core because this is where the integrations live. I'll let the team know so they can take care of it.

HadhemiDD commented 6 days ago

@froque can you open a support case? Also, you can use the script in tests/python_client/script.py to run a barebones connection directly to the cluster for debugging. This script will attempt a connection and then fetch all of the consumer groups for that configuration. Please include it with the support case along with a Debug flare.

froque commented 6 days ago
$ /opt/datadog-agent/embedded/bin/python script.py 
bootstrap.servers=<redacted>
socket.timeout.ms=5000
client.id=dd-agent
security.protocol=sasl_ssl
ssl.endpoint.identification.algorithm=none
enable.ssl.certificate.verification=true
sasl.mechanism=PLAIN
sasl.username=<redacted>
sasl.password=*****
Connecting to AdminClient
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239854.080|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239854.081|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239854.081|FAIL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: SSL handshake failed: error:0A000086:SSL routines::certificate verify failed: broker certificate could not be verified, verify that ssl.ca.location is correctly configured or root CA certificates are installed (install ca-certificates package) (after 34ms in state SSL_HANDSHAKE)
%3|1719239855.009|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239855.009|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:16000069:STORE routines::unregistered scheme: scheme=file
%3|1719239855.010|SSL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: error:80000002:system library::No such file or directory: calling stat(/usr/local/ssl/certs)
%3|1719239855.010|FAIL|dd-agent#producer-1| [thrd:sasl_ssl://<redacted>:9092/bootstr]: sasl_ssl://<redacted>:9092/bootstrap: SSL handshake failed: error:0A000086:SSL routines::certificate verify failed: broker certificate could not be verified, verify that ssl.ca.location is correctly configured or root CA certificates are installed (install ca-certificates package) (after 32ms in state SSL_HANDSHAKE, 1 identical error(s) suppressed)
^CTraceback (most recent call last):
  File "/home/pminds/script.py", line 87, in <module>
    main()
  File "/home/pminds/script.py", line 80, in main
    results = future.result()
              ^^^^^^^^^^^^^^^
  File "/opt/datadog-agent/embedded/lib/python3.11/concurrent/futures/_base.py", line 451, in result
    self._condition.wait(timeout)
  File "/opt/datadog-agent/embedded/lib/python3.11/threading.py", line 327, in wait
    waiter.acquire()
KeyboardInterrupt

From what I have already explored, it seems that in version v7.54.0 it expects a file in /usr/local/ssl/certs and not in /opt/datadog-agent/embedded/ssl/certs/ like in v7.53.0.
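
One way to see which default CA locations an OpenSSL build was compiled with is Python's `ssl.get_default_verify_paths()`. This is only a sketch: it reports the OpenSSL that the interpreter's `ssl` module is linked against, which is not necessarily the copy that confluent_kafka loads, so the output may differ from the paths in the log above. The function name `describe_default_ca_paths` is made up for illustration.

```python
import os
import ssl


def describe_default_ca_paths():
    """Report the compiled-in CA file/directory of Python's OpenSSL
    and whether each location exists on this machine."""
    paths = ssl.get_default_verify_paths()
    report = {}
    for label, location, env_name in (
        ("cafile", paths.openssl_cafile, paths.openssl_cafile_env),
        ("capath", paths.openssl_capath, paths.openssl_capath_env),
    ):
        report[label] = {
            "path": location,
            "exists": bool(location) and os.path.exists(location),
            # Environment variable that OpenSSL consults to override this default.
            "env_override": env_name,
        }
    return report


if __name__ == "__main__":
    for label, info in describe_default_ca_paths().items():
        print(label, info)
```

Running it with `/opt/datadog-agent/embedded/bin/python` on both Agent versions would show whether the interpreter's own defaults changed, or only those of the library bundled with the Kafka client.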

froque commented 5 days ago

Your logs were successfully uploaded. For future reference, your internal case id is 1751844

HadhemiDD commented 4 days ago

> From what I have already explored, it seems that in version v7.54.0 it expects a file in /usr/local/ssl/certs and not in /opt/datadog-agent/embedded/ssl/certs/ like in v7.53.0.

@froque Can you elaborate on where you found this change? Also, can you try using port 9091 for the Kafka broker instead (update the config on the Kafka side), set the same port on the Datadog side (in script.py), then run the script again and see if it works?

froque commented 3 days ago

@HadhemiDD I messed around with the differences between the v7.53 and v7.54 Debian packages.

❯ wget --quiet https://apt.datadoghq.com/pool/d/da/datadog-agent_7.53.0-1_amd64.deb
❯ wget --quiet https://apt.datadoghq.com/pool/d/da/datadog-agent_7.54.0-1_amd64.deb
❯ mkdir v7.53 v7.54
❯ ar --output v7.53 x datadog-agent_7.53.0-1_amd64.deb 
❯ ar --output v7.54 x datadog-agent_7.54.0-1_amd64.deb 
❯ tar --directory=v7.53 -Jxf v7.53/data.tar.xz
❯ tar --directory=v7.54 -Jxf v7.54/data.tar.xz

I noticed that librdkafka is no longer in the same path

❯ find -name \*librdkafka\*so\* -type f
./v7.53/opt/datadog-agent/embedded/lib/librdkafka++.so.1
./v7.53/opt/datadog-agent/embedded/lib/librdkafka.so.1
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka.libs/librdkafka-27145264.so.1

And a new libcrypto exists

❯ find -name \*libcrypto\*so\* -type f| sort                 
./v7.53/opt/datadog-agent/embedded/lib/libcrypto.so.3
./v7.53/opt/datadog-agent/embedded/lib/python3.11/site-packages/psycopg2_binary.libs/libcrypto-7d0e8add.so.1.1
./v7.54/opt/datadog-agent/embedded/lib/libcrypto.so.3
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/aerospike.libs/libcrypto-e31f2095.so.3
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka.libs/libcrypto-b840c11b.so.3
./v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/psycopg2_binary.libs/libcrypto-7d0e8add.so.1.1
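
The same inventory can be produced without `find`, which may be handy on hosts where only the Agent's embedded Python is available. This is a minimal sketch; `find_shared_libs` is a hypothetical helper, and it assumes the two packages were unpacked under the current directory as above.

```python
from pathlib import Path


def find_shared_libs(root, name_fragment):
    """Return sorted paths of regular files under root whose name looks like
    a shared object containing name_fragment (e.g. 'libcrypto')."""
    pattern = f"*{name_fragment}*so*"
    return sorted(
        str(path)
        for path in Path(root).rglob(pattern)
        # Mirror `find -type f`: skip directories and symlinks.
        if path.is_file() and not path.is_symlink()
    )


if __name__ == "__main__":
    for fragment in ("librdkafka", "libcrypto"):
        print(fragment)
        for lib in find_shared_libs(".", fragment):
            print("  ", lib)
```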

Searching the extracted files for the two certificate paths:

❯ rgrep '/opt/datadog-agent/embedded/ssl/certs' v7* 
grep: v7.53/opt/datadog-agent/embedded/lib/libcrypto.so.3: binary file matches
grep: v7.54/opt/datadog-agent/embedded/lib/libcrypto.so.3: binary file matches
❯ rgrep '/usr/local/ssl/certs' v7*
grep: v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/confluent_kafka.libs/libcrypto-b840c11b.so.3: binary file matches
grep: v7.54/opt/datadog-agent/embedded/lib/python3.11/site-packages/aerospike.libs/libcrypto-e31f2095.so.3: binary file matches
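
The string search above can also be scripted, for example to repeat the comparison on other Agent versions. The sketch below is an illustration only: `files_containing` is a hypothetical helper, and it reads each file fully into memory, which is fine for an extracted package but not for very large trees.

```python
from pathlib import Path


def files_containing(root, needle):
    """Return sorted paths of regular files under root whose raw bytes
    contain needle (similar in spirit to a recursive grep on binaries)."""
    needle_bytes = needle.encode()
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.is_symlink():
            continue
        try:
            if needle_bytes in path.read_bytes():
                hits.append(str(path))
        except OSError:
            # Unreadable files are skipped rather than aborting the scan.
            continue
    return hits


if __name__ == "__main__":
    for cert_dir in ("/opt/datadog-agent/embedded/ssl/certs", "/usr/local/ssl/certs"):
        print(cert_dir)
        for hit in files_containing(".", cert_dir):
            print("  ", hit)
```

A library that contains only `/usr/local/ssl/certs` was presumably built with OpenSSL's stock default directory rather than the Agent's embedded one, which would be consistent with the `stat(/usr/local/ssl/certs)` errors in the earlier log.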