bitnami / charts

Bitnami Helm Charts
https://bitnami.com

[bitnami/cassandra] Metrics sidecar cannot scrape metrics #13668

Closed: gathanase closed this issue 1 year ago

gathanase commented 1 year ago

Name and Version

bitnami/cassandra 9.7.4

What steps will reproduce the bug?

Run a simple install with the metrics container enabled:

helm upgrade --install -f values.yaml cass bitnami/cassandra

Are you using any custom parameters or values?

clusterDomain: gamora.local
resources:
  requests:
    memory: 12Gi
  limits:
    memory: 12Gi
metrics:
  enabled: true

What is the expected behavior?

The metrics sidecar container can scrape Cassandra metrics.

What do you see instead?

The metrics sidecar container continuously logs errors:

[main] ERROR com.criteo.nosql.cassandra.exporter.Main - Scrapper stopped due to uncaught exception
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: localhost; nested exception is: 
    java.net.ConnectException: Connection refused (Connection refused)]
    at java.management.rmi/javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:370)
    at java.management/javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270)
    at com.criteo.nosql.cassandra.exporter.JmxScraper.run(JmxScraper.java:186)
    at com.criteo.nosql.cassandra.exporter.Main.start(Main.java:44)
    at com.criteo.nosql.cassandra.exporter.Main.main(Main.java:30)
Caused by: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: localhost; nested exception is: 
    java.net.ConnectException: Connection refused (Connection refused)]
    at jdk.naming.rmi/com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:137)
    at java.naming/com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:220)
    at java.naming/javax.naming.InitialContext.lookup(InitialContext.java:409)
    at java.management.rmi/javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1839)
    at java.management.rmi/javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1813)
    at java.management.rmi/javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:302)
    ... 4 more
Caused by: java.rmi.ConnectException: Connection refused to host: localhost; nested exception is: 
    java.net.ConnectException: Connection refused (Connection refused)
    at java.rmi/sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:623)
    at java.rmi/sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:209)
    at java.rmi/sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:196)
    at java.rmi/sun.rmi.server.UnicastRef.newCall(UnicastRef.java:343)
    at java.rmi/sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:116)
    at jdk.naming.rmi/com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:133)
    ... 9 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
    at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
    at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
    at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.base/java.net.Socket.connect(Socket.java:609)
    at java.base/java.net.Socket.connect(Socket.java:558)
    at java.base/java.net.Socket.<init>(Socket.java:454)
    at java.base/java.net.Socket.<init>(Socket.java:231)
    at java.rmi/sun.rmi.transport.tcp.TCPDirectSocketFactory.createSocket(TCPDirectSocketFactory.java:40)
    at java.rmi/sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:617)
    ... 14 more

Additional information

  1. The metrics sidecar seems to be wrongly configured to scrape localhost:5555 (.Values.metrics.containerPorts.jmx) instead of localhost:7199 (.Values.containerPorts.jmx).

I suggest the following change to charts/bitnami/cassandra/values.yaml, which fixes the problem:

799c799
<     host: localhost:{{ .Values.metrics.containerPorts.jmx }}
---
>     host: localhost:{{ .Values.containerPorts.jmx }}
  2. I think these two variables overlap, and I find it surprising that the metrics sidecar exposes a JMX port at all.
  3. The above fix solves the connection problem, but metrics are then painfully slow to retrieve from the sidecar container, which leads to probe timeouts (approx. 30 seconds for 100 Cassandra tables). I tried --set metrics.containerPorts.jmx=7199 and retrieval is now fast enough, though I am not sure why.

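
For reference, the --set workaround mentioned above can also be written as a values.yaml fragment. This is a minimal sketch, assuming that 7199 is the port Cassandra's JMX listens on (the value of .Values.containerPorts.jmx reported in this issue); only the key path from the --set flag is taken from the report.

```yaml
# values.yaml fragment: workaround sketch, equivalent to
#   --set metrics.containerPorts.jmx=7199
metrics:
  enabled: true
  containerPorts:
    # Assumption: 7199 matches .Values.containerPorts.jmx, so the exporter
    # connects to Cassandra's JMX port rather than the sidecar default (5555).
    jmx: 7199
```
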
javsalgar commented 1 year ago

Hi!

Thank you so much for reporting. Could you create a PR with the suggestion?

gathanase commented 1 year ago

I am very uncomfortable opening a PR, as I don't understand why there are two JMX ports.

rafariossaa commented 1 year ago

Hi. When testing the change, I deployed a Debian container and used curl to fetch the metrics, and it responded without any delay. Do you think this is related to the number of tables?

rafariossaa commented 1 year ago

Now that the change is merged, could you give it a try?

gathanase commented 1 year ago

The slow metrics retrieval was due to CPU limits on the metrics sidecar container. Thanks!
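
For anyone hitting the same slowdown, here is a hedged sketch of where the sidecar's CPU settings would live. It assumes the chart exposes a metrics.resources block shaped like the top-level resources block used earlier in this issue. The CPU figures are purely illustrative; the thread does not say which limits were in place or what they were changed to.

```yaml
# values.yaml fragment: illustrative only
metrics:
  enabled: true
  resources:           # assumption: same shape as the top-level `resources` block
    requests:
      cpu: 500m        # hypothetical value, not taken from the thread
    limits:
      cpu: "1"         # hypothetical value; a limit that is too low throttles the exporter
```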