DataDog / jmxfetch

Export JMX metrics
Other
98 stars 68 forks source link

Jmxfetch OutOfMemoryError being suppressed #519

Open jk2l opened 6 months ago

jk2l commented 6 months ago

It seem that the system at some point failed and it never able to recover. the only resolution at the moment is to restart datadog-agent. I am suspecting the bug is caused by https://github.com/DataDog/jmxfetch/pull/432 where during the clean up this.connection set to null but the instance never re-initialize the connection.

I am not familiar with Java as i am not java developer, but i am thinking will replacing connection.getMBeanInfo to this.getConnection().getMBeanInfo fix this issue. but the getConnection() probably need to set this.connection too

2024-04-16 05:15:51 NZST | JMX | INFO | Instance | Refreshing bean list for jboss_wildfly 2024-04-16 05:16:51 NZST | JMX | WARN | App | Unable to collect metrics or refresh bean list.: java.util.concurrent.CancellationException 2024-04-16 05:16:51 NZST | JMX | WARN | App | Instance jboss_hcn didn't return any metrics. Maybe the server got disconnected ? Trying to reconnect. 2024-04-16 05:16:51 NZST | JMX | WARN | Instance | Cannot get attributes or class name for bean jboss.as:subsystem=infinispan,cache-container=hibernate,cache=hcn.ear/hcn-codetabledata-jpa#CodeTableJPAUnit.nz.govt.cyf.hcn.server.entity.codetables.PlanDurationType: java.lang.NullPointerException: Cannot invoke "org.datadog.jmxfetch.Connection.getMBeanInfo(javax.management.ObjectName)" because "this.connection" is null at org.datadog.jmxfetch.Instance.getMatchingAttributes(Instance.java:573) at org.datadog.jmxfetch.Instance.getMetrics(Instance.java:482) at org.datadog.jmxfetch.MetricCollectionTask.call(MetricCollectionTask.java:26) at org.datadog.jmxfetch.MetricCollectionTask.call(MetricCollectionTask.java:8) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840)

Tested on:

======== JMXFetch

Information

runtime_version : 17.0.9
version : 0.48.0

Initialized checks

jboss_wildfly
- instance_name: jboss_hcn
  metric_count: 0
  service_check_count: 0
  message: Unable to instantiate or initialize instance jboss_hcn for an unknown reason.null
  status: ERROR
- instance_name: jboss_hcn
  metric_count: 0
  service_check_count: 0
  message: Unable to instantiate or initialize instance jboss_hcn for an unknown reason.null
  status: ERROR

Failed checks

no checks

one side effect happen is that i see two instances when it start to fail. but if it is running normally, it only show one instance

jk2l commented 6 months ago

please note that this have been broken for all of our production environment recently when we updated our system to use 0.48.0 jmx patch

jk2l commented 6 months ago

Also there were only one "Trying to reconnect.", it never happened again even if there is continuous failing

jk2l commented 6 months ago

after the instance failed a while, there are new type of error messags

2024-04-17 17:05:58 NZST | JMX | WARN | App | No instance could be initiated. Retrying initialization. 2024-04-17 17:05:58 NZST | JMX | INFO | App | Found 0 config files 2024-04-17 17:05:58 NZST | JMX | INFO | App | Cleaning up instances... 2024-04-17 17:05:58 NZST | JMX | INFO | InstanceCleanupTask | Trying to cleanup: jboss_hcn 2024-04-17 17:05:58 NZST | JMX | INFO | App | Dealing with YAML config instances... 2024-04-17 17:05:58 NZST | JMX | INFO | App | Dealing with Auto-Config instances collected... 2024-04-17 17:05:58 NZST | JMX | INFO | App | Instantiating instance for: jboss_wildfly 2024-04-17 17:05:59 NZST | JMX | INFO | App | Started instance initialization... 2024-04-17 17:05:59 NZST | JMX | INFO | Instance | Trying to connect to JMX Server at jboss_hcn 2024-04-17 17:05:59 NZST | JMX | INFO | Instance | Connection closed or does not exist. Attempting to create a new connection... 2024-04-17 17:05:59 NZST | JMX | INFO | ConnectionFactory | Connecting using JMX Remote

2024-04-17 17:05:59 NZST | JMX | INFO | Connection | Connecting to: service:jmx:remote+http://10.190.196.69:9990 2024-04-17 17:05:59 NZST | JMX | INFO | App | Completed instance initialization... 2024-04-17 17:05:59 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn: java.util.concurrent.ExecutionException: java.nio.channels.ClosedSelectorException at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) at org.datadog.jmxfetch.App.processRecoveryResults(App.java:998) at org.datadog.jmxfetch.App$5.invoke(App.java:944) at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63) at org.datadog.jmxfetch.App.init(App.java:934) at org.datadog.jmxfetch.App.start(App.java:421) at org.datadog.jmxfetch.App.run(App.java:179) at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57) at org.datadog.jmxfetch.App.main(App.java:92) Caused by: java.nio.channels.ClosedSelectorException at java.base/sun.nio.ch.SelectorImpl.ensureOpen(SelectorImpl.java:80) at java.base/sun.nio.ch.SelectorImpl.implRegister(SelectorImpl.java:239) at java.base/sun.nio.ch.SelectorImpl.register(SelectorImpl.java:215) at java.base/java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:236) at java.base/java.nio.channels.SelectableChannel.register(SelectableChannel.java:260) at org.xnio.nio.WorkerThread.registerChannel(WorkerThread.java:743) at org.xnio.nio.WorkerThread.openTcpStreamConnection(WorkerThread.java:262) at org.xnio.XnioIoThread.internalOpenTcpStreamConnection(XnioIoThread.java:254) at org.xnio.XnioIoThread.openStreamConnection(XnioIoThread.java:226) at org.xnio.XnioWorker.openStreamConnection(XnioWorker.java:405) at org.jboss.remoting3.remote.RemoteConnectionProvider.createSslConnection(RemoteConnectionProvider.java:246) at org.jboss.remoting3.remote.HttpUpgradeConnectionProvider.createSslConnection(HttpUpgradeConnectionProvider.java:135) at org.jboss.remoting3.remote.RemoteConnectionProvider.connect(RemoteConnectionProvider.java:206) at org.jboss.remoting3.EndpointImpl.connect(EndpointImpl.java:649) at org.jboss.remoting3.EndpointImpl.connect(EndpointImpl.java:561) at org.jboss.remoting3.EndpointImpl.connect(EndpointImpl.java:549) at org.jboss.remotingjmx.RemotingConnector.internalRemotingConnect(RemotingConnector.java:268) at org.jboss.remotingjmx.RemotingConnector.internalConnect(RemotingConnector.java:156) at org.jboss.remotingjmx.RemotingConnector.connect(RemotingConnector.java:103) at java.management/javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270) at org.datadog.jmxfetch.Connection.createConnection(Connection.java:57) at org.datadog.jmxfetch.RemoteConnection.(RemoteConnection.java:101) at org.datadog.jmxfetch.ConnectionFactory.createConnection(ConnectionFactory.java:40) at org.datadog.jmxfetch.Instance.getConnection(Instance.java:455) at org.datadog.jmxfetch.Instance.init(Instance.java:468) at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15) at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840)

2024-04-17 17:05:59 NZST | JMX | WARN | App | Unable to instantiate or initialize instance jboss_hcn for an unknown reason.null

carlosroman commented 6 months ago

What version of the Agent are you using?

jk2l commented 6 months ago

0.48.0 - Agent 6.49.1 - Commit: 1790cab - Serialization version: v5.0.97 - Go version: go1.20.10 0.49.0 - Agent 6.52.1 - Commit: 51dd448 - Serialization version: v5.0.104 - Go version: go1.21.8

jk2l commented 6 months ago

i also tried to reproduce disconnect connections and force this fail via iptables DROP. but it don't seem to be able to reproduce this reconnect issue

updating 0.49.0 server with Agent 7.52.1 - Commit: 51dd448 - Serialization version: v5.0.104 - Go version: go1.21.8

jk2l commented 6 months ago

here is log if using iptables DROP

24-04-18 16:54:28 NZST | JMX | WARN | App | No instance could be initiated. Retrying initialization. 2024-04-18 16:54:28 NZST | JMX | INFO | App | Found 0 config files 2024-04-18 16:54:28 NZST | JMX | INFO | App | Cleaning up instances... 2024-04-18 16:54:28 NZST | JMX | INFO | InstanceCleanupTask | Trying to cleanup: jboss_hcn 2024-04-18 16:54:28 NZST | JMX | INFO | App | Dealing with YAML config instances... 2024-04-18 16:54:28 NZST | JMX | INFO | App | Dealing with Auto-Config instances collected... 2024-04-18 16:54:28 NZST | JMX | INFO | App | Instantiating instance for: jboss_wildfly 2024-04-18 16:54:28 NZST | JMX | INFO | App | Started instance initialization... 2024-04-18 16:54:28 NZST | JMX | INFO | Instance | Trying to connect to JMX Server at jboss_hcn 2024-04-18 16:54:28 NZST | JMX | INFO | Instance | Connection closed or does not exist. Attempting to create a new connection... 2024-04-18 16:54:28 NZST | JMX | INFO | ConnectionFactory | Connecting using JMX Remote 2024-04-18 16:54:28 NZST | JMX | INFO | Connection | Connecting to: service:jmx:remote+http://10.190.196.69:9990 2024-04-18 16:54:58 NZST | JMX | INFO | App | Completed instance initialization... 2024-04-18 16:54:58 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn: java.util.concurrent.ExecutionException: java.io.IOException: Operation failed with status WAITING at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) at org.datadog.jmxfetch.App.processRecoveryResults(App.java:998) at org.datadog.jmxfetch.App$5.invoke(App.java:944) at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63) at org.datadog.jmxfetch.App.init(App.java:934) at org.datadog.jmxfetch.App.start(App.java:421) at org.datadog.jmxfetch.App.run(App.java:179) at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57) at org.datadog.jmxfetch.App.main(App.java:92) Caused by: java.io.IOException: Operation failed with status WAITING at org.jboss.remotingjmx.RemotingConnector.internalRemotingConnect(RemotingConnector.java:276) at org.jboss.remotingjmx.RemotingConnector.internalConnect(RemotingConnector.java:156) at org.jboss.remotingjmx.RemotingConnector.connect(RemotingConnector.java:103) at java.management/javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270) at org.datadog.jmxfetch.Connection.createConnection(Connection.java:57) at org.datadog.jmxfetch.RemoteConnection.(RemoteConnection.java:101) at org.datadog.jmxfetch.ConnectionFactory.createConnection(ConnectionFactory.java:40) at org.datadog.jmxfetch.Instance.getConnection(Instance.java:455) at org.datadog.jmxfetch.Instance.init(Instance.java:468) at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15) at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840)

2024-04-18 16:54:58 NZST | JMX | WARN | App | Unable to instantiate or initialize instance jboss_hcn. Is the target JMX Server or JVM running? Operation failed with status WAITING 2024-04-18 16:55:05 NZST | JMX | INFO | App | update is in order - updating timestamp: 1713416105

jk2l commented 6 months ago

it seem the issue was jmxfetch out of heap space. i think the OutOfMemory error should not be under debug mode error logging

here is before and after turning on/off debug mode

2024-04-12 11:51:15 NZST | JMX | INFO | Instance | Trying to collect bean list for the first time for JMX Server at jboss_hcn
2024-04-12 11:51:20 NZST | JMX | INFO | Instance | Connected to JMX Server at jboss_hcn with 3556 beans
2024-04-12 11:52:16 NZST | JMX | INFO | App | Completed instance initialization...
2024-04-12 11:52:16 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.CancellationException
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:121)
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.datadog.jmxfetch.App.processRecoveryResults(App.java:998)
        at org.datadog.jmxfetch.App$5.invoke(App.java:944)
        at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
        at org.datadog.jmxfetch.App.init(App.java:934)
        at org.datadog.jmxfetch.App.start(App.java:421)
        at org.datadog.jmxfetch.App.run(App.java:179)
        at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
        at org.datadog.jmxfetch.App.main(App.java:92)

Debug mode on

2024-04-27 08:09:10 NZST | JMX | INFO | Connection | Connecting to: service:jmx:remote+http://10.190.197.136:9990
2024-04-27 08:09:10 NZST | JMX | INFO | Instance | Trying to collect bean list for the first time for JMX Server at jboss_hcn
2024-04-27 08:09:10 NZST | JMX | DEBUG | Connection | Querying bean names on scope: java.lang:*
2024-04-27 08:09:10 NZST | JMX | DEBUG | Connection | Querying bean names on scope: java.nio:type=BufferPool,*
2024-04-27 08:09:10 NZST | JMX | DEBUG | Connection | Querying bean names on scope: jboss.as:*
Exception in thread "remoting-jmx client-thread-50" java.lang.OutOfMemoryError: Java heap space
2024-04-27 08:10:10 NZST | JMX | INFO | App | Completed instance initialization...
2024-04-27 08:10:10 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.CancellationException
        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:121)
        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.datadog.jmxfetch.App.processRecoveryResults(App.java:1024)
        at org.datadog.jmxfetch.App$5.invoke(App.java:970)
        at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
        at org.datadog.jmxfetch.App.init(App.java:960)
        at org.datadog.jmxfetch.App.start(App.java:417)
        at org.datadog.jmxfetch.App.run(App.java:181)
        at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
        at org.datadog.jmxfetch.App.main(App.java:94)
jk2l commented 6 months ago

Updated issue title to reflect the issue better

carlosroman commented 5 months ago

Thanks for the detailed analysis of the issue. No 100% sure why that error message only appears if debug enabled in the Agent. Will have a look at it.

jk2l commented 5 months ago

@carlosroman Update

Increased agent memory from 200MB to 600MB, it is still dying. atm i am not sure is there some sort of memory leak happening in datadog agent

the first error produced have more detailed stack trace, but subsequence error do not have stack trace for heap space exception

2024-05-15 16:58:30 NZST | JMX | DEBUG | Connection | Querying bean names on scope: jboss.as:*
Exception in thread "remoting-jmx client-thread-82" java.lang.OutOfMemoryError: Java heap space
        at java.base/java.lang.StringLatin1.toChars(StringLatin1.java:74)
        at java.base/java.lang.String.toCharArray(String.java:4103)
        at java.management/javax.management.ObjectName.construct(ObjectName.java:441)
        at java.management/javax.management.ObjectName.readObject(ObjectName.java:1177)
        at java.base/java.lang.invoke.LambdaForm$DMH/0x00007f7668200400.invokeSpecial(LambdaForm$DMH)
        at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766827c400.invoke(LambdaForm$MH)
        at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766823c800.invoke_MT(LambdaForm$MH)
        at org.jboss.marshalling.reflect.JDKSpecific$SerMethods.callReadObject(JDKSpecific.java:171)
        at org.jboss.marshalling.reflect.SerializableClass.callReadObject(SerializableClass.java:252)
        at org.jboss.marshalling.river.RiverUnmarshaller.doInitSerializable(RiverUnmarshaller.java:1997)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1601)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:441)
        at org.jboss.marshalling.river.BlockUnmarshaller.readObject(BlockUnmarshaller.java:149)
        at org.jboss.marshalling.river.BlockUnmarshaller.readObject(BlockUnmarshaller.java:135)
        at org.jboss.marshalling.MarshallerObjectInputStream.readObjectOverride(MarshallerObjectInputStream.java:53)
        at org.jboss.marshalling.river.RiverObjectInputStream.readObjectOverride(RiverObjectInputStream.java:307)
        at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:500)
        at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:467)
        at java.base/java.util.HashSet.readObject(HashSet.java:345)
        at java.base/java.lang.invoke.LambdaForm$DMH/0x00007f7668200400.invokeSpecial(LambdaForm$DMH)
        at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766827b000.invoke(LambdaForm$MH)
        at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766823c800.invoke_MT(LambdaForm$MH)        at org.jboss.marshalling.reflect.JDKSpecific$SerMethods.callReadObject(JDKSpecific.java:171)        at org.jboss.marshalling.reflect.SerializableClass.callReadObject(SerializableClass.java:252)        at org.jboss.marshalling.river.RiverUnmarshaller.doInitSerializable(RiverUnmarshaller.java:1997)        at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1601)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:441)
        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:373)
        at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:41)
        at org.jboss.remotingjmx.protocol.v2.ClientCommon$MarshalledResponseHandler.readValue(ClientCommon.java:224)        at org.jboss.remotingjmx.protocol.v2.ClientCommon$BaseResponseHandler.handle(ClientCommon.java:176)        at org.jboss.remotingjmx.protocol.v2.ClientCommon$MessageReceiver$1.run(ClientCommon.java:123)2024-05-15 16:59:30 NZST | JMX | INFO | App | Completed instance initialization...2024-05-15 16:59:30 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.CancellationException        at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:121)        at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
        at org.datadog.jmxfetch.App.processRecoveryResults(App.java:1024)
        at org.datadog.jmxfetch.App$5.invoke(App.java:970)
        at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
        at org.datadog.jmxfetch.App.init(App.java:960)
        at org.datadog.jmxfetch.App.start(App.java:417)
        at org.datadog.jmxfetch.App.run(App.java:181)
        at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
        at org.datadog.jmxfetch.App.main(App.java:94)
phongvq commented 5 months ago

@jk2l side question, do you know how I can limit the number of Heatbeat JMX fetch threads?

jk2l commented 5 months ago

@jk2l side question, do you know how I can limit the number of Heatbeat JMX fetch threads?

nope, i am still figuring how this whole thing work atm