Open jk2l opened 6 months ago
please note that this have been broken for all of our production environment recently when we updated our system to use 0.48.0 jmx patch
Also there were only one "Trying to reconnect.", it never happened again even if there is continuous failing
after the instance failed a while, there are new type of error messags
2024-04-17 17:05:58 NZST | JMX | WARN | App | No instance could be initiated. Retrying initialization. 2024-04-17 17:05:58 NZST | JMX | INFO | App | Found 0 config files 2024-04-17 17:05:58 NZST | JMX | INFO | App | Cleaning up instances... 2024-04-17 17:05:58 NZST | JMX | INFO | InstanceCleanupTask | Trying to cleanup: jboss_hcn 2024-04-17 17:05:58 NZST | JMX | INFO | App | Dealing with YAML config instances... 2024-04-17 17:05:58 NZST | JMX | INFO | App | Dealing with Auto-Config instances collected... 2024-04-17 17:05:58 NZST | JMX | INFO | App | Instantiating instance for: jboss_wildfly 2024-04-17 17:05:59 NZST | JMX | INFO | App | Started instance initialization... 2024-04-17 17:05:59 NZST | JMX | INFO | Instance | Trying to connect to JMX Server at jboss_hcn 2024-04-17 17:05:59 NZST | JMX | INFO | Instance | Connection closed or does not exist. Attempting to create a new connection... 2024-04-17 17:05:59 NZST | JMX | INFO | ConnectionFactory | Connecting using JMX Remote
2024-04-17 17:05:59 NZST | JMX | INFO | Connection | Connecting to: service:jmx:remote+http://10.190.196.69:9990 2024-04-17 17:05:59 NZST | JMX | INFO | App | Completed instance initialization... 2024-04-17 17:05:59 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn: java.util.concurrent.ExecutionException: java.nio.channels.ClosedSelectorException at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) at org.datadog.jmxfetch.App.processRecoveryResults(App.java:998) at org.datadog.jmxfetch.App$5.invoke(App.java:944) at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63) at org.datadog.jmxfetch.App.init(App.java:934) at org.datadog.jmxfetch.App.start(App.java:421) at org.datadog.jmxfetch.App.run(App.java:179) at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57) at org.datadog.jmxfetch.App.main(App.java:92) Caused by: java.nio.channels.ClosedSelectorException at java.base/sun.nio.ch.SelectorImpl.ensureOpen(SelectorImpl.java:80) at java.base/sun.nio.ch.SelectorImpl.implRegister(SelectorImpl.java:239) at java.base/sun.nio.ch.SelectorImpl.register(SelectorImpl.java:215) at java.base/java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:236) at java.base/java.nio.channels.SelectableChannel.register(SelectableChannel.java:260) at org.xnio.nio.WorkerThread.registerChannel(WorkerThread.java:743) at org.xnio.nio.WorkerThread.openTcpStreamConnection(WorkerThread.java:262) at org.xnio.XnioIoThread.internalOpenTcpStreamConnection(XnioIoThread.java:254) at org.xnio.XnioIoThread.openStreamConnection(XnioIoThread.java:226) at org.xnio.XnioWorker.openStreamConnection(XnioWorker.java:405) at org.jboss.remoting3.remote.RemoteConnectionProvider.createSslConnection(RemoteConnectionProvider.java:246) at org.jboss.remoting3.remote.HttpUpgradeConnectionProvider.createSslConnection(HttpUpgradeConnectionProvider.java:135) at org.jboss.remoting3.remote.RemoteConnectionProvider.connect(RemoteConnectionProvider.java:206) at org.jboss.remoting3.EndpointImpl.connect(EndpointImpl.java:649) at org.jboss.remoting3.EndpointImpl.connect(EndpointImpl.java:561) at org.jboss.remoting3.EndpointImpl.connect(EndpointImpl.java:549) at org.jboss.remotingjmx.RemotingConnector.internalRemotingConnect(RemotingConnector.java:268) at org.jboss.remotingjmx.RemotingConnector.internalConnect(RemotingConnector.java:156) at org.jboss.remotingjmx.RemotingConnector.connect(RemotingConnector.java:103) at java.management/javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270) at org.datadog.jmxfetch.Connection.createConnection(Connection.java:57) at org.datadog.jmxfetch.RemoteConnection.
(RemoteConnection.java:101) at org.datadog.jmxfetch.ConnectionFactory.createConnection(ConnectionFactory.java:40) at org.datadog.jmxfetch.Instance.getConnection(Instance.java:455) at org.datadog.jmxfetch.Instance.init(Instance.java:468) at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15) at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840) 2024-04-17 17:05:59 NZST | JMX | WARN | App | Unable to instantiate or initialize instance jboss_hcn for an unknown reason.null
What version of the Agent are you using?
0.48.0 - Agent 6.49.1 - Commit: 1790cab - Serialization version: v5.0.97 - Go version: go1.20.10 0.49.0 - Agent 6.52.1 - Commit: 51dd448 - Serialization version: v5.0.104 - Go version: go1.21.8
i also tried to reproduce disconnect connections and force this fail via iptables DROP. but it don't seem to be able to reproduce this reconnect issue
updating 0.49.0 server with Agent 7.52.1 - Commit: 51dd448 - Serialization version: v5.0.104 - Go version: go1.21.8
here is log if using iptables DROP
24-04-18 16:54:28 NZST | JMX | WARN | App | No instance could be initiated. Retrying initialization. 2024-04-18 16:54:28 NZST | JMX | INFO | App | Found 0 config files 2024-04-18 16:54:28 NZST | JMX | INFO | App | Cleaning up instances... 2024-04-18 16:54:28 NZST | JMX | INFO | InstanceCleanupTask | Trying to cleanup: jboss_hcn 2024-04-18 16:54:28 NZST | JMX | INFO | App | Dealing with YAML config instances... 2024-04-18 16:54:28 NZST | JMX | INFO | App | Dealing with Auto-Config instances collected... 2024-04-18 16:54:28 NZST | JMX | INFO | App | Instantiating instance for: jboss_wildfly 2024-04-18 16:54:28 NZST | JMX | INFO | App | Started instance initialization... 2024-04-18 16:54:28 NZST | JMX | INFO | Instance | Trying to connect to JMX Server at jboss_hcn 2024-04-18 16:54:28 NZST | JMX | INFO | Instance | Connection closed or does not exist. Attempting to create a new connection... 2024-04-18 16:54:28 NZST | JMX | INFO | ConnectionFactory | Connecting using JMX Remote 2024-04-18 16:54:28 NZST | JMX | INFO | Connection | Connecting to: service:jmx:remote+http://10.190.196.69:9990 2024-04-18 16:54:58 NZST | JMX | INFO | App | Completed instance initialization... 2024-04-18 16:54:58 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn: java.util.concurrent.ExecutionException: java.io.IOException: Operation failed with status WAITING at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) at org.datadog.jmxfetch.App.processRecoveryResults(App.java:998) at org.datadog.jmxfetch.App$5.invoke(App.java:944) at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63) at org.datadog.jmxfetch.App.init(App.java:934) at org.datadog.jmxfetch.App.start(App.java:421) at org.datadog.jmxfetch.App.run(App.java:179) at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57) at org.datadog.jmxfetch.App.main(App.java:92) Caused by: java.io.IOException: Operation failed with status WAITING at org.jboss.remotingjmx.RemotingConnector.internalRemotingConnect(RemotingConnector.java:276) at org.jboss.remotingjmx.RemotingConnector.internalConnect(RemotingConnector.java:156) at org.jboss.remotingjmx.RemotingConnector.connect(RemotingConnector.java:103) at java.management/javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270) at org.datadog.jmxfetch.Connection.createConnection(Connection.java:57) at org.datadog.jmxfetch.RemoteConnection.
(RemoteConnection.java:101) at org.datadog.jmxfetch.ConnectionFactory.createConnection(ConnectionFactory.java:40) at org.datadog.jmxfetch.Instance.getConnection(Instance.java:455) at org.datadog.jmxfetch.Instance.init(Instance.java:468) at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15) at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840) 2024-04-18 16:54:58 NZST | JMX | WARN | App | Unable to instantiate or initialize instance jboss_hcn. Is the target JMX Server or JVM running? Operation failed with status WAITING 2024-04-18 16:55:05 NZST | JMX | INFO | App | update is in order - updating timestamp: 1713416105
it seem the issue was jmxfetch out of heap space. i think the OutOfMemory error should not be under debug mode error logging
here is before and after turning on/off debug mode
2024-04-12 11:51:15 NZST | JMX | INFO | Instance | Trying to collect bean list for the first time for JMX Server at jboss_hcn
2024-04-12 11:51:20 NZST | JMX | INFO | Instance | Connected to JMX Server at jboss_hcn with 3556 beans
2024-04-12 11:52:16 NZST | JMX | INFO | App | Completed instance initialization...
2024-04-12 11:52:16 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.CancellationException
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:121)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.datadog.jmxfetch.App.processRecoveryResults(App.java:998)
at org.datadog.jmxfetch.App$5.invoke(App.java:944)
at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
at org.datadog.jmxfetch.App.init(App.java:934)
at org.datadog.jmxfetch.App.start(App.java:421)
at org.datadog.jmxfetch.App.run(App.java:179)
at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
at org.datadog.jmxfetch.App.main(App.java:92)
Debug mode on
2024-04-27 08:09:10 NZST | JMX | INFO | Connection | Connecting to: service:jmx:remote+http://10.190.197.136:9990
2024-04-27 08:09:10 NZST | JMX | INFO | Instance | Trying to collect bean list for the first time for JMX Server at jboss_hcn
2024-04-27 08:09:10 NZST | JMX | DEBUG | Connection | Querying bean names on scope: java.lang:*
2024-04-27 08:09:10 NZST | JMX | DEBUG | Connection | Querying bean names on scope: java.nio:type=BufferPool,*
2024-04-27 08:09:10 NZST | JMX | DEBUG | Connection | Querying bean names on scope: jboss.as:*
Exception in thread "remoting-jmx client-thread-50" java.lang.OutOfMemoryError: Java heap space
2024-04-27 08:10:10 NZST | JMX | INFO | App | Completed instance initialization...
2024-04-27 08:10:10 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.CancellationException
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:121)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.datadog.jmxfetch.App.processRecoveryResults(App.java:1024)
at org.datadog.jmxfetch.App$5.invoke(App.java:970)
at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
at org.datadog.jmxfetch.App.init(App.java:960)
at org.datadog.jmxfetch.App.start(App.java:417)
at org.datadog.jmxfetch.App.run(App.java:181)
at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
at org.datadog.jmxfetch.App.main(App.java:94)
Updated issue title to reflect the issue better
Thanks for the detailed analysis of the issue. No 100% sure why that error message only appears if debug
enabled in the Agent. Will have a look at it.
@carlosroman Update
Increased agent memory from 200MB to 600MB, it is still dying. atm i am not sure is there some sort of memory leak happening in datadog agent
the first error produced have more detailed stack trace, but subsequence error do not have stack trace for heap space exception
2024-05-15 16:58:30 NZST | JMX | DEBUG | Connection | Querying bean names on scope: jboss.as:*
Exception in thread "remoting-jmx client-thread-82" java.lang.OutOfMemoryError: Java heap space
at java.base/java.lang.StringLatin1.toChars(StringLatin1.java:74)
at java.base/java.lang.String.toCharArray(String.java:4103)
at java.management/javax.management.ObjectName.construct(ObjectName.java:441)
at java.management/javax.management.ObjectName.readObject(ObjectName.java:1177)
at java.base/java.lang.invoke.LambdaForm$DMH/0x00007f7668200400.invokeSpecial(LambdaForm$DMH)
at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766827c400.invoke(LambdaForm$MH)
at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766823c800.invoke_MT(LambdaForm$MH)
at org.jboss.marshalling.reflect.JDKSpecific$SerMethods.callReadObject(JDKSpecific.java:171)
at org.jboss.marshalling.reflect.SerializableClass.callReadObject(SerializableClass.java:252)
at org.jboss.marshalling.river.RiverUnmarshaller.doInitSerializable(RiverUnmarshaller.java:1997)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1601)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:441)
at org.jboss.marshalling.river.BlockUnmarshaller.readObject(BlockUnmarshaller.java:149)
at org.jboss.marshalling.river.BlockUnmarshaller.readObject(BlockUnmarshaller.java:135)
at org.jboss.marshalling.MarshallerObjectInputStream.readObjectOverride(MarshallerObjectInputStream.java:53)
at org.jboss.marshalling.river.RiverObjectInputStream.readObjectOverride(RiverObjectInputStream.java:307)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:500)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:467)
at java.base/java.util.HashSet.readObject(HashSet.java:345)
at java.base/java.lang.invoke.LambdaForm$DMH/0x00007f7668200400.invokeSpecial(LambdaForm$DMH)
at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766827b000.invoke(LambdaForm$MH)
at java.base/java.lang.invoke.LambdaForm$MH/0x00007f766823c800.invoke_MT(LambdaForm$MH) at org.jboss.marshalling.reflect.JDKSpecific$SerMethods.callReadObject(JDKSpecific.java:171) at org.jboss.marshalling.reflect.SerializableClass.callReadObject(SerializableClass.java:252) at org.jboss.marshalling.river.RiverUnmarshaller.doInitSerializable(RiverUnmarshaller.java:1997) at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1601)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:441)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:373)
at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:41)
at org.jboss.remotingjmx.protocol.v2.ClientCommon$MarshalledResponseHandler.readValue(ClientCommon.java:224) at org.jboss.remotingjmx.protocol.v2.ClientCommon$BaseResponseHandler.handle(ClientCommon.java:176) at org.jboss.remotingjmx.protocol.v2.ClientCommon$MessageReceiver$1.run(ClientCommon.java:123)2024-05-15 16:59:30 NZST | JMX | INFO | App | Completed instance initialization...2024-05-15 16:59:30 NZST | JMX | WARN | App | Could not initialize instance: jboss_hcn:
java.util.concurrent.CancellationException at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:121) at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.datadog.jmxfetch.App.processRecoveryResults(App.java:1024)
at org.datadog.jmxfetch.App$5.invoke(App.java:970)
at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
at org.datadog.jmxfetch.App.init(App.java:960)
at org.datadog.jmxfetch.App.start(App.java:417)
at org.datadog.jmxfetch.App.run(App.java:181)
at org.datadog.jmxfetch.JmxFetch.main(JmxFetch.java:57)
at org.datadog.jmxfetch.App.main(App.java:94)
@jk2l side question, do you know how I can limit the number of Heatbeat JMX fetch threads?
@jk2l side question, do you know how I can limit the number of Heatbeat JMX fetch threads?
nope, i am still figuring how this whole thing work atm
It seem that the system at some point failed and it never able to recover. the only resolution at the moment is to restart datadog-agent. I am suspecting the bug is caused by https://github.com/DataDog/jmxfetch/pull/432 where during the clean up this.connection set to null but the instance never re-initialize the connection.
I am not familiar with Java as i am not java developer, but i am thinking will replacing
connection.getMBeanInfo
tothis.getConnection().getMBeanInfo
fix this issue. but the getConnection() probably need to set this.connection tooTested on:
one side effect happen is that i see two instances when it start to fail. but if it is running normally, it only show one instance