archiver-appliance / epicsarchiverap

This is an implementation of an archiver for EPICS control systems that aims to archive millions of PVs.

Hazelcast heartbeat timed out when archiving large waveforms #114

Closed carneirofc closed 3 years ago

carneirofc commented 3 years ago

Greetings, I am trying to archive a large set of waveforms. Once I reach a certain number of PVs, I get the following exception in the engine container:

2021-02-17 18:51:21,428 [hz.client_0.internal-2] WARN  com.hazelcast.client.connection.nio.ClientConnection  - hz.client_0 [archappl] [3.10.1] ClientConnection{alive=false, connectionId=1, channel=NioChannel{/10.128.255.3:43919->/10.128.255.3:12000}, remoteEndpoint=[10.128.255.3]:12000, lastReadTime=2021-02-17 18:51:21.167, lastWriteTime=2021-02-17 18:51:21.165, closedTime=2021-02-17 18:51:21.180, lastHeartbeatRequested=2021-02-17 18:50:15.765, lastHeartbeatReceived=2021-02-17 18:50:15.773, connected server version=3.10.1} closed. Reason: com.hazelcast.spi.exception.TargetDisconnectedException[Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, channel=NioChannel{/10.128.255.3:43919->/10.128.255.3:12000}, remoteEndpoint=[10.128.255.3]:12000, lastReadTime=2021-02-17 18:51:21.167, lastWriteTime=2021-02-17 18:51:21.165, closedTime=never, lastHeartbeatRequested=2021-02-17 18:50:15.765, lastHeartbeatReceived=2021-02-17 18:50:15.773, connected server version=3.10.1}]
com.hazelcast.spi.exception.TargetDisconnectedException: Heartbeat timed out to owner connection ClientConnection{alive=true, connectionId=1, channel=NioChannel{/10.128.255.3:43919->/10.128.255.3:12000}, remoteEndpoint=[10.128.255.3]:12000, lastReadTime=2021-02-17 18:51:21.167, lastWriteTime=2021-02-17 18:51:21.165, closedTime=never, lastHeartbeatRequested=2021-02-17 18:50:15.765, lastHeartbeatReceived=2021-02-17 18:50:15.773, connected server version=3.10.1}
        at com.hazelcast.client.connection.nio.DefaultClientConnectionStrategy.onHeartbeatStopped(DefaultClientConnectionStrategy.java:117)
        at com.hazelcast.client.connection.nio.ClientConnectionManagerImpl.heartbeatStopped(ClientConnectionManagerImpl.java:730)
        at com.hazelcast.client.connection.nio.HeartbeatManager.fireHeartbeatStopped(HeartbeatManager.java:139)
        at com.hazelcast.client.connection.nio.HeartbeatManager.checkConnection(HeartbeatManager.java:98)
        at com.hazelcast.client.connection.nio.HeartbeatManager.run(HeartbeatManager.java:85)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
        at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:64)
        at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:80)
2021-02-17 18:51:21,452 [hz.client_0.internal-2] INFO  com.hazelcast.client.connection.ClientConnectionManager  - hz.client_0 [archappl] [3.10.1] Removed connection to endpoint: [10.128.255.3]:12000, connection: ClientConnection{alive=false, connectionId=1, channel=NioChannel{/10.128.255.3:43919->/10.128.255.3:12000}, remoteEndpoint=[10.128.255.3]:12000, lastReadTime=2021-02-17 18:51:21.167, lastWriteTime=2021-02-17 18:51:21.437, closedTime=2021-02-17 18:51:21.180, lastHeartbeatRequested=2021-02-17 18:50:15.765, lastHeartbeatReceived=2021-02-17 18:50:15.773, connected server version=3.10.1}
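
To make the timing in the log above easier to read, here is a minimal sketch that computes the gap between `lastHeartbeatReceived` and `closedTime`, using the two timestamps copied from the WARN line. The variable names are illustrative. For context, my understanding is that Hazelcast 3.x clients default to a 60 s heartbeat timeout (`hazelcast.client.heartbeat.timeout`); that default is an assumption here, since the log does not state it.

```python
from datetime import datetime

# Timestamp layout used in the Hazelcast log fields.
FMT = "%Y-%m-%d %H:%M:%S.%f"

# Values copied from the WARN line in the log.
last_heartbeat_received = datetime.strptime("2021-02-17 18:50:15.773", FMT)
closed_time = datetime.strptime("2021-02-17 18:51:21.180", FMT)

# Time with no heartbeat reply before the client closed the connection.
gap_seconds = (closed_time - last_heartbeat_received).total_seconds()
print(f"No heartbeat reply for {gap_seconds:.3f} s before the connection was closed")
```

The gap is a little over a minute, which would fit a 60 s timeout that is checked periodically. It shows only that the client saw no heartbeat reply for that long, not why.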

For the container, I am using the following JAVA_OPTS:

-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -Xmx120G

The problem appears once I reach about 250 LONG waveforms, each with 100k points, updating at 1 Hz.
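
For a sense of scale, the load described above can be turned into a rough raw data rate. This is a back-of-the-envelope sketch, assuming each LONG element is 4 bytes (EPICS LONG is a 32-bit integer) and counting only the sample payload. Timestamps, metadata, and serialization overhead are ignored, so the real figures will be somewhat higher.

```python
BYTES_PER_LONG = 4   # EPICS LONG is a 32-bit integer
n_waveforms = 250    # approximate PV count from the report
points = 100_000     # elements per waveform
rate_hz = 1          # updates per second per PV

# Raw payload only; excludes timestamps, metadata, and protocol overhead.
bytes_per_second = n_waveforms * points * BYTES_PER_LONG * rate_hz
mb_per_second = bytes_per_second / 1e6
tb_per_day = bytes_per_second * 86_400 / 1e12

print(f"{mb_per_second:.0f} MB/s, about {tb_per_day:.2f} TB/day")
```

Under these assumptions the engine has to ingest about 100 MB/s of raw samples, or roughly 8.64 TB per day.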

Is there any setting or configuration that might help? For the time being, I am using the Fall 2018 release.

carneirofc commented 3 years ago

I managed to get things working with EPICS R7.0.5, Tomcat 9, OpenJDK 15, and the latest Archiver release. I still have no clue what was causing this issue.