enonic / xp

Enonic XP
https://enonic.com
GNU General Public License v3.0
198 stars 34 forks source link

Hazelcast may take much longer time to restart - need to wait longer on snapshot restore #10605

Closed rymsha closed 6 days ago

rymsha commented 1 week ago

Here are logs where after snapshot restore Hazelcast was still alive for ~31 seconds while Felix Framework gave up waiting after 10 seconds - and tied to boot up XP. XP could not start again - because Hazelcast port (5701) was still occupied.

024-05-30T17:58:39,181 INFO  c.e.x.l.i.framework.FrameworkService - Resetting server...
2024-05-30T17:58:39,181 INFO  c.e.x.l.i.framework.FrameworkService - Server has been reset..

2024-05-30T18:05:52,941 INFO  c.e.x.l.i.framework.FrameworkService - Restarting server...
2024-05-30T18:05:52,941 INFO  c.e.x.l.i.framework.FrameworkService - Stopping server...

2024-05-30T18:06:02,946 INFO  c.e.x.l.i.framework.FrameworkService - Starting Enonic XP...
2024-05-30T18:06:08,337 ERROR com.hazelcast.instance.AddressPicker - [LOCAL] [3.12.13#OSS] [3.12.13] Cannot bind to a given address: /10.186.32.60. Hazelcast cannot start. Port [5701] is already in use and auto-increment is disabled.
com.hazelcast.core.HazelcastException: Cannot bind to a given address: /10.186.32.60. Hazelcast cannot start. Port [5701] is already in use and auto-increment is disabled.

2024-05-30T18:06:02,946 INFO  c.e.x.l.i.framework.FrameworkService - Server has been stopped
2024-05-30T18:06:02,946 INFO  c.e.x.l.i.framework.FrameworkService - Starting Enonic XP...

2024-05-30T18:06:09,795 INFO  c.h.i.partition.impl.MigrationThread - [10.186.32.60]:5701 [3.12.13#OSS] [3.12.13] All migration tasks have been completed. (lastRepartitionTime=Thu May 30 18:06:05 CEST 2024, completedMigrations=1686, totalCompletedMigrations=2328, elapsedMigrationTime=3588ms, totalElapsedMigrationTime=3588ms)

2024-05-30T18:06:23,780 INFO  com.hazelcast.instance.Node - [10.186.32.60]:5701 [3.12.13#OSS] [3.12.13] Shutting down node engine...
2024-05-30T18:06:23,956 INFO  org.elasticsearch.node - [b31apvl00112] stopped
2024-05-30T18:06:23,956 INFO  org.elasticsearch.node - [b31apvl00112] closing ...
2024-05-30T18:06:23,970 INFO  org.elasticsearch.node - [b31apvl00112] closed

Solution: increase XP wait (FrameworkService.WAIT_FOR_STOP_TIMEOUT_MS) to 600 seconds for restart. And don't attempt to start the server, it shutdown timed out - it might end up in a bad condition (because other Java threads not belonging to OSGi might still be alive) and requires manual restart.