Here are logs where after snapshot restore Hazelcast was still alive for ~31 seconds while Felix Framework gave up waiting after 10 seconds - and tied to boot up XP. XP could not start again - because Hazelcast port (5701) was still occupied.
024-05-30T17:58:39,181 INFO c.e.x.l.i.framework.FrameworkService - Resetting server...
2024-05-30T17:58:39,181 INFO c.e.x.l.i.framework.FrameworkService - Server has been reset..
2024-05-30T18:05:52,941 INFO c.e.x.l.i.framework.FrameworkService - Restarting server...
2024-05-30T18:05:52,941 INFO c.e.x.l.i.framework.FrameworkService - Stopping server...
2024-05-30T18:06:02,946 INFO c.e.x.l.i.framework.FrameworkService - Starting Enonic XP...
2024-05-30T18:06:08,337 ERROR com.hazelcast.instance.AddressPicker - [LOCAL] [3.12.13#OSS] [3.12.13] Cannot bind to a given address: /10.186.32.60. Hazelcast cannot start. Port [5701] is already in use and auto-increment is disabled.
com.hazelcast.core.HazelcastException: Cannot bind to a given address: /10.186.32.60. Hazelcast cannot start. Port [5701] is already in use and auto-increment is disabled.
2024-05-30T18:06:02,946 INFO c.e.x.l.i.framework.FrameworkService - Server has been stopped
2024-05-30T18:06:02,946 INFO c.e.x.l.i.framework.FrameworkService - Starting Enonic XP...
2024-05-30T18:06:09,795 INFO c.h.i.partition.impl.MigrationThread - [10.186.32.60]:5701 [3.12.13#OSS] [3.12.13] All migration tasks have been completed. (lastRepartitionTime=Thu May 30 18:06:05 CEST 2024, completedMigrations=1686, totalCompletedMigrations=2328, elapsedMigrationTime=3588ms, totalElapsedMigrationTime=3588ms)
2024-05-30T18:06:23,780 INFO com.hazelcast.instance.Node - [10.186.32.60]:5701 [3.12.13#OSS] [3.12.13] Shutting down node engine...
2024-05-30T18:06:23,956 INFO org.elasticsearch.node - [b31apvl00112] stopped
2024-05-30T18:06:23,956 INFO org.elasticsearch.node - [b31apvl00112] closing ...
2024-05-30T18:06:23,970 INFO org.elasticsearch.node - [b31apvl00112] closed
Solution: increase XP wait (FrameworkService.WAIT_FOR_STOP_TIMEOUT_MS) to 600 seconds for restart. And don't attempt to start the server, it shutdown timed out - it might end up in a bad condition (because other Java threads not belonging to OSGi might still be alive) and requires manual restart.
Here are logs where after snapshot restore Hazelcast was still alive for ~31 seconds while Felix Framework gave up waiting after 10 seconds - and tied to boot up XP. XP could not start again - because Hazelcast port (5701) was still occupied.
Solution: increase XP wait (
FrameworkService.WAIT_FOR_STOP_TIMEOUT_MS
) to 600 seconds for restart. And don't attempt to start the server, it shutdown timed out - it might end up in a bad condition (because other Java threads not belonging to OSGi might still be alive) and requires manual restart.