From the user's perspective, a newly submitted job makes no progress and no further jobs can be added. A relevant log entry from the stiltweb service:
Apr 10 23:54:58 fsicos2.lunarc.lu.se java[3672449]: [WARN] [04/10/2023 23:54:58.743] [StiltBoss-akka.remote.default-remote-dispatcher-6] [akka.stream.Log(akka://StiltBoss/system/Materializers/StreamSupervisor-1)] [outbound connection to [akka://WorkMaster@icos1.wg-fsicos2:2561], control stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
And a relevant log entry from the stiltcluster service:
Apr 11 10:56:56 icos1.gis.lu.se java[2315137]: [WARN] [04/11/2023 10:56:56.011] [WorkMaster-akka.remote.default-remote-dispatcher-5] [akka.stream.Log(akka://WorkMaster/system/Materializers/StreamSupervisor-1)] [outbound connection to [akka://StiltBoss@fsicos2.wg-fsicos2:2550], message stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
The connection was most likely aborted because of an idle timeout. If so, adding a periodic "keepalive" message exchange between WorkMaster and WorkReceptionist should solve the problem.
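To make the proposal concrete, here is a minimal sketch of the keepalive decision logic in plain Java. It is not the stiltweb or Akka API: the class name `KeepAliveSketch`, the 30-second interval, and the `KeepAlive` message name are all assumptions for illustration, and the actual idle timeout on the link is not known from the logs. In the real services the same check would presumably be driven by the actor system's own timers, with WorkMaster sending the message and WorkReceptionist replying.

```java
// Hypothetical helper, not part of stiltweb: decides when a keepalive is due.
class KeepAliveSketch {
    private final long intervalMillis; // assumed to be shorter than the idle timeout
    private long lastTrafficMillis;    // time of the last message sent on the link

    KeepAliveSketch(long intervalMillis, long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.lastTrafficMillis = nowMillis;
    }

    /** Record that some message (job-related or keepalive) was sent at nowMillis. */
    void onTraffic(long nowMillis) {
        lastTrafficMillis = nowMillis;
    }

    /** True when the link has been silent for at least one full interval. */
    boolean keepAliveDue(long nowMillis) {
        return nowMillis - lastTrafficMillis >= intervalMillis;
    }

    public static void main(String[] args) {
        // Simulated clock: check every 10 s, with an assumed 30 s keepalive interval.
        KeepAliveSketch link = new KeepAliveSketch(30_000, 0);
        for (long now = 0; now <= 60_000; now += 10_000) {
            if (link.keepAliveDue(now)) {
                System.out.println("t=" + now + " ms: send KeepAlive");
                link.onTraffic(now); // the keepalive itself counts as traffic
            }
        }
    }
}
```

Tracking the last traffic time, rather than sending on a fixed schedule, avoids extra messages while jobs are actively exchanged. Whether this helps depends on the idle-timeout hypothesis being correct.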