Heartbeating sometimes marks an agent for expiry while it is still in WAITING_FOR_HELLO. This can happen when the agent is stuck in a PBS/SLURM queue.
Options:
1. Make AgentScheduler#onAgentExpiry() accept launching agents.
   - It is up to the scheduler to handle this.
   - The agent may launch and connect later.
2. In Master#doExpire(), call AgentScheduler#onAgentLaunchFailure().
   - The actuator may not know about the expiry, which causes issues.
3. Ask the actuator. Something like this in Actuator:
    /** Agent status from an actuator's POV. */
    enum AgentStatus {
        /** Agent is still launching. May be stuck in a queue. */
        Launching,
        /** Agent has launched, but not connected yet. */
        Launched,
        /** Agent has connected. */
        Connected,
        /** Agent has disconnected. */
        Disconnected,
        /** Unknown. The agent may not be ours, or we have stopped tracking it. */
        Unknown
    }

    default AgentStatus queryStatus(UUID uuid) {
        return AgentStatus.Unknown;
    }
If the state is Launching, then either do nothing or extend the waiting time. Otherwise continue as normal.
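
A minimal sketch of how the expiry path could consult the actuator, assuming the "extend the waiting time" variant. Only AgentStatus and queryStatus(UUID) come from the proposal above; ExpirySketch, reviewExpiry, and the grace period are hypothetical names for illustration, not existing code, and the real Master#doExpire() may need to be wired differently.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.UUID;

/** Hypothetical, standalone illustration of the "ask the actuator" option. */
final class ExpirySketch {
    /** Agent status from an actuator's POV (copied from the proposal). */
    enum AgentStatus { Launching, Launched, Connected, Disconnected, Unknown }

    /** Stand-in for the real Actuator interface; only queryStatus is modelled. */
    interface Actuator {
        default AgentStatus queryStatus(UUID uuid) {
            return AgentStatus.Unknown;
        }
    }

    /**
     * Decides what to do with an agent that heartbeating wants to expire.
     *
     * @return a new expiry deadline if the agent is still launching (for
     *         example, stuck in a queue), or empty if expiry should continue
     *         as normal
     */
    static Optional<Instant> reviewExpiry(
            Actuator actuator, UUID agent, Instant now, Duration grace) {
        if (actuator.queryStatus(agent) == AgentStatus.Launching) {
            // Still queued: extend the waiting time instead of expiring.
            return Optional.of(now.plus(grace));
        }
        // Launched, Connected, Disconnected, or Unknown: expire as usual.
        return Optional.empty();
    }

    public static void main(String[] args) {
        // An actuator that reports every agent as still queued.
        Actuator queued = new Actuator() {
            @Override
            public AgentStatus queryStatus(UUID uuid) {
                return AgentStatus.Launching;
            }
        };
        Optional<Instant> deadline = reviewExpiry(
                queued, UUID.randomUUID(), Instant.now(), Duration.ofMinutes(5));
        System.out.println(deadline.isPresent() ? "deferred" : "expired");
    }
}
```

Because the default queryStatus() returns Unknown, actuators that do not override it keep the current behaviour: their agents are expired as before.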