UQ-RCC / nimrodg

Nimrod/G
https://rcc.uq.edu.au/nimrod
Apache License 2.0

Handle long-spawning agents. #34

Closed vs49688 closed 4 years ago

vs49688 commented 4 years ago

Heartbeating will sometimes mark an agent for expiry while it is still in WAITING_FOR_HELLO. This can happen when the agent is stuck in a PBS/SLURM queue.

Options:

  1. Make AgentScheduler#onAgentExpiry() accept launching agents.

    • It is up to the scheduler to handle this.
    • The agent may launch and connect later.
  2. In Master#doExpire() call AgentScheduler#onAgentLaunchFailure().

    • The actuator may not know about the expiry, which causes issues.
  3. Ask the actuator:

Something like this in Actuator:

/** Agent status from an actuator's POV. */
enum AgentStatus {
    /** Agent is still launching. May be stuck in a queue. */
    Launching,
    /** Agent has launched, but not connected yet. */
    Launched,
    /** Agent has connected. */
    Connected,
    /** Agent has disconnected. */
    Disconnected,
    /** Unknown. The agent may not be ours, or we have stopped tracking it. */
    Unknown
}

/** Ask the actuator for its view of an agent. Actuators that don't track this report Unknown. */
default AgentStatus queryStatus(UUID uuid) {
    return AgentStatus.Unknown;
}

If the status is Launching, either do nothing or extend the waiting time. Otherwise, continue with expiry as normal.
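
A rough sketch of how the expiry check could consult the actuator under option 3. Only AgentStatus and queryStatus(UUID) come from the proposal above; ExpirySketch, reviewDeadline, the cut-down Actuator interface, and the deadline/extension parameters are hypothetical names for illustration, not the existing Master or Actuator code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;
import java.util.UUID;

/** Illustrative only: a stand-alone model of the "ask the actuator" option. */
final class ExpirySketch {
    /** Agent status from an actuator's POV, as proposed above. */
    enum AgentStatus { Launching, Launched, Connected, Disconnected, Unknown }

    /** Cut-down stand-in for the real Actuator interface (assumption). */
    interface Actuator {
        default AgentStatus queryStatus(UUID uuid) {
            return AgentStatus.Unknown;
        }
    }

    /**
     * Decide what to do with an agent's expiry deadline.
     *
     * @return the deadline to keep using, or empty if the agent should be
     *         expired as normal.
     */
    static Optional<Instant> reviewDeadline(Actuator actuator, UUID uuid,
                                            Instant deadline, Instant now,
                                            Duration extension) {
        if (now.isBefore(deadline)) {
            // Not due yet, nothing to decide.
            return Optional.of(deadline);
        }
        if (actuator.queryStatus(uuid) == AgentStatus.Launching) {
            // Still launching (possibly queued): extend the waiting time.
            return Optional.of(now.plus(extension));
        }
        // Launched, Connected, Disconnected or Unknown: expire as normal.
        return Optional.empty();
    }

    private ExpirySketch() {}
}
```

With this shape, an actuator that does not override queryStatus reports Unknown, so its agents expire exactly as they do today. Only actuators that can see the batch queue would need to return Launching.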