Open XaverStiensmeier opened 7 months ago
The "Invalid node state specified" error might be solved by upgrading to the newest Slurm version (23).
Currently testing whether setting a new tmp folder avoids the "SlurmSpoolDir full" error.
Two machines with the same name might have been caused by the default SuspendTimeout of 30 seconds. Basically, once a node has been told to power down, Slurm will consider it powered down after 30 seconds, no matter the exit code.
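If that is the cause, raising SuspendTimeout in slurm.conf should give slow instances more time to actually terminate before Slurm gives up on them. A minimal sketch; the 300 s value is an arbitrary example, not a tested recommendation:

```
# slurm.conf (power-saving / cloud scheduling settings)
# Default is 30 s; give instances more time to really power down
# before Slurm treats them as gone. Example value, tune as needed.
SuspendTimeout=300
```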
Connection (1/242)
This is not a huge issue, given that the worker will be shut down and then restarted.
Termination
Two machines with the same name running when terminate is called (1/242)
This might lead to abandoned instances. In this case the worker started before the system tried to start it again; maybe Slurm "forgot" to shut it down? Possible fixes: terminate all instances with that name, or make Slurm aware that an instance already exists.
Invalid Node State (1/242)
scontrol update NodeName="$1" state=RESUME reason=FailedStartup
failed with "Invalid node state specified". This can probably be fixed by upgrading Slurm.
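Until the upgrade, a possible workaround (untested here; `$1` is the node name placeholder from the script above) is to first mark the node DOWN and only then resume it, since older Slurm versions reject RESUME from some states:

```shell
# Force the node into a state from which RESUME is accepted,
# then clear it. Both commands talk to the live slurmctld.
scontrol update NodeName="$1" state=DOWN reason=FailedStartup
scontrol update NodeName="$1" state=RESUME reason=FailedStartup
```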
Connection pool is full
Can probably be fixed by increasing maxsize (see https://stackoverflow.com/questions/53765366/urllib3-connectionpool-connection-pool-is-full-discarding-connection), but it is not really an issue: the warning does not mean the connection is lost, only that it is not kept in the pool afterwards for easy reconnection.
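As a minimal sketch of that fix, assuming the requests go through urllib3 directly (the pool sizes below are arbitrary examples):

```python
import urllib3

# maxsize controls how many connections per host the pool keeps for reuse.
# Requests beyond maxsize still succeed; the surplus connections are simply
# discarded afterwards, which is exactly what triggers the warning.
http = urllib3.PoolManager(num_pools=10, maxsize=25)
```

If the client library is used indirectly (e.g. via an SDK), the equivalent setting usually has to be passed through that SDK's session or adapter configuration instead.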
OpenStack Instances Are Marked as ERROR or SHUTDOWN
SlurmSpoolDir full
Might be caused by temporary files filling up the spool directory's filesystem.
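To confirm, one can check where the spool directory lives and how full that filesystem is; a sketch, assuming the common default path /var/spool/slurmd (substitute whatever slurm.conf actually sets):

```shell
# Ask the running cluster where SlurmdSpoolDir points
scontrol show config | grep -i SlurmdSpoolDir
# Check free space on that filesystem and which entries are largest
df -h /var/spool/slurmd
du -sh /var/spool/slurmd/*
```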