BiBiServ / bibigrid

BiBiGrid is a tool for an easy cluster setup inside a cloud environment.
Apache License 2.0

Slurm Startup Issues Under Heavy Use #461

Open XaverStiensmeier opened 7 months ago

XaverStiensmeier commented 7 months ago

Connection (1/242)

This is not a huge issue given that the worker will be shut down and then restarted.

Termination

Two machines with the same name running when terminate is called (1/242)

Traceback (most recent call last):
  File "/usr/local/bin/delete_server.py", line 59, in <module>
    result = connections[worker_group["cloud_identifier"]].delete_server(terminate_worker)
  File "/usr/local/lib/python3.10/dist-packages/openstack/cloud/_compute.py", line 1203, in delete_server
    server = self.get_server(name_or_id, bare=True)
  File "/usr/local/lib/python3.10/dist-packages/openstack/cloud/_compute.py", line 518, in get_server
    server = _utils._get_entity(self, searchfunc, name_or_id, filters)
  File "/usr/local/lib/python3.10/dist-packages/openstack/cloud/_utils.py", line 201, in _get_entity
    raise exc.OpenStackCloudException(
openstack.exceptions.SDKException: Multiple matches found for bibigrid-worker-1ttfuh6793qbrky-55

This might lead to abandoned instances, but in this case the worker started before the system tried to start it again; maybe Slurm "forgot" to shut it down. Possible fixes: terminate all matching instances, or make Slurm aware that an instance already exists. A sketch of the first option follows below.
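A possible workaround could look roughly like this (untested sketch; conn stands for the openstacksdk connection already used in delete_server.py and worker_name for the affected node name). Instead of delete_server(name), which raises on multiple matches, it lists all servers with that exact name and deletes each one by ID:

import openstack

def delete_all_matching(conn: openstack.connection.Connection, worker_name: str) -> int:
    """Delete every server whose name matches worker_name exactly; return the count."""
    deleted = 0
    for server in conn.compute.servers(name=worker_name):
        # The Nova "name" filter is a regex/substring match, so compare exactly
        # before deleting to avoid hitting unrelated workers.
        if server.name == worker_name:
            conn.compute.delete_server(server.id, ignore_missing=True)
            deleted += 1
    return deleted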

Invalid Node State (1/242)

Can probably be fixed by upgrading Slurm.

Connection pool is full

Can probably be fixed by increasing maxsize (see https://stackoverflow.com/questions/53765366/urllib3-connectionpool-connection-pool-is-full-discarding-connection), but this is not really an issue: the warning does not mean that the connection is lost, only that it isn't kept open afterwards for easy reuse.
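If the warning becomes noisy, one way to raise the limit could be to hand keystoneauth a requests.Session whose adapter has a larger pool_maxsize and build the openstacksdk connection from that session. Rough, untested sketch; all auth values are placeholders:

import requests
from requests.adapters import HTTPAdapter
from keystoneauth1 import session as ks_session
from keystoneauth1.identity import v3
import openstack

# requests.Session with a larger urllib3 pool (default pool_maxsize is 10)
http = requests.Session()
adapter = HTTPAdapter(pool_connections=50, pool_maxsize=50)
http.mount("https://", adapter)
http.mount("http://", adapter)

auth = v3.Password(
    auth_url="https://cloud.example.org:5000/v3",  # placeholder
    username="USER", password="PASS",              # placeholders
    project_name="PROJECT",
    user_domain_name="Default", project_domain_name="Default",
)
conn = openstack.connection.Connection(session=ks_session.Session(auth=auth, session=http))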

OpenStack Instances Are Marked as ERROR or SHUTDOWN

SlurmSpoolDir full

Might be caused by temporary files filling the spool directory.

XaverStiensmeier commented 7 months ago

The "Invalid node state specified" error might be solved by upgrading to the newest Slurm version (23).

XaverStiensmeier commented 4 months ago

Currently testing whether setting a new tmp folder avoids the "SlurmSpoolDir full" error.
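A rough sketch of what moving those directories could look like in slurm.conf (paths are placeholders; SlurmdSpoolDir defaults to /var/spool/slurmd and TmpFS to /tmp):

# slurm.conf sketch -- both paths are placeholders pointing at a larger volume
SlurmdSpoolDir=/vol/spool/slurmd
TmpFS=/vol/tmp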

XaverStiensmeier commented 2 weeks ago

Two machines with the same name might have been caused by the default SuspendTimeout of 30 seconds. Basically, once a node has been told to power down, Slurm believes it is powered down after 30 seconds regardless of the exit code.
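If that is the cause, raising the timeout in slurm.conf so Slurm waits longer before it assumes the node is powered down might avoid the duplicate start. The value below is only an illustration, not a tested recommendation:

# slurm.conf sketch -- SuspendTimeout is the time Slurm allows for a node to
# finish powering down (default 30 s) before it treats the node as down
SuspendTimeout=300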