BiBiServ / bibigrid

BiBiGrid is a tool for an easy cluster setup inside a cloud environment.
Apache License 2.0

Slurm Startup Issues Under Heavy Use #461

Open XaverStiensmeier opened 7 months ago

XaverStiensmeier commented 7 months ago

Connection (1/242)

This is not a huge issue given that the worker will be shut down and then restarted.

Termination

Two machines with the same name running when terminate is called (1/242)

Traceback (most recent call last):
  File "/usr/local/bin/delete_server.py", line 59, in <module>
    result = connections[worker_group["cloud_identifier"]].delete_server(terminate_worker)
  File "/usr/local/lib/python3.10/dist-packages/openstack/cloud/_compute.py", line 1203, in delete_server
    server = self.get_server(name_or_id, bare=True)
  File "/usr/local/lib/python3.10/dist-packages/openstack/cloud/_compute.py", line 518, in get_server
    server = _utils._get_entity(self, searchfunc, name_or_id, filters)
  File "/usr/local/lib/python3.10/dist-packages/openstack/cloud/_utils.py", line 201, in _get_entity
    raise exc.OpenStackCloudException(
openstack.exceptions.SDKException: Multiple matches found for bibigrid-worker-1ttfuh6793qbrky-55

This might lead to abandoned instances, but in this case the worker started before the system tried to start it again; maybe Slurm "forgot" to shut it down. Possible fixes: terminate all matching instances, or make Slurm aware that an instance already exists. A sketch of the first option follows below.
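A possible workaround could look roughly like this (untested sketch; conn stands for the openstacksdk connection already used in delete_server.py and worker_name for the affected node name). Instead of delete_server(name), which raises on multiple matches, it lists all servers with that exact name and deletes each one by ID:

import openstack

def delete_all_matching(conn: openstack.connection.Connection, worker_name: str) -> int:
    """Delete every server whose name matches worker_name exactly; return the count."""
    deleted = 0
    for server in conn.compute.servers(name=worker_name):
        # The Nova "name" filter is a regex/substring match, so compare exactly
        # before deleting to avoid hitting unrelated workers.
        if server.name == worker_name:
            conn.compute.delete_server(server.id, ignore_missing=True)
            deleted += 1
    return deleted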

Invalid Node State (1/242)

Can probably be fixed by upgrading Slurm.

Connection pool is full

Can probably be fixed by increasing maxsize (see https://stackoverflow.com/questions/53765366/urllib3-connectionpool-connection-pool-is-full-discarding-connection), but this is not really an issue: the warning does not mean that the connection is lost, only that it isn't kept open afterwards for easy reuse.
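If the warning becomes noisy, one way to raise the limit could be to hand keystoneauth a requests.Session whose adapter has a larger pool_maxsize and build the openstacksdk connection from that session. Rough, untested sketch; all auth values are placeholders:

import requests
from requests.adapters import HTTPAdapter
from keystoneauth1 import session as ks_session
from keystoneauth1.identity import v3
import openstack

# requests.Session with a larger urllib3 pool (default pool_maxsize is 10)
http = requests.Session()
adapter = HTTPAdapter(pool_connections=50, pool_maxsize=50)
http.mount("https://", adapter)
http.mount("http://", adapter)

auth = v3.Password(
    auth_url="https://cloud.example.org:5000/v3",  # placeholder
    username="USER", password="PASS",              # placeholders
    project_name="PROJECT",
    user_domain_name="Default", project_domain_name="Default",
)
conn = openstack.connection.Connection(session=ks_session.Session(auth=auth, session=http))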

OpenStack Instances Are Marked as ERROR or SHUTDOWN

SlurmSpoolDir full

Might be caused by temporary files filling the spool directory.

XaverStiensmeier commented 7 months ago

The "Invalid node state specified" error might be solved by upgrading to the newest Slurm version (23).

XaverStiensmeier commented 4 months ago

Currently testing whether setting a new tmp folder avoids the "SlurmSpoolDir full" error.
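A rough sketch of what moving those directories could look like in slurm.conf (paths are placeholders; SlurmdSpoolDir defaults to /var/spool/slurmd and TmpFS to /tmp):

# slurm.conf sketch -- both paths are placeholders pointing at a larger volume
SlurmdSpoolDir=/vol/spool/slurmd
TmpFS=/vol/tmp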

XaverStiensmeier commented 2 weeks ago

Two machines with the same name might have been caused by the default SuspendTimeout of 30 seconds. Basically, once a node has been told to power down, Slurm believes it is powered down after 30 seconds regardless of the exit code.
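If that is the cause, raising the timeout in slurm.conf so Slurm waits longer before it assumes the node is powered down might avoid the duplicate start. The value below is only an illustration, not a tested recommendation:

# slurm.conf sketch -- SuspendTimeout is the time Slurm allows for a node to
# finish powering down (default 30 s) before it treats the node as down
SuspendTimeout=300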