esi-neuroscience / acme

Asynchronous Computing Made ESI
https://esi-acme.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

ACME doesn't allow creation of a new client then crashes #47

Closed. KatharineShapcott closed this issue 1 year ago

KatharineShapcott commented 1 year ago

Sometimes I start a cluster, but then the time limit of the SLURM queue runs out and the workers are killed. If I then try to start a new cluster, e.g.:
    acme.esi_cluster_setup(partition="8GBXS", n_workers=10, n_workers_startup=2, timeout=10, interactive_wait=1)
I get a message like this:

Syncopy <ACME: esi_cluster_setup> Found existing parallel computing client <Client: 'tcp://10.100.32.17:40905' processes=0 threads=0, memory=0 B>. Not starting new cluster.

However, when I then try to use pmap with this client, it crashes:

RuntimeError: <ACMEdaemon> no active workers found in distributed computing cluster <Client: 'tcp://10.100.32.17:40905' processes=0 threads=0, memory=0 B> Consider running 
    import dask.distributed as dd; dd.get_client().restart()
If this fails to make workers come online, please use
    import acme; acme.cluster_cleanup()
to shut down any defunct distributed computing clients

Could a client with 0 active workers be detected, automatically cleaned up, and replaced with a new one (or something similar)? It's not a big deal though; I can easily work around it.
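
For illustration, the detect-and-replace behavior asked about here could be sketched roughly as follows. This is only a sketch of a user-side workaround, not how ACME handles it internally: the helper names `has_active_workers` and `ensure_usable_client` are hypothetical, and the sketch assumes the existing client exposes `scheduler_info()` with a `"workers"` entry (as `dask.distributed.Client` does) and that the setup callable returns a usable client.

```python
def has_active_workers(client) -> bool:
    """Return True if the client's scheduler reports at least one worker.

    Assumes `client.scheduler_info()` returns a dict with a "workers"
    mapping, as dask.distributed.Client does.
    """
    info = client.scheduler_info()
    return len(info.get("workers", {})) > 0


def ensure_usable_client(get_client, cleanup, setup):
    """Reuse the current client if it has workers, otherwise replace it.

    All three arguments are callables supplied by the caller
    (hypothetical wiring, see the commented usage below):
      get_client -- returns the current client, raises ValueError if none
      cleanup    -- shuts down a defunct client/cluster
      setup      -- starts a new cluster and returns its client
    """
    try:
        client = get_client()
    except ValueError:
        # No client registered at all: nothing to clean up.
        client = None

    if client is not None and has_active_workers(client):
        return client

    if client is not None:
        # Client exists but all workers are gone (e.g. SLURM jobs timed out).
        cleanup()
    return setup()


# Possible usage with the calls mentioned in this issue (not executed here):
#
#   import acme
#   import dask.distributed as dd
#
#   client = ensure_usable_client(
#       dd.get_client,
#       acme.cluster_cleanup,
#       lambda: acme.esi_cluster_setup(partition="8GBXS", n_workers=10,
#                                      n_workers_startup=2, timeout=10,
#                                      interactive_wait=1),
#   )
```

Passing the three operations in as callables keeps the check itself independent of the cluster backend, so it can be tried out with stand-in objects before pointing it at a real SLURM cluster.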

pantaray commented 1 year ago

Fixed by #48