esi-neuroscience / acme

Asynchronous Computing Made ESI
https://esi-acme.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Make user interface more dynamic #12

Closed: KatharineShapcott closed this issue 3 years ago

KatharineShapcott commented 3 years ago

Hi Stefan, it looks to me like the code doesn't recognize when the number of workers changes due to a timeout; at least the printouts don't change. Nothing crashed though, so maybe it doesn't matter?

<ParallelMap> INFO: <esi_cluster_setup> SLURM jobs could not be started within given time-out interval of 180 seconds
<esi_cluster_setup> Do you want to [k]eep waiting for 60s, [a]bort or [c]ontinue with 462 workers? c
<ParallelMap> INFO: <esi_cluster_setup> Cluster dashboard accessible at http://10.100.32.7:8787/status
<ParallelMap> INFO: Preparing 500 parallel calls of `comparison_classifier` using 500 workers
<ParallelMap> INFO: Log information available at /mnt/hpx/slurm/shapcottk/shapcottk_20201127-193128
100% || 100/100 [28:45<00:00]
<ParallelMap> INFO: SUCCESS! Finished parallel computation. Results have been saved to /mnt/hpx/home/shapcottk/ACME_20201127-193128-786516
<ParallelMap> INFO: <cluster_cleanup> Successfully shut down cluster shapcottk_20201127-193128 containing 500 workers

~PS I think 180 seconds might be a bit too long for the timeout, maybe we could switch to 60s? It happens every time if you have more than a few hundred jobs. Or we could display these options after 60s and, if nothing happens for another 120s, automatically continue?~ see #13 ~Or, in slurmfun, jobs would all go into the queue and run when resources became available. Maybe we should be using cluster.adapt() instead?~ see #14
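
For reference, one way to check how many workers actually joined the cluster, independent of the printouts, is to ask the dask scheduler directly. The snippet below is a generic dask/dask_jobqueue sketch, not ACME's internal code; the SLURMCluster settings and the requested worker count are placeholder assumptions.

```python
# Generic dask/dask_jobqueue sketch (not ACME internals): compare the
# requested worker count with what the scheduler actually reports.
from dask_jobqueue import SLURMCluster
from distributed import Client

n_requested = 500                      # hypothetical target, mirroring the log
cluster = SLURMCluster(queue="8GBS",   # placeholder SLURM settings
                       cores=1,
                       memory="8GB")
cluster.scale(n_requested)
client = Client(cluster)

# The scheduler's own bookkeeping reflects the workers that really connected,
# so this number can be polled again after the start-up time-out has elapsed.
n_actual = len(client.scheduler_info()["workers"])
print(f"requested {n_requested}, connected {n_actual}")
```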

KatharineShapcott commented 3 years ago

Sorry that was a lot of suggestions! I'd be happy to test out cluster.adapt() if you think it would be useful :)

pantaray commented 3 years ago

Hey Katharine! Thanks for reporting this! Unfortunately, the current setup is pretty rigid with respect to active workers (the code only checks for active workers at startup, then it just assumes everything is fine). So yes, I'd really appreciate it if you would be willing to test cluster.adapt() - as you pointed out, I think that might be the better way to go for large worker counts (it has been a little finicky w/ SLURM in the past, but in the meantime both dask and dask_jobqueue have gotten pretty substantial updates, so it's definitely worth checking out again). RE: timeouts - I totally agree. If you don't mind, I'll go ahead and split this issue up into several bug reports/feature requests.
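
For context, a minimal sketch of what adaptive scaling with dask_jobqueue could look like; the SLURMCluster arguments and limits are placeholders, not ACME's actual configuration.

```python
# Adaptive-scaling sketch with dask_jobqueue (placeholder settings): instead
# of requesting a fixed worker count up front, let the scheduler grow and
# shrink the pool as SLURM jobs clear the queue.
from dask_jobqueue import SLURMCluster
from distributed import Client

cluster = SLURMCluster(queue="8GBS",        # hypothetical partition name
                       cores=1,
                       memory="8GB",
                       walltime="08:00:00")

# Scale between 1 and 500 workers depending on the scheduled workload;
# queued SLURM jobs simply join the pool whenever resources free up.
cluster.adapt(minimum=1, maximum=500)

client = Client(cluster)
# Submitted computations then run on however many workers are currently
# available, rather than assuming a fixed count determined at startup.
```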