Open delgadom opened 5 years ago
Another useful feature:
[ ] a wrapper that handles the IOErrors we sometimes get with large numbers of workers and keeps resubmitting the function until it succeeds or reaches a maximum number of retries
I feel like some of this stuff could be developed in dask_kubernetes, but it's maybe easier to get it up and running here and then see if it can be merged.
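A rough sketch of what that retry wrapper could look like, just to make the idea concrete. The name `retry_on_ioerror` and the `max_retries` / `delay` parameters are made up for illustration and are not an existing API here or in dask_kubernetes. It only wraps a plain callable; wiring it into job submission is left open.

```python
import time
from functools import wraps


def retry_on_ioerror(max_retries=3, delay=0.0):
    """Hypothetical decorator: re-run the wrapped function when it raises IOError.

    The function is called once, then retried up to ``max_retries`` more times.
    Any other exception type is passed through untouched.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except IOError:
                    retries += 1
                    if retries > max_retries:
                        # Out of retries: surface the last IOError to the caller.
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

Usage would be something like decorating the function before submitting it, e.g. `@retry_on_ioerror(max_retries=5, delay=1.0)`. Note this retries inside the worker call itself; retrying at the submission level (resubmitting a failed future) would need a different shape.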
Yeah. So this is just a helper function that handles the errors that frequently pop up in the chaos of crapton-of-workers land and then hammers the jobs until they complete?
I've actually found the cluster to be much more stable, even when running huge numbers of jobs. Have you encountered this recently?
nice! I haven't run a huge number of jobs in a long time (like since BR1 push). But yeah that's what I was thinking. I remember the IOError being the main issue. If we start experiencing this again, we can try to build in something like that maybe
Cluster spinup
Spin up a cluster and (optionally) wait for workers to appear, with a progress bar. Optionally use it as a context manager to spin the cluster down after job execution.
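One possible shape for this, as a sketch only. `temporary_cluster`, `make_cluster`, and `count_workers` are hypothetical names; the code assumes nothing about the cluster object beyond `scale(n)` and `close()` methods, and the caller supplies the function that reports how many workers have arrived.

```python
import sys
import time
from contextlib import contextmanager


@contextmanager
def temporary_cluster(make_cluster, n_workers, count_workers,
                      wait=True, poll=0.5, timeout=600.0):
    """Hypothetical helper: start a cluster, optionally wait for workers, always close it.

    make_cluster:  zero-argument callable returning an object with scale() and close()
    count_workers: callable taking the cluster and returning the number of live workers
    """
    cluster = make_cluster()
    try:
        cluster.scale(n_workers)
        if wait:
            deadline = time.monotonic() + timeout
            while True:
                ready = count_workers(cluster)
                # Crude one-line progress display, rewritten in place.
                sys.stderr.write(f"\rworkers: {min(ready, n_workers)}/{n_workers}")
                sys.stderr.flush()
                if ready >= n_workers:
                    break
                if time.monotonic() > deadline:
                    raise TimeoutError(f"only {ready}/{n_workers} workers appeared")
                time.sleep(poll)
            sys.stderr.write("\n")
        yield cluster
    finally:
        # Runs on normal exit, on timeout, and if the body of the with-block raises.
        cluster.close()
```

With this, `with temporary_cluster(...) as cluster:` would run the jobs inside the block and spin the cluster down afterwards, even if a job fails.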
Task management
Wait for futures to complete, with a progress bar
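A minimal sketch of the idea, using the standard library's `concurrent.futures` as a stand-in so it runs without a cluster. `wait_with_progress` is a hypothetical name; with dask futures the same loop would presumably be built on `distributed.as_completed` instead, which is an assumption and not tested here.

```python
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def wait_with_progress(futures, width=30, stream=sys.stderr):
    """Hypothetical helper: block until all futures finish, drawing a text progress bar.

    Returns the results in the same order the futures were passed in.
    """
    futures = list(futures)
    total = len(futures)
    done = 0
    for _ in as_completed(futures):
        done += 1
        filled = width * done // total
        bar = "#" * filled + "." * (width - filled)
        stream.write(f"\r[{bar}] {done}/{total}")
        stream.flush()
    if total:
        stream.write("\n")
    # result() re-raises any exception that occurred inside a task.
    return [f.result() for f in futures]


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        jobs = [pool.submit(pow, i, 2) for i in range(10)]
        print(wait_with_progress(jobs))
```

Collecting results in submission order (not completion order) keeps the output aligned with the inputs, which tends to matter when mapping over a list of job parameters.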
Other useful features