dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

Try out dask-jobqueue development version on different clusters #344

Closed · lesteve closed this issue 5 years ago

lesteve commented 5 years ago

There has been a non-trivial refactoring in #307 to use SpecCluster from distributed. It would be great if people could test this on their cluster to avoid unexpected problems when we release the next version of dask-jobqueue (probably a matter of a week or two).

To install the dask-jobqueue development version:

pip install git+https://github.com/dask/dask-jobqueue

Make sure you have distributed >= 2.5.1 (see #342 for more details).
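For example, a quick way to confirm this from Python (just a version check, nothing dask-jobqueue specific):

import distributed
print(distributed.__version__)  # should print 2.5.1 or later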

There is a list of known breaking changes: https://github.com/dask/dask-jobqueue/blob/master/docs/source/changelog.rst#changelog

If you spot some more breaking changes, let us know, so we can improve the changelog!

mivade commented 5 years ago

Looks good on HTCondor. I mainly checked that I can scale up and down the number of workers and submit to them.

lesteve commented 5 years ago

@mivade glad to hear that, thanks a lot for trying it out!

If you could also check simple adaptive use cases, that would be great, e.g. something like this:

cluster.adapt(minimum_jobs=0, maximum_jobs=2)
# func should take a few seconds so that the cluster needs to scale up
client.map(func, list_of_args)
# make sure that the cluster scales up with 2 jobs at most and then down to 0 jobs
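If it helps, here is a minimal self-contained sketch of that test, assuming a SLURM cluster (swap in HTCondorCluster, PBSCluster, etc. and adjust cores/memory for your site):

import time
from dask.distributed import Client
from dask_jobqueue import SLURMCluster  # or HTCondorCluster, PBSCluster, ...

cluster = SLURMCluster(cores=1, memory="2GB")
cluster.adapt(minimum_jobs=0, maximum_jobs=2)
client = Client(cluster)

def func(x):
    time.sleep(10)  # slow enough that the cluster needs to scale up
    return x + 1

futures = client.map(func, range(20))
results = client.gather(futures)
# watch your job queue: at most 2 jobs should start, then drop back to 0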
mivade commented 5 years ago

Adaptive mostly works. The only issue I see is that I end up with a single worker after all jobs are complete even with minimum_jobs=0.

mrocklin commented 5 years ago

Do you perhaps still have a future lying around? If so, Dask needs to keep a worker around to hold the results. You can check with:

client.futures
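For illustration, a minimal self-contained version of that check, using a LocalCluster just to show the mechanism:

from dask.distributed import Client, LocalCluster

client = Client(LocalCluster(n_workers=1))
futures = client.map(lambda x: x + 1, range(5))
print(len(client.futures))  # > 0: results are still held, so a worker stays up
del futures                 # drop the references and the results can be released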


mivade commented 5 years ago

Rookie mistake. That was exactly it, @mrocklin.

@lesteve let me know if there are any other tests I can do.

mrocklin commented 5 years ago

We must all be rookies then :)


andersy005 commented 5 years ago

SLURMCluster: tested by @mrocklin. Not sure who else would be a good person.

@lesteve, at NCAR, we have another cluster that uses SLURM. I will give it a shot as well, and will report back how it goes.

lesteve commented 5 years ago

We must all be rookies then :)

I did not know about client.futures, nice! Count me in as a rookie too!

andersy005 commented 5 years ago

Looks good on both PBS and SLURM! Thanks @mrocklin and @lesteve for all the work in #307!

I have a few comments about the documentation.

1) In the past, I found that the documentation on .scale() and .adapt() under dask-jobqueue provided very little detail on how to use these two features, so as a workaround I always referred to https://distributed.dask.org/en/latest/api.html#adaptive. If I understand correctly, with the introduction of the jobs argument, .scale() and .adapt() in dask-jobqueue are now a bit different from .scale() and .adapt() in distributed. Would it be useful to improve the documentation for these two features with examples that showcase the different ways to scale a cluster up and down with dask-jobqueue?

For instance, I think it would be useful to have some documentation on the following (see also the sketch after this list):

cluster.adapt(minimum_jobs=0, maximum_jobs=20)

vs

cluster.adapt(minimum=0, maximum=20)

cluster.scale(10)

vs

cluster.scale(jobs=10)  # ask for 10 jobs
2) The docstrings appear to be outdated:

(Two screenshots showing the outdated scale/adapt docstrings.)
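For example, the jobs/workers distinction could be illustrated with something like this (assuming, say, a job template that starts 4 workers per job; the numbers are only illustrative):

# with e.g. processes=4, each submitted job provides 4 Dask workers
cluster.scale(jobs=10)   # ask for 10 jobs (here, up to 40 workers)
cluster.scale(40)        # ask for 40 workers (dask-jobqueue submits 10 jobs)

cluster.adapt(minimum_jobs=0, maximum_jobs=20)  # bounds expressed in jobs
cluster.adapt(minimum=0, maximum=80)            # bounds expressed in workers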

lesteve commented 5 years ago

Good point about the docs: I am guessing this is because the docstring comes from SpecCluster. Adding docstrings to JobQueueCluster.scale and JobQueueCluster.adapt would be a very welcome PR!

SimonBoothroyd commented 5 years ago

The latest version does not work on our LSF cluster as-is (with bsub version 10.1.0.0): changing the submit command (and method) back to bsub < instead of the new plain bsub fixes things (related to #328). I'm not sure whether our cluster is just the odd one out (in which case I can simply use subclassed LSFJob and LSFCluster), but if this is more common it might be good to add an option for choosing which command to use.

mrocklin commented 5 years ago

Thanks for checking. Is this due to a version change in LSF? If so, is there a good way to automatically detect the LSF version?


lesteve commented 5 years ago

@SimonBoothroyd ah, that is a very good point, thanks a lot for testing! It did not cross my mind that this change went in with #307... IIRC it was needed by @mrocklin when he was testing LSFCluster on Summit.

@mrocklin my understanding is that it would indeed be good to use the right command (plus a further code change if one is needed) depending on the LSF version.
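One rough idea (only a sketch, not what any PR actually does) would be to parse the version out of the bsub -V banner:

import re
import subprocess

def detect_lsf_version():
    # bsub -V prints a version banner (e.g. "... 10.1.0.0 ...")
    proc = subprocess.run(["bsub", "-V"], capture_output=True, text=True)
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", proc.stdout + proc.stderr)
    return tuple(int(part) for part in match.groups()) if match else None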

@SimonBoothroyd could you post a diff that makes dask-jobqueue master work for your LSFCluster?

SimonBoothroyd commented 5 years ago

Is this due to a version change in LSF?

I think so, mostly, as @lesteve points out. Strangely, on our cluster bsub -V says it is version 10.1.0.0 yet it has the same CLI syntax as 9.X.X, so I guess the version number isn't guaranteed to tell you which command is required; it depends on how the sysadmin has configured / customised things. It would be good to include a manual switch to handle such (possibly?) edge cases.

could you post a diff that makes dask-jobqueue master work for your LSFCluster?

Sure, the diff for lsf.py is:

13c13
<     submit_command = "bsub <"
---
>     submit_command = "bsub"
93,96d92
< 
<     async def _submit_job(self, script_filename):
<         piped_cmd = [self.submit_command + " " + script_filename + " 2> /dev/null"]
<         return self._call(piped_cmd, shell=True)
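Packaged as a subclass rather than a patch, that workaround could look roughly like this (a sketch assuming the job_cls hook; the class names are made up):

from dask_jobqueue import LSFCluster
from dask_jobqueue.lsf import LSFJob

class StdinLSFJob(LSFJob):
    # pipe the job script to bsub via stdin, as this LSF setup expects
    submit_command = "bsub <"

    async def _submit_job(self, script_filename):
        piped_cmd = [self.submit_command + " " + script_filename + " 2> /dev/null"]
        return self._call(piped_cmd, shell=True)

class StdinLSFCluster(LSFCluster):
    job_cls = StdinLSFJob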
mrocklin commented 5 years ago

Here is a possible solution for LSF's stdin problem: https://github.com/dask/dask-jobqueue/pull/347

SimonBoothroyd commented 5 years ago

Here is a possible solution for LSF's stdin problem: #347

Looks great!

jhamman commented 5 years ago

FWIW, I've been using the master branch for both PBS and Slurm. Things have been working quite smoothly, so from my perspective those two clusters can be checked off. I would love to see a release come out this week, so if there are things I can do to help make that happen, let me know.

mrocklin commented 5 years ago

OK, things here seem to have quieted down. If no one has any strong objections I plan to release this tomorrow.

jhamman commented 5 years ago

@mrocklin - let me know if you need any help issuing the release.

mrocklin commented 5 years ago

I can do it, but I would also be very happy if someone else wanted to take it on. I'm generally in favor of making sure that more people feel comfortable issuing releases.

mrocklin commented 5 years ago

I'm going ahead with this now.

mrocklin commented 5 years ago

0.7 is up on PyPI. Closing.

lesteve commented 5 years ago

I was on holiday last week. Great to see that 0.7 has been released. Thanks a lot for this, everyone!