Closed by lesteve 5 years ago.
Looks good on HTCondor. I mainly checked that I can scale the number of workers up and down and submit work to them.
@mivade glad to hear that, thanks a lot for trying it out!
If you could also check simple adaptive use cases, that would be great, e.g. something like this:
cluster.adapt(minimum_jobs=0, maximum_jobs=2)
# func should take a few seconds so that the cluster needs to scale up
futures = client.map(func, list_of_args)
# make sure that the cluster scales up to at most 2 jobs and then back down to 0 jobs
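To make this concrete, here is a self-contained sketch of such a test. It is only an illustration: SGECluster is an arbitrary choice (any JobQueueCluster subclass should behave the same way), and the resources, sleep time, and inputs are made up.

    import time

    from dask.distributed import Client
    from dask_jobqueue import SGECluster  # any JobQueueCluster subclass works here

    cluster = SGECluster(cores=1, memory="2GB")
    cluster.adapt(minimum_jobs=0, maximum_jobs=2)
    client = Client(cluster)

    def func(x):
        time.sleep(5)  # slow enough that the cluster has to scale up
        return x ** 2

    futures = client.map(func, range(20))
    results = client.gather(futures)

    del futures  # drop the references so adaptive scaling can go back to 0 jobs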
Adaptive mostly works. The only issue I see is that I end up with a single worker after all jobs are complete, even with minimum_jobs=0.
Do you perhaps still have a future lying around? If so, Dask needs to keep a worker around to hold the results. You can check client.futures.
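For example (a minimal check; futures below stands for whatever variable holds your futures):

    print(client.futures)  # a non-empty dict means results are still pinned to workers
    del futures            # dropping the last reference lets the cluster scale back to 0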
Rookie mistake. That was exactly it, @mrocklin.
@lesteve let me know if there are any other tests I can do.
We must all be rookies then :)
> SLURMCluster: tested by @mrocklin. Not sure who else would be a good person.
@lesteve, at NCAR we have another cluster that uses SLURM. I will give it a shot as well and report back on how it goes.
> We must all be rookies then :)

I did not know about client.futures, nice! Count me in as a rookie too!
Looks good on both PBS and SLURM! Thanks @mrocklin and @lesteve for all the work in #307!
I have a few comments about the documentation.
1) In the past, I found the documentation on .scale() and .adapt() under dask-jobqueue to provide very little detail on how to use these two features. As a workaround, I always referred to https://distributed.dask.org/en/latest/api.html#adaptive. If I understand this correctly, with the introduction of the jobs argument, .scale() and .adapt() from dask-jobqueue are now a bit different from .scale() and .adapt() from distributed. I was wondering if it would be useful to improve the documentation for these two features with examples that showcase different ways to scale a cluster up and down with dask-jobqueue?
For instance, I think that it would be useful to have some documentation on

cluster.adapt(minimum_jobs=0, maximum_jobs=20)

vs

cluster.adapt(minimum=0, maximum=20)

and on

cluster.scale(10)

vs

cluster.scale(jobs=10)  # ask for 10 jobs
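For what it's worth, my understanding of the difference (worth double-checking while writing those docs): plain counts refer to Dask workers, while jobs= refers to queue jobs, and one job can host several workers when the cluster is created with processes > 1. Roughly:

    # hypothetical cluster created with processes=2, i.e. 2 Dask workers per job
    cluster.scale(10)       # target 10 workers, so 5 jobs get submitted
    cluster.scale(jobs=10)  # target 10 jobs, so 20 workers in total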
Good point about the docs: I am guessing this is because the docstring comes from SpecCluster. Adding a docstring to JobQueueCluster.scale and JobQueueCluster.adapt would be a very welcome PR!
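To make the suggestion concrete, here is a rough sketch of what such a docstring could look like. The wording is illustrative, not from the project, and the signature follows my reading of the new scale API:

    class JobQueueCluster(SpecCluster):
        def scale(self, n=None, jobs=0):
            """Scale the cluster to n workers or to a number of jobs.

            Parameters
            ----------
            n : int
                Target number of Dask workers.
            jobs : int
                Target number of queue jobs. Each job may contain
                several workers, depending on ``processes``.

            Examples
            --------
            >>> cluster.scale(10)       # ask for 10 workers
            >>> cluster.scale(jobs=10)  # ask for 10 jobs
            """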
The latest version does not work on our LSF cluster as is (with bsub version 10.1.0.0): changing the submit command (and method) to use bsub < as opposed to the new bsub fixes things (related to #328). I'm not sure if our cluster is just the odd one out (in which case I can just use a subclassed LSFJob and LSFCluster), but it might be good to add an option to choose which command to use, if this is more common.
Thanks for checking. Is this due to a version change in LSF? If so, is there a good way to automatically detect the LSF version?
@SimonBoothroyd ah, that is a very good point, thanks a lot for testing! It did not cross my mind that this change went into #307 ... IIRC this was needed by @mrocklin when he was testing LSFCluster on Summit.
@mrocklin my understanding is that it would indeed be good to use the right command (plus a code change, if more is needed) depending on the LSF version.
@SimonBoothroyd could you post a diff that makes dask-jobqueue master work for your LSFCluster?
> Is this due to a version change in LSF?

I think mostly so, as @lesteve points out. Strangely, on our cluster bsub -V says it is version 10.1.0.0 yet it has the same CLI syntax as 9.X.X, so I guess the version number isn't guaranteed to match which command is required, depending on how the sysadmin has configured / customised things. It would be good to include a manual switch to handle such (possibly?) edge cases.
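For reference, a minimal sketch of probing the version with bsub -V (the parsing is illustrative, and given the mismatch described above, any automatic choice should still be overridable by hand):

    import re
    import subprocess

    def detect_lsf_version():
        # "bsub -V" prints a banner containing the version, e.g. "10.1.0.0"
        proc = subprocess.run(["bsub", "-V"], capture_output=True, text=True)
        match = re.search(r"\d+(\.\d+)+", proc.stdout + proc.stderr)
        return match.group(0) if match else None

    print(detect_lsf_version())  # e.g. "10.1.0.0", or None if not parseable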
> Could you post a diff that makes dask-jobqueue master work for your LSFCluster?
Sure, the diff for lsf.py is:
13c13
< submit_command = "bsub <"
---
> submit_command = "bsub"
93,96d92
<
< async def _submit_job(self, script_filename):
< piped_cmd = [self.submit_command + " " + script_filename + " 2> /dev/null"]
< return self._call(piped_cmd, shell=True)
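Until a proper fix lands, the subclassing workaround could look roughly like this (a sketch based on the diff above; PipedLSFJob and PipedLSFCluster are made-up names, and the job_cls hook follows my reading of the post-#307 API):

    from dask_jobqueue.lsf import LSFCluster, LSFJob

    class PipedLSFJob(LSFJob):
        # older LSF CLIs expect the job script on stdin: "bsub < script.sh"
        submit_command = "bsub <"

        async def _submit_job(self, script_filename):
            piped_cmd = [self.submit_command + " " + script_filename + " 2> /dev/null"]
            return self._call(piped_cmd, shell=True)

    class PipedLSFCluster(LSFCluster):
        job_cls = PipedLSFJob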
Here is a possible solution for LSF's stdin problem: #347
Looks great!
FWIW, I've been using the master branch for both PBS and Slurm. Things have been working quite smoothly, so from my perspective those two clusters can be checked off. I would love to see a release come out this week, so if there are things I can do to help make that happen, let me know.
OK, things here seem to have quieted down. If no one has any strong objections, I plan to release this tomorrow.
@mrocklin - let me know if you need any help issuing the release.
I can do it, but I would also be very happy if someone else wanted to take it on. I'm generally in favor of making sure that more people feel comfortable issuing releases.
I'm going ahead with this now.
0.7 is up on PyPI. Closing.
I was on holiday last week. Great to see that 0.7 has been released. Thanks a lot for this everyone!
There has been a non-trivial refactoring in #307 to use SpecCluster from distributed. It would be great if people could test this on their cluster to avoid unexpected problems when we release the next version of dask-jobqueue (probably a matter of one week or two):

- PBSCluster: @guillaumeeb @andersy005. This has already been tested while #307 was being worked on, I think.
- HTCondor: @matyasselmeci @cloudronin @mivade
- LSFCluster: tested by @mrocklin (on Summit, I think). Not sure who else would be a good person.
- SLURMCluster: tested by @mrocklin. Not sure who else would be a good person.
- SGECluster: I can test this. Maybe @ericmjl can try it out as well.
- OARCluster: I can test this.

To install the dask-jobqueue development version, see the sketch below. You want to make sure that you have distributed >= 2.5.1 (see #342 for more details).

There is a list of known breaking changes: https://github.com/dask/dask-jobqueue/blob/master/docs/source/changelog.rst#changelog

If you spot some more breaking changes, let us know, so we can improve the changelog!
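A typical way to install the development version, assuming pip and git are available (this exact command is a guess, not the one from the original post):

    pip install --upgrade git+https://github.com/dask/dask-jobqueue.git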