dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License
235 stars 142 forks

Make cli worker parameter flexible #606

Closed hmacdope closed 1 year ago

hmacdope commented 1 year ago

Fixes #229. Depends on https://github.com/rapidsai/dask-cuda/pull/1181.

Hi all!

This is a basic implementation of calling CLIs other than the distributed CLI.

This will require https://github.com/rapidsai/dask-cuda/pull/1181 to be merged first, since that PR adds a main entrypoint to dask-cuda-worker.

I have had to add a small shim to filter out some CLI args that are not shared between the dask-worker and dask-cuda-worker CLIs.
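To illustrate the idea behind such a shim, here is a minimal sketch. It is not the code in this PR: the helper name `filter_cli_args` is hypothetical, and the set of flags treated as unsupported in the usage example is an assumption chosen for illustration only.

```python
from typing import List, Set


def filter_cli_args(args: List[str], unsupported: Set[str]) -> List[str]:
    """Return args with every unsupported flag (and its value, if any) removed.

    Hypothetical helper: a flag's value is assumed to be the next token
    whenever that token does not itself start with "--".
    """
    kept: List[str] = []
    i = 0
    while i < len(args):
        token = args[i]
        if token in unsupported:
            i += 1  # drop the flag itself
            # Drop the flag's value too, unless the next token is another flag.
            if i < len(args) and not args[i].startswith("--"):
                i += 1
            continue
        kept.append(token)
        i += 1
    return kept


# Usage: which flags the target CLI rejects is an assumption here.
worker_args = [
    "--name", "dummy-name",
    "--nthreads", "1",
    "--nworkers", "2",
    "--no-nanny",
    "--memory-limit", "0.93GiB",
]
filtered = filter_cli_args(worker_args, unsupported={"--nworkers", "--no-nanny"})
print(" ".join(filtered))
```

Passing the unsupported set in as a parameter keeps the filter independent of any one worker CLI, so the caller decides what to drop per command.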

Very basic test to see that it works:

from dask import distributed
from dask_jobqueue.local import LocalCluster

# Point the cluster at the dask-cuda CLI module instead of the default worker CLI.
lc_gpu = LocalCluster(worker_command="dask_cuda.cli", cores=2, memory="2GB")
client = distributed.Client(lc_gpu)
lc_gpu.scale(2)

# Inspect the generated job script to confirm which command is launched.
print(lc_gpu.job_script())
# Output:
# /home/hmacdope/anaconda3/envs/dask_dev/bin/python -m dask_cuda.cli tcp://192.168.1.5:33677 --name dummy-name --nthreads 1 --memory-limit 0.93GiB --death-timeout 60

I am a new contributor so please let me know if I am missing anything obvious.

hmacdope commented 1 year ago

Another option we could pursue is to just call them as scripts, without python -m. Let me know what you think.
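For comparison, the two invocation styles could be sketched as follows. This is only an illustration: `build_worker_command` and its parameters are hypothetical names, not part of dask-jobqueue, and the sketch assumes the script form relies on the executable being on PATH.

```python
import shlex
import sys
from typing import Sequence


def build_worker_command(
    target: str,
    scheduler_address: str,
    extra_args: Sequence[str] = (),
    as_module: bool = True,
    python: str = sys.executable,
) -> str:
    """Build a worker launch command line (hypothetical helper).

    as_module=True  -> "<python> -m <target> ...", where target is a module path.
    as_module=False -> "<target> ...", where target is a script found on PATH.
    """
    prefix = [python, "-m", target] if as_module else [target]
    # shlex.join quotes any token that needs it, so the result is shell-safe.
    return shlex.join(prefix + [scheduler_address] + list(extra_args))


address = "tcp://192.168.1.5:33677"
module_cmd = build_worker_command(
    "dask_cuda.cli", address, ["--nthreads", "1"], python="python"
)
script_cmd = build_worker_command(
    "dask-cuda-worker", address, ["--nthreads", "1"], as_module=False
)
print(module_cmd)
print(script_cmd)
```

The module form pins the interpreter explicitly, while the script form depends on whichever environment's script comes first on PATH.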

hmacdope commented 1 year ago

@guillaumeeb hopefully I have addressed your comments.

guillaumeeb commented 1 year ago

We'll also wait till the CI is green!

hmacdope commented 1 year ago

> This looks good to me! We don't need to wait for the CUDA issue to be fixed to merge, right?

It is merged!

> Also, if at some point you could add a bit of documentation about using this new option, this would be really nice!

Yes, I will raise an issue.

hmacdope commented 1 year ago

@guillaumeeb any idea why the container builds are failing?

guillaumeeb commented 1 year ago

Nope, I need to check these. This might not be a problem, but the failing CI / build (none) job is. Could you try to check this one?

guillaumeeb commented 1 year ago

I'm not sure what the problem is with the build. It looks like the image takes too much time to build. Weird, more than 6 hours!

jacobtomlinson commented 1 year ago

dask/distributed recently dropped Python 3.8, and I noticed that the failing CI uses it. Maybe that is the problem?

https://github.com/dask/distributed/blob/129b7cb70e2b77f4e13e27aefe9a7dbfc31a53e4/pyproject.toml#L26

hmacdope commented 1 year ago

@guillaumeeb need anything more from me here?

hmacdope commented 1 year ago

@guillaumeeb @jacobtomlinson @lesteve it would be great to get this finalised if possible.

guillaumeeb commented 1 year ago

Sorry for the delay here @hmacdope, I wanted to check the CI but didn't have the time. I'll merge anyway. Would you need an official release at some point?

hmacdope commented 1 year ago

> Sorry for the delay here @hmacdope, I wanted to check the CI but didn't have the time. I'll merge anyway. Would you need an official release at some point?

Thanks so much @guillaumeeb! No worries, everyone is busy. :) All good without an official release.