dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

Allow to start a Scheduler in a batch job #186

Open guillaumeeb opened 5 years ago

guillaumeeb commented 5 years ago

One of the goals of the ClusterManager object is to be able to launch a remote scheduler. In the dask-jobqueue scope, this probably means submitting a job that starts a Scheduler, and then connecting to it.

We probably still lack some remote interface between the ClusterManager and the scheduler object for this to work, so it will probably mean extending APIs upstream.

Identified Scheduler methods to provide:

I suspect that adaptive will need to change significantly too; this may lead to a transitional adaptive logic in dask-jobqueue, and other remote functions to add to the scheduler.

This is in the scope of #170.

guillaumeeb commented 4 years ago

With #306 we've got a lot of what is needed to start a Scheduler in a dedicated batch job.

I also believe we've got everything that is needed on the Distributed side too, as dask-kubernetes doesn't use a LocalCluster anymore; see https://github.com/dask/dask-kubernetes/pull/162.

mrocklin commented 4 years ago

Yes, everything should be set up. We probably need to abstract out the FooJob classes to split between scheduler and workers. All of the SpecCluster infrastructure is there though.


lesteve commented 4 years ago

Just curious: could you elaborate a bit more on why this would be useful? I have some guesses, but I just want to make sure they are somewhat accurate.

mrocklin commented 4 years ago

In practice I doubt that the scheduler will be expensive enough that system administrators will care. They all ask about this, but I don't think that it will be important in reality.

Another reason to support this is for networking rules. In some systems (10-20%?) compute nodes are unable to connect back to login nodes. So here placing the scheduler on a compute node, and then connecting to that node from the client/login node is nice.

It may be though that this is a frequently requested but not actually useful feature.

lesteve commented 4 years ago

Another reason to support this is for networking rules. In some systems (10-20%?) compute nodes are unable to connect back to login nodes.

In #354 @orbitfold seems to be in this particular case (at least for some clusters he has access to). @mrocklin could you give a few pointers that would help getting started on implementing a remote scheduler? (My understanding is that using SpecCluster will help, but I never had the time to investigate why exactly ...)

mrocklin commented 4 years ago

We need to make an object that follows this interface:

https://github.com/dask/distributed/blob/e7a2e6d41e0b719866769713d8f41cb5fcfbf6e8/distributed/deploy/spec.py#L22-L30

The simplest example today is probably SSHCluster: https://github.com/dask/distributed/blob/master/distributed/deploy/ssh.py

But rather than start things with SSH, it would presumably submit a job.
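
To make that concrete, here is a rough, hypothetical sketch of such an object: it follows the ProcessInterface contract from distributed.deploy.spec (async start/close plus an address attribute) and submits a batch job that runs dask-scheduler instead of starting it over SSH. The qsub call, the job script name, and the scheduler-file handshake are assumptions for illustration only, not an existing dask-jobqueue API.

```python
# Hypothetical sketch only: a scheduler started by submitting a batch job,
# following the ProcessInterface contract from distributed.deploy.spec.
# "start-scheduler.sh" and the scheduler-file handshake are assumptions.
import asyncio
import json
import subprocess

from distributed.deploy.spec import ProcessInterface


class JobQueueScheduler(ProcessInterface):
    """Submit a job that runs dask-scheduler and wait for its address."""

    def __init__(self, scheduler_file="scheduler.json"):
        self.scheduler_file = scheduler_file
        self.job_id = None
        super().__init__()

    async def start(self):
        # Submit a (hypothetical) job script that runs:
        #   dask-scheduler --scheduler-file scheduler.json
        # A real implementation would build this script much like the FooJob
        # classes build worker job scripts today.
        out = subprocess.check_output(["qsub", "start-scheduler.sh"])
        self.job_id = out.decode().strip()

        # Poll the shared filesystem until the scheduler has written its address.
        while not self.address:
            try:
                with open(self.scheduler_file) as f:
                    self.address = json.load(f)["address"]
            except (OSError, ValueError, KeyError):
                await asyncio.sleep(1)
        await super().start()

    async def close(self):
        if self.job_id:
            subprocess.run(["qdel", self.job_id], check=False)
        await super().close()
```

Such a class could then go into the scheduler entry of a SpecCluster spec, next to the existing FooJob classes for the workers.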

lesteve commented 4 years ago

Thanks a lot!

FWIW, another report in #355 where @hawk-sf seems to have some limitation on the ports allowed for connecting the login node and the workers.

manuel-rhdt commented 4 years ago

In practice I doubt that the scheduler will be expensive enough that system administrators will care. They all ask about this, but I don't think that it will be important in reality.

Another reason to support this is for networking rules. In some systems (10-20%?) compute nodes are unable to connect back to login nodes. So here placing the scheduler on a compute node, and then connecting to that node from the client/login node is nice.

It may be though that this is a frequently requested but not actually useful feature.

I am currently using dask-joblib on a PBS cluster and running the scheduler on the login node. It is indeed a bit problematic because the login node has only 2 GB of memory, and it quickly runs out if I am not careful with the size of the computation graphs.

So I think I would definitely benefit from this feature.

lesteve commented 4 years ago

Interesting, thanks for this use case! 2 GB is certainly very small, even more so when shared between all the cluster users. Are there some other nodes you can ssh to for heavier work, e.g. compilation of C++ code? On some clusters I am familiar with they are called devel nodes, but the naming may well be different on your cluster.

One work-around in your use case is to start an interactive job in which you launch your Dask scheduler, i.e. you run the Python script that creates your PBSCluster. Beware: if you lose your Dask scheduler because your interactive job expires, you lose all your computations; see https://distributed.dask.org/en/latest/resilience.html#scheduler-failure for more details.
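
For illustration, once you are inside the interactive job on a compute node, the script you run there is an ordinary dask-jobqueue script; a minimal sketch (the resource numbers are made up, not recommendations):

```python
# Minimal sketch, meant to be run from the shell of an interactive job
# (e.g. obtained with "qsub -I" on PBS), so the scheduler lives on a compute node.
from dask.distributed import Client
from dask_jobqueue import PBSCluster

cluster = PBSCluster(cores=4, memory="16GB", walltime="02:00:00")  # illustrative values
cluster.scale(2)          # submit worker jobs as usual
client = Client(cluster)  # the scheduler runs in this interactive-job process

# ... your computations here; everything is lost if the interactive job expires
```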

manuel-rhdt commented 4 years ago

The cluster I am using is very small and has only one type of compute node. I am not sure what you mean by an "interactive" job. Maybe what you mean is that I should start the PBSCluster from a compute node. This does not work, however, since it is impossible to submit new PBS jobs from a compute node (it can only be done from the login node).

The current workaround for me is to use a scheduler file to set up the communication between scheduler and workers using the shared file system (without using dask-jobqueue).

Ideally, I could run a Python script on the login node that launches PBS jobs for both dask-scheduler and dask-worker, and then connect to the dask-scheduler from my local workstation to submit my computations (using an SSH tunnel).

lesteve commented 4 years ago

The cluster I am using is very small and has only one type of compute node. I am not sure what you mean by an "interactive" job.

An interactive job means you submit a job through your job scheduler and you end up in an interactive shell on a compute node. Look at this for an example on a cluster that uses PBS.

This does not work however, since it is impossible to submit new PBS jobs from a compute node (it can only be done from the login node).

This is a mildly annoying restriction if you ask me; maybe try to talk to IT to see whether they would be willing to lift it. People can still do it if they want using ssh login-node qsub. There was an issue where modifying submit_cmd to do that was reported to work, see https://github.com/dask/dask-jobqueue/issues/257#issuecomment-479556603.
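
Roughly, the idea from that comment could look like the sketch below. The class layout assumes a recent dask-jobqueue where the job class is separate from the cluster class, and login-node is a placeholder hostname, so treat this as an illustration rather than a tested recipe.

```python
# Sketch: route job submission and cancellation through the login node over SSH.
# Assumes passwordless SSH to "login-node" (placeholder), a dask-jobqueue version
# where PBSJob exposes submit_command/cancel_command class attributes, and that
# the generated job script is visible from the login node (shared filesystem).
from dask_jobqueue.pbs import PBSCluster, PBSJob


class SSHPBSJob(PBSJob):
    submit_command = "ssh login-node qsub"
    cancel_command = "ssh login-node qdel"


class SSHPBSCluster(PBSCluster):
    job_cls = SSHPBSJob
```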

The current workaround for me is to use a scheduler file to set up the communication between scheduler and workers using the shared file system (without using dask-jobqueue).

I am guessing you mean https://docs.dask.org/en/latest/setup/hpc.html#using-a-shared-network-file-system-and-a-job-scheduler. This seems like a reasonable work-around.
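
For reference, that workflow boils down to pointing the scheduler, the workers, and the client at the same file on the shared filesystem; a minimal sketch (the path and the job scripts are illustrative):

```python
# Shared-filesystem workflow sketch (no dask-jobqueue involved).
# In one batch job:      dask-scheduler --scheduler-file /shared/scheduler.json
# In N other batch jobs: dask-worker   --scheduler-file /shared/scheduler.json
# Then, from wherever you can reach the scheduler:
from dask.distributed import Client

client = Client(scheduler_file="/shared/scheduler.json")
```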

I connect to dask-scheduler from my local workstation to submit my computations (using a ssh tunnel).

I have had other people tell me a similar thing (third bullet point of https://github.com/dask/dask-jobqueue/issues/186#issuecomment-542027057). If you manage to make that work, please let us know (ideally in a separate issue).
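
For what it's worth, the tunnelled-client idea would look roughly like this; hostnames are placeholders, and whether it works at all depends on your cluster's network rules, so this is a sketch rather than a recommendation.

```python
# Hypothetical sketch: forward the scheduler port over SSH, then connect locally.
# On your workstation, first run something like:
#   ssh -N -L 8786:<scheduler-node>:8786 user@login-node
from dask.distributed import Client

client = Client("tcp://localhost:8786")  # 8786 is the default scheduler port
```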

manuel-rhdt commented 4 years ago

An interactive job means you submit a job through your job scheduler and you end up in an interactive shell on a compute node. Look at this for an example on a cluster that uses PBS.

Thank you! I didn't know this and it is very useful!

This is a mildly annoying restriction if you ask me; maybe try to talk to IT to see whether they would be willing to lift it. People can still do it if they want using ssh login-node qsub. There was an issue where modifying submit_cmd to do that was reported to work, see #257 (comment).

I'll see if I can talk to IT next year!

I am guessing you mean https://docs.dask.org/en/latest/setup/hpc.html#using-a-shared-network-file-system-and-a-job-scheduler. This seems like a reasonable work-around.

Yes this is what I meant. It does work well enough in my case because I do not need adaptive scaling of workers (yet). It is a little unfortunate to have to start the cluster separately (not from the notebook that I use for my computations).

lesteve commented 4 years ago

@muammar I see that you have commented in https://github.com/dask/dask-jobqueue/pull/390#issuecomment-603558844. Could you please explain the admin rules that are in place on your cluster, just to get an idea of what you are allowed to do?

You may be interested in my answer above: https://github.com/dask/dask-jobqueue/issues/186#issuecomment-568265386. Let me try to sum up:

  1. Launch the scheduler (i.e. create the Cluster object) in an interactive job: probably the easier option. If you like to work in a Jupyter environment, this is doable this way; there are a few hoops to jump through (mostly SSH port forwarding to open your Jupyter notebook in your local browser at localhost:<some-port>).
  2. Launch the scheduler (i.e. create the Cluster object) in a batch job: only Python scripts, not a Jupyter environment (see the sketch at the end of this comment).
  3. Start the Dask scheduler and workers by launching the Dask commands yourself: https://docs.dask.org/en/latest/setup/hpc.html#using-a-shared-network-file-system-and-a-job-scheduler

In both 1. and 2. you need to bear in mind that as soon as your scheduler job finishes, you will lose all your workers after ~60s. That may mean losing the results of lengthy computations.
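
For option 2, a hypothetical shape it could take is a plain Python script that you submit as its own batch job, which then creates the cluster and runs the computation inside that job; the filename and resources below are made up.

```python
# run_pipeline.py -- hypothetical script submitted as its own batch job (option 2).
# The scheduler lives inside this job; when the job ends, the workers and any
# in-flight results are lost (see the scheduler-failure note earlier in the thread).
from dask.distributed import Client
from dask_jobqueue import PBSCluster


def main():
    cluster = PBSCluster(cores=4, memory="16GB", walltime="04:00:00")  # illustrative
    cluster.scale(4)
    client = Client(cluster)

    # ... build and run your real computations here ...
    result = client.submit(sum, range(10)).result()
    print(result)


if __name__ == "__main__":
    main()
```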