dask / dask-jobqueue

Deploy Dask on job schedulers like PBS, SLURM, and SGE
https://jobqueue.dask.org
BSD 3-Clause "New" or "Revised" License

Dask with jobqueue not using multiple nodes #206

Closed npaulson closed 5 years ago

npaulson commented 5 years ago

I am trying to use Dask to do parallel processing on multiple nodes of a supercomputing resource, yet the Dask-distributed map only takes advantage of one of the nodes. Note that I put this up on Stack Overflow but it didn't get any attention, so now I'm giving it a go here.

Here is a test script I am using to set up the client and perform a simple operation:

import time
from distributed import Client
from dask_jobqueue import SLURMCluster
from socket import gethostname

def slow_increment(x):
    time.sleep(10)
    return [x + 1, gethostname(), time.time()]

cluster = SLURMCluster(
    queue='somequeue',
    cores=2,
    memory='128GB',
    project='someproject',
    walltime='00:05:00',
    job_extra=['-o myjob.%j.%N.out',
               '-e myjob.%j.%N.error'],
    env_extra=['export I_MPI_FABRICS=dapl',
               'source activate dask-jobqueue'])

cluster.scale(2)

client = Client(cluster)

A = client.map(slow_increment, range(8))
B = client.gather(A)

print(client)

for res in B:
    print(res)

client.close()

And here is the output:

<Client: scheduler='tcp://someip' processes=2 cores=4>
[1, 'bdw-0478', 1540477582.6744401]
[2, 'bdw-0478', 1540477582.67487]
[3, 'bdw-0478', 1540477592.68666]
[4, 'bdw-0478', 1540477592.6879778]
[5, 'bdw-0478', 1540477602.6986163]
[6, 'bdw-0478', 1540477602.6997452]
[7, 'bdw-0478', 1540477612.7100565]
[8, 'bdw-0478', 1540477612.711296]

While the printed client info indicates that Dask has the correct number of nodes (processes) and tasks per node (cores), the socket.gethostname() output and timestamps indicate that the second node isn't used. I do know that dask-jobqueue successfully requested two nodes, and that both jobs complete at the same time. I tried using different MPI fabrics for inter- and intra-node communication (e.g. tcp, shm:tcp, shm:ofa, ofa, ofi, dapl), but this did not change the result. I also tried removing the "export I_MPI_FABRICS" command and using the "interface" option, but this caused the code to hang.
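
For reference, a quick way to double-check where the workers actually landed (a hedged diagnostic sketch, not part of the original report; it assumes the client from the script above is connected) is to ask the scheduler for the host of each worker:

# Diagnostic sketch: list the address, host, and thread count of every worker
# the scheduler currently knows about.
info = client.scheduler_info()
for addr, w in info['workers'].items():
    print(addr, w.get('host'), w.get('nthreads', w.get('ncores')))

If both SLURM jobs really joined the cluster, two distinct hosts should show up here.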

Thanks in advance for any assistance.

-Noah

fabiansvara commented 4 years ago

Finally found some time to experiment with this.

Here's my complete testing code:

from dask_jobqueue import SLURMCluster
from dask.distributed import Client
import time
import itertools as it

def _worker(worker_id):
    print(f'{worker_id=} starting.')
    time.sleep(10)
    print(f'{worker_id=} done.')

def dask_map(wait=True, with_resource=True, cores=20, workers=5):
    if with_resource:
        extra = ['--resources GPU=1']
        resources = {'GPU': 1}
    else:
        extra = []
        resources = {}

    cluster = SLURMCluster(
        cores=cores,
        processes=1,
        memory='188GB',
        walltime='24:00:00',
        queue='my.queue',
        shebang='#!/bin/bash -l',
        local_directory='/my/local/dir/',
        extra=extra,
        job_extra=['--gres=gpu:1', ],
        env_extra=[
            'module load cuda',
            'module load anaconda',
            'conda init bash',
            'conda activate /my/conda/env',])
    cluster.scale(workers)
    client = Client(cluster)

    if wait:
        client.wait_for_workers(workers)
    f = client.map(_worker, list(range(workers)), resources=resources)
    [xx.result() for xx in f]

    cluster.scale(n=0, jobs=0)
    cluster.close()

if __name__ == '__main__':
    for wait, with_resource, cores in it.product([False, True], [True, False], [20, 1]):
        t0 = time.time()
        dask_map(wait=wait, with_resource=with_resource, cores=cores)
        total_t = time.time() - t0
        print(f'{wait=}, {with_resource=}, {cores=} took {total_t:.2f} s')

This prints the following:

wait=False, with_resource=True, cores=20 took 55.11 s
wait=False, with_resource=True, cores=1 took 14.09 s
wait=False, with_resource=False, cores=20 took 13.84 s
wait=False, with_resource=False, cores=1 took 14.29 s
wait=True, with_resource=True, cores=20 took 15.28 s
wait=True, with_resource=True, cores=1 took 14.45 s
wait=True, with_resource=False, cores=20 took 14.58 s
wait=True, with_resource=False, cores=1 took 14.43 s

(the individual lines are separated by

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError

... I removed that from the output above for clarity. This seems to be caused by the creation/closing of the cluster in the loop; it doesn't influence the results, which are identical when this is not run in a loop.)
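
One way to avoid that reconnect noise, assuming it comes from tearing the scheduler down while the client is still connected, would be to close the client before scaling down and closing the cluster. A sketch of the end of dask_map:

    # Sketch: close the client first so it doesn't try to reconnect to a
    # scheduler that is being shut down, then scale to zero and close.
    client.close()
    cluster.scale(n=0, jobs=0)
    cluster.close()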

Anyway, as you can see, the issue only occurs in the wait=False, with_resource=True, cores=20 case, where the jobs run sequentially. The workers are all there, but only one of them is doing anything. This is what it looks like in the dashboard:

[Dask dashboard screenshot: only one worker is executing tasks]

lesteve commented 4 years ago

Thanks a lot for doing this! So it looks like work-stealing with resources does happen when cores=1, which is good to confirm.

The case where it does not work is when you have cores=20. I am not an expert on the work-stealing logic, but my guess is that work stealing does not kick in because the scheduler sees 20 slots on the first node (one slot per core) and only 5 tasks to execute, so the worker doesn't look busy and stealing doesn't seem worth it. It could be interesting to confirm this hypothesis by submitting many more short tasks and seeing whether work stealing happens. It could also be worth getting some input from @mrocklin to make sure my hypothesis seems reasonable to him ...
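
A quick sketch of that test (assuming the cluster, client, and resources dict from the script above; the 500-task count and 0.5 s sleep are arbitrary):

import time
from socket import gethostname

def short_task(i):
    time.sleep(0.5)
    return gethostname()

# Submit many more tasks than one worker's 20 slots so the scheduler has a
# clear incentive to spread or steal work across the other workers.
futures = client.map(short_task, range(500), resources=resources)
hosts = set(client.gather(futures))
print(f'{len(hosts)} distinct host(s): {hosts}')

If work stealing kicks in, more than one host should appear in the output.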

Maybe this is something that can be improved in distributed, not sure. I guess it should be possible to create a reproducer with distributed.SSHCluster so that more people are likely to look at it.

fabiansvara commented 4 years ago

I looked into the documentation but couldn't find out how to create an SSHCluster with resources, so I'm not sure I can be of any help there. Should I create an issue specifically about work stealing on the distributed tracker?
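
For what it's worth, one possible way to attach resources with distributed's SSHCluster (an untested sketch; it assumes worker_options is forwarded to the workers, and the host names are placeholders):

from dask.distributed import SSHCluster, Client

# Untested sketch: the first host runs the scheduler, the remaining hosts each
# run one worker advertising a GPU=1 resource. Host names are placeholders.
cluster = SSHCluster(
    ["scheduler-host", "worker-host-1", "worker-host-2"],
    worker_options={"nthreads": 20, "resources": {"GPU": 1}},
)
client = Client(cluster)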

appassionate commented 1 year ago

@fabiansvara @lesteve Hi, sorry for reviving this old issue, but is there any news on this use of dask-jobqueue? In my case, I want to map multiple "MPI" jobs through the client after scaling the cluster with cluster.scale.

The Dask scheduler seems to schedule tasks based on the total number of threads across the workers. If cores is 32 (the number of cores per node on my HPC system), there will be 32 threads in a worker, and each client.submit call will (I guess) be executed on one of those threads. So the trick is to set cores to 1 so the scheduler thinks there is only one slot per worker, while the task running in that worker actually launches an MPI job; maybe that relates to the "work stealing" you were talking about? Thanks.

Here is my cluster config.

cluster = SLURMCluster(
    n_workers=1,
    processes=1,
    queue=queue,
    interface="ib0",
    walltime=max_run_time,
    cores=1,  # trick: advertise a single slot so the scheduler runs one task per worker
    memory="100 GB",
    job_extra_directives=[f'--ntasks-per-node={core_per_node}'],
    # job_cpu=1,
)
print(cluster.job_script())

Is there a way to make the Dask scheduler assign submitted tasks by the number of workers rather than by the number of threads? Any suggestions would be helpful, thank you! :)
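
For reference, a sketch of an alternative along the lines of the --resources approach discussed earlier in this thread, using dask-jobqueue >= 0.8 parameter names; the queue, walltime, resource name mpi_slot, and my_mpi_program are placeholders:

import subprocess
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

def run_mpi_case(case):
    # Placeholder wrapper: launch one MPI program per task via srun.
    return subprocess.run(['srun', './my_mpi_program', str(case)]).returncode

cluster = SLURMCluster(
    queue='somequeue',
    cores=32,                      # threads the worker advertises
    processes=1,
    memory='100 GB',
    walltime='02:00:00',
    job_extra_directives=['--ntasks-per-node=32'],
    # Advertise a single 'mpi_slot' per worker so only one task runs at a time.
    worker_extra_args=['--resources', 'mpi_slot=1'],
)
cluster.scale(jobs=4)
client = Client(cluster)

# Each task consumes the worker's single mpi_slot, so at most one MPI launcher
# runs per worker even though the worker advertises 32 threads.
futures = client.map(run_mpi_case, range(8), resources={'mpi_slot': 1})
results = client.gather(futures)

This keeps cores at the real per-node value while still limiting each worker to one running task.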

ocaisa commented 1 year ago

@appassionate We created a library that does what you are asking for but I admit it requires quite a bit of configuration since you need to tell it about the system and how you launch MPI jobs there. The library is at https://github.com/E-CAM/jobqueue_features and there's a tutorial at https://github.com/E-CAM/jobqueue_features_workshop_materials (and you can find a recording of the tutorial at https://www.youtube.com/watch?v=FpMua8iJeTk&ab_channel=E-CAM).

I haven't touched it in a few months, so I need to check whether our CI is still passing. The package works with the latest version of dask-jobqueue (0.8.1).

appassionate commented 1 year ago

> @appassionate We created a library that does what you are asking for but I admit it requires quite a bit of configuration since you need to tell it about the system and how you launch MPI jobs there.

@ocaisa Wow, cool! Many thanks!!

guillaumeeb commented 1 year ago

@appassionate, from your comment it is not really clear to me what you're trying to achieve.

Anyway, if @ocaisa's answer suits you, that's perfect; if not, I encourage you to open a new issue and try to make your question a bit clearer.

appassionate commented 1 year ago

> @appassionate, from your comment it is not really clear to me what you're trying to achieve.
>
> Anyway, if @ocaisa's answer suits you, that's perfect; if not, I encourage you to open a new issue and try to make your question a bit clearer.

Thanks for your suggestion! jobqueue_features has customized "SLURMCluster" for MPI use; I believe it covers use cases such as running across more nodes in Slurm, which suits my needs.