Closed: npaulson closed this issue 5 years ago.
Finally found some time to experiment with this.
Here's my complete testing code:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
import time
import itertools as it


def _worker(worker_id):
    print(f'{worker_id=} starting.')
    time.sleep(10)
    print(f'{worker_id=} done.')


def dask_map(wait=True, with_resource=True, cores=20, workers=5):
    if with_resource:
        extra = ['--resources GPU=1']
        resources = {'GPU': 1}
    else:
        extra = []
        resources = {}
    cluster = SLURMCluster(
        cores=cores,
        processes=1,
        memory='188GB',
        walltime='24:00:00',
        queue='my.queue',
        shebang='#!/bin/bash -l',
        local_directory='/my/local/dir/',
        extra=extra,
        job_extra=['--gres=gpu:1', ],
        env_extra=[
            'module load cuda',
            'module load anaconda',
            'conda init bash',
            'conda activate /my/conda/env',
        ])
    cluster.scale(workers)
    client = Client(cluster)
    if wait:
        client.wait_for_workers(workers)
    f = client.map(_worker, list(range(workers)), resources=resources)
    [xx.result() for xx in f]
    cluster.scale(n=0, jobs=0)
    cluster.close()


if __name__ == '__main__':
    for wait, with_resource, cores in it.product([False, True], [True, False], [20, 1]):
        t0 = time.time()
        dask_map(wait=wait, with_resource=with_resource, cores=cores)
        total_t = time.time() - t0
        print(f'{wait=}, {with_resource=}, {cores=} took {total_t:.2f} s')
This prints the following:
wait=False, with_resource=True, cores=20 took 55.11 s
wait=False, with_resource=True, cores=1 took 14.09 s
wait=False, with_resource=False, cores=20 took 13.84 s
wait=False, with_resource=False, cores=1 took 14.29 s
wait=True, with_resource=True, cores=20 took 15.28 s
wait=True, with_resource=True, cores=1 took 14.45 s
wait=True, with_resource=False, cores=20 took 14.58 s
wait=True, with_resource=False, cores=1 took 14.43 s
(the individual lines are separated by
distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
... I removed that from the output above for clarity. This seems to be caused by the creation / closing of the cluster in the loop; it doesn't influence the results, which are identical when this is not run in a loop.)
Anyway, as you can see, the issue only occurs in the wait=False, with_resource=True, cores=20
case, where the jobs run sequentially. The workers are all there, but only one of them is doing anything. This is what it looks like in the dashboard:
Thanks a lot for doing this! So it looks like work-stealing with resources does happen when cores=1, which is good to confirm.
The case where it does not work is when you have cores=20. I am not an expert on the work-stealing logic, but my guess is that work stealing does not kick in because the scheduler sees 20 slots on the first node (one slot for each core) with only 5 tasks to execute, so that worker does not look busy and stealing does not seem worthwhile. It could be interesting to confirm this hypothesis by submitting many more short tasks and seeing whether work stealing happens, for example along the lines of the sketch below. It could also be worth getting input from @mrocklin to make sure this hypothesis seems reasonable to him.
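(A minimal sketch of that experiment, reusing the dask_map harness above; the task count and sleep duration are arbitrary choices, not values from the original report.)
import time

def _short_worker(task_id):
    # Many short tasks instead of 5 long ones, to see whether work stealing
    # eventually moves some of them off the first (20-slot) worker.
    time.sleep(1)
    return task_id

# Inside dask_map(), replace the original client.map(...) line with:
futures = client.map(_short_worker, list(range(200)), resources=resources)
results = [ff.result() for ff in futures]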
Maybe this is something that can be improved in distributed, not sure. I guess it should be possible to create a reproducer with distributed.SSHCluster
so that more people are likely to look at it.
I looked into the documentation but couldn't find out how to create an SSHCluster with resources, so I'm not sure I can be of any help there. Should I create an issue specifically about work stealing on the distributed tracker?
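(For reference, a reproducer along the lines suggested above might look roughly like the sketch below. It assumes, without verification, that SSHCluster's worker_options are forwarded to the workers so that the resources entry takes effect; the hostnames and thread count are placeholders.)
import socket
import time
from dask.distributed import Client, SSHCluster

# Sketch only: the first host runs the scheduler, the rest run workers.
cluster = SSHCluster(
    ["scheduler-host", "worker-host-1", "worker-host-2"],
    worker_options={"nthreads": 20, "resources": {"GPU": 1}},
)
client = Client(cluster)

def _worker(worker_id):
    time.sleep(10)
    return worker_id, socket.gethostname()

futures = client.map(_worker, range(5), resources={"GPU": 1})
print(client.gather(futures))  # shows which host each task ran on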
@fabiansvara @lesteve Hi, sorry for bringing this old issue up again; is there any news on this use of dask-jobqueue? In my case, I want to map multiple "MPI" jobs through the client after scaling with cluster.scale.
The Dask scheduler seems to manage tasks by the total number of threads in the workers. If cores is set to 32 (my HPC has 32 cores per node), there will be 32 threads in a worker, and each client.submit call will (I guess) be executed on one of those threads. So the trick is to set cores to 1 so that the scheduler sees only one thread per worker, while in reality the task running on that worker launches an MPI job; maybe this is related to the "work stealing" you were talking about? Thanks.
Here is my cluster config.
cluster = SLURMCluster(
    n_workers=1,
    processes=1,
    queue=queue,
    interface="ib0",
    walltime=max_run_time,
    cores=1,  # trick for using this slurm cluster
    memory="100 GB",
    job_extra_directives=[f'--ntasks-per-node={core_per_node}'],
    # job_cpu=1,
)
print(cluster.job_script())
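For illustration, here is a rough sketch of the pattern described above: each task submitted to this single-slot worker launches its own MPI run. The launcher, binary name, and case arguments are placeholders, not part of the actual setup.
import subprocess
from dask.distributed import Client

client = Client(cluster)

def run_mpi_case(case_dir):
    # The task occupies the worker's single scheduling slot and launches an MPI
    # job on the resources allocated to that worker. 'srun' and 'my_mpi_binary'
    # are placeholders for the real launcher and executable.
    completed = subprocess.run(
        ["srun", f"--ntasks={core_per_node}", "my_mpi_binary", case_dir],
        capture_output=True,
        text=True,
    )
    return completed.returncode

futures = client.map(run_mpi_case, ["case_00", "case_01", "case_02"])
print(client.gather(futures))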
Is there a way to have the Dask scheduler manage submitted tasks by the number of workers rather than by the number of threads? Any suggestions would be helpful, thank you! :)
@appassionate We created a library that does what you are asking for but I admit it requires quite a bit of configuration since you need to tell it about the system and how you launch MPI jobs there. The library is at https://github.com/E-CAM/jobqueue_features and there's a tutorial at https://github.com/E-CAM/jobqueue_features_workshop_materials (and you can find a recording of the tutorial at https://www.youtube.com/watch?v=FpMua8iJeTk&ab_channel=E-CAM).
I haven't touched it in a few months, so I need to check whether our CI is still passing. The package works with the latest version of dask-jobqueue (0.8.1).
@ocaisa Wow, cool! Many thanks!!
@appassionate, with your comment, it is not really clear to me what you're trying to achieve.
Anyway, if @ocaisa answer suits you, this is perfect, if not, I encourage you to open a new issue and try to make your issue a bit clearer.
Thanks for your suggestion! jobqueue_features customizes the SLURMCluster for MPI use cases, and I believe it covers things like multi-node jobs in SLURM, which suits my needs.
I am trying to use Dask to do parallel processing on multiple nodes on supercomputing resources, yet the dask.distributed map only takes advantage of one of the nodes. Note that I posted this on Stack Overflow but it didn't get any attention, so now I'm giving it a go here.
Here is a test script I am using to set up the client and perform a simple operation:
And here is the output:
While printing out the client info indicates that Dask has the correct number of nodes (processes) and tasks per node (cores), the socket.gethostname() output and timestamps indicate that the second node isn't used. I do know that dask-jobqueue successfully requested two nodes, and that both jobs complete at the same time. I tried using different MPI fabrics for inter- and intra-node communication (e.g. tcp, shm:tcp, shm:ofa, ofa, ofi, dapl), but this did not change the result. I also tried removing the "export I_MPI_FABRICS" command and using the "interface" option, but this caused the code to hang.
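As a purely illustrative sketch (not the original test script; the SLURMCluster parameters below are placeholders), one way to check which hosts the mapped tasks actually land on:
import socket
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=24, processes=24, memory='100GB',
                       queue='my.queue', walltime='01:00:00')
cluster.scale(jobs=2)  # request two nodes
client = Client(cluster)

def where_am_i(task_id):
    # Report which host each task actually ran on.
    return socket.gethostname()

hosts = client.gather(client.map(where_am_i, range(48)))
print(set(hosts))  # with both nodes in use, two hostnames should appear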
Thanks in advance for any assistance.
-Noah