dask / dask-drmaa

Deploy Dask on DRMAA clusters
BSD 3-Clause "New" or "Revised" License

workers can't reach multi-homed scheduler #28

Open dkmichaels opened 7 years ago

dkmichaels commented 7 years ago

Using SGE and wanting to initialize dask fully from python.

My main node is dual-homed, with 1G and 10G interfaces. The 10G is the one that my SGE cluster uses.

from dask_drmaa import DRMAACluster
from dask.distributed import Client

In [9]: cluster = DRMAACluster(hostname='master-10g')
INFO:dask_drmaa.core:Start local scheduler at master-10g

In [10]: cluster.scheduler_address
Out[10]: 'tcp://10.22.150.194:37386'  # this is the master-1g IP, not the one I want

Meanwhile, the workers are spinning trying to connect to the 1G IP:

tail worker.23523.1.err
distributed.worker - INFO - Trying to connect to scheduler: tcp://10.22.150.194:37386

Can this be extended to allow one to specify the scheduler interface / hostname / IP to give to the workers?

mrocklin commented 7 years ago

That should be doable. When deploying dask from the command line, this is done with the --interface keyword (see https://stackoverflow.com/questions/43881157/how-do-i-use-an-infiniband-network-with-dask ). We probably just need to expose this through the dask-drmaa interface or, better yet, help users pass through any option.

dkmichaels commented 7 years ago

Here's the workaround I hacked together -- suggestions for improvement welcome:

Replace these lines (note the first line has no effect in the current code):

def create_job_template(...):
        ...
        args = template['args']
        args = [self.scheduler_address] + template['args']
        ...

with:

        # replace the scheduler's 1G IP with its 10G IP
        args = [self.scheduler_address.replace('10.22.150.194', '10.22.250.1')]
        args = args + template['args']

Hardcoding IPs allows me to proceed with my testing, but this is really a hack.
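A slightly less brittle variant of the same hack is to rewrite only the host part of the address and keep whatever port the scheduler picked. The sketch below is illustrative only: `with_host` is a hypothetical helper, not part of dask-drmaa, and it assumes the scheduler address always has the `scheme://host:port` shape shown above.

```python
from urllib.parse import urlsplit


def with_host(address: str, new_host: str) -> str:
    """Return `address` with its host swapped for `new_host`.

    Hypothetical helper (not part of dask-drmaa). Assumes an address of
    the form 'scheme://host:port', e.g. 'tcp://10.22.150.194:37386'.
    """
    parts = urlsplit(address)
    if not parts.scheme or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {address!r}")
    # Keep the scheme and port chosen by the scheduler; replace only the host.
    return f"{parts.scheme}://{new_host}:{parts.port}"


# The 1G address reported by the scheduler, rewritten to the 10G IP.
print(with_host("tcp://10.22.150.194:37386", "10.22.250.1"))
```

Inside create_job_template this would replace the string `.replace(...)` call, for example `args = [with_host(self.scheduler_address, '10.22.250.1')] + template['args']`. The 10G IP could also be looked up with `socket.gethostbyname('master-10g')` instead of being hardcoded, assuming that name resolves to the 10G interface on the main node.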

jakirkham commented 6 years ago

Sometimes using the nativeSpecification argument to DRMAA resolves issues like this. Would need to play around on your cluster and/or ask admins to know for sure.