JuliaLang / Distributed.jl

Create and control multiple Julia processes remotely for distributed computing. Ships as a Julia stdlib.
https://docs.julialang.org/en/v1/stdlib/Distributed/
MIT License

Cross platform issues with Remote Workers / SSH Cluster Manager / Native Dependencies #22

Open habemus-papadum opened 9 years ago

habemus-papadum commented 9 years ago

Hi -- Using a head node (i.e. procid == 1) that is a mac on v0.3.6, I am trying to use linux based workers using SSHClusterManager.

I experience problems with e.g. using HDF5 --- the basic cause seems to be that

so my linux boxes complain when they cannot locate the mac dylib

I've thought a bit about how to resolve this, but nothing obvious and elegant pops to mind. (For now I've just hacked my deps.jl on my mac to support both OS X and linux)

Have others seen this kind of issue? Is there some simple way to have the workers not pull code from node1 but simply rely on the locally installed packages?

I was thinking of hacking include_from_node1 in the .juliarc.jl on the linux boxes to simply not pull code from node1, but that seems a bit drastic -- any thoughts about whether this would work?

As an aside, while I can understand the motivation for having include, using, etc. work by delegating to node1 (e.g., to simplify code distribution), it does seem a bit difficult to do robustly or in a way that will scale nicely to dozens or hundreds of workers....

thanks

JeffBezanson commented 9 years ago

I believe it is possible to run all nodes, including node 1, remotely, and connect your local REPL to the remote node, thereby avoiding mixed-platform issues. @Keno are there instructions on how to do this?

habemus-papadum commented 9 years ago

Hi,

Thanks, I actually saw the thread where the "repl into a remote" stuff was first done (Very cool!). I believe the link is: https://github.com/JuliaLang/julia/issues/3655

There is also just starting node 1 via ssh/tmux, or running a remote IJulia....

For my particular workflow, I want to use the remote boxes to condense and summarize a large amount of distributed data and then deliver it to my local box for more detailed analysis, keeping the entire flow as interactive as possible and going back and forth many times. Ultimately, the point is to flexibly get interesting subsets of data from one site to another, which makes a purely remote solution not so good.

I managed to hack things enough to get it to work for now. I don't endorse what follows, but just so it's documented in case others come across this issue:

    # Load dependencies
    @osx_only  begin 
      @checked_lib libhdf5 "/usr/local/lib/libhdf5.dylib"
    end

    @linux_only begin
      @checked_lib libhdf5 "/usr/lib64/libhdf5.so.7"
    end

(The exact details will depend on what version of hdf5 you have installed and so forth)


So that is quite horrible and will likely crumble with every little change. On the flip side, I was able to drive 60 workers on 7 boxes from my mac, and everything seemed to work amazingly well in terms of connection times, throughput, and so forth, so, for what it's worth, I'm a happy customer!
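For readers on a current Julia release: the `@osx_only` and `@linux_only` macros used in the `deps.jl` hack above are no longer available. A rough sketch of the same per-platform selection with `Sys.isapple()` / `Sys.islinux()` and the `Libdl` stdlib follows. The two paths are just the example paths from the hack, and the names `HDF5_CANDIDATES`, `hdf5_library_path`, and `try_load_hdf5` are made up for illustration; this is not how HDF5.jl or BinDeps actually locate the library.

```julia
using Libdl

# Example paths copied from the deps.jl hack above; adjust to the local install.
const HDF5_CANDIDATES = Dict(
    :apple => "/usr/local/lib/libhdf5.dylib",
    :linux => "/usr/lib64/libhdf5.so.7",
)

# Pick the candidate path for the OS that this process is running on.
function hdf5_library_path()
    if Sys.isapple()
        return HDF5_CANDIDATES[:apple]
    elseif Sys.islinux()
        return HDF5_CANDIDATES[:linux]
    else
        error("no libhdf5 path configured for $(Sys.KERNEL)")
    end
end

# Try to open the library without throwing.
# Returns a handle on success, or `nothing` if the library could not be loaded.
function try_load_hdf5()
    return Libdl.dlopen(hdf5_library_path(); throw_error = false)
end
```

The branch is taken in whichever process evaluates it, so a Linux worker should end up with the `.so` path even when node 1 is a Mac. It is still a hard-coded path and just as fragile as the original hack.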

In case anyone is interested, it turns out remote workers slurp .juliarc.jl from node1, which has the potential for many odd issues...

thanks!

rened commented 9 years ago

I tried to work around this in https://github.com/JuliaLang/BinDeps.jl/pull/130 but have not (yet) completely followed through on revising it per the comments. I'd also be interested in making the cross-platform experience as seamless as possible - I'll try to continue with that PR as soon as I can.

ViralBShah commented 9 years ago

Cc @amitmurthy

habemus-papadum commented 9 years ago

I've been driving linux boxes from a mac for a few weeks now, and despite my ridiculous hacks it's been extremely useful.

My two cents is that rather than adjusting BinDeps and other packages to work around these issues, it might be better/cleaner to be able to launch workers with a command line switch that has them simply load julia code from their local drive rather than slurping from node1 -- I'm already rsync'ing datasets and non-julia code to various nodes so there is not much convenience gained on my end by the current behavior.
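For anyone landing here on a current Julia release, something close to this can be approximated with existing `addprocs` keywords rather than a new switch: `exename` and `dir` point each SSH worker at the Julia binary and working directory on its own machine. A minimal sketch, assuming passwordless SSH is set up; the helper name `add_remote_workers`, the host names, and the paths are all placeholders, and whether packages then resolve purely from each worker's local install should be checked against the Distributed docs for your version.

```julia
using Distributed

# Hypothetical helper: start `n` workers on each host over SSH, each using the
# Julia binary (`exename`) and working directory (`dir`) found on that host.
function add_remote_workers(hosts::Vector{String}, n::Int;
                            exename::String = "julia",
                            dir::String = "/tmp")
    machines = [(h, n) for h in hosts]   # (machine_spec, worker_count) pairs
    return addprocs(machines; exename = exename, dir = dir, tunnel = true)
end

# Usage at top level (placeholder hosts and paths):
# add_remote_workers(["user@linux-box-1", "user@linux-box-2"], 8;
#                    exename = "/usr/local/bin/julia", dir = "/home/user")
# @everywhere using HDF5   # expected to load from each worker's own packages
```

The call returns the ids of the new workers, so the result can be passed straight to `rmprocs` when the session is done.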

kshyatt commented 7 years ago

Is this still a pain to get working?

tkelman commented 7 years ago

Yes. It comes up on Discourse every few weeks.