JuliaParallel / ClusterManagers.jl


Help setting up environment in worker processes #144

Closed: juliohm closed this issue 12 months ago

juliohm commented 3 years ago

@bjarthur or anyone with access to an LSF cluster, could you please try to run this hello-world script:

# setup environment in master process
using Pkg; Pkg.activate(@__DIR__)
Pkg.instantiate(); Pkg.precompile()

# add worker processes to pool
using ClusterManagers
@info "Waiting for resources..."
addprocs_lsf(10)
@info "Starting the job... šŸ™"

using Distributed

# setup environment in all processes
@everywhere begin
  using Pkg; Pkg.activate(@__DIR__)
  Pkg.instantiate(); Pkg.precompile()
end

# ------------
# MAIN SCRIPT
# ------------

println("Hello from Julia")
println("Number of workers: ", nworkers())
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    println("Hello from process $(pid) on host $(host)!")
end

It doesn't work for me because of the block of code that tries to set up the environment in the worker processes:

# setup environment in all processes
@everywhere begin
  using Pkg; Pkg.activate(@__DIR__)
  Pkg.instantiate(); Pkg.precompile()
end

If you comment out this block, you should see the hello world from all processes. Can you advise on how we should set up environments in worker processes? I find it particularly convoluted that we need to first activate on the master, then load ClusterManagers.jl to add processes, then activate on the workers.

DrChainsaw commented 3 years ago

Something like this:

(@v1.5) pkg> activate .
 Activating environment at `~/juliaproj/ClusterManagersTest/Project.toml`

julia> using ClusterManagers

julia> lsfprocs = addprocs_lsf(5)
5-element Array{Int64,1}:
 2
 3
 4
 5
 6

julia> using Distributed

julia> @everywhere lsfprocs begin

         using Pkg;
         @info "using Pkg ok"
         Pkg.activate(@__DIR__)
         @info "activate ok!"
         Pkg.instantiate(); 
         @info "Instantiate ok!"
         Pkg.precompile()
         @info "Precompile ok!"
         using Distributed
         @info "Worker $(myid()) is ready to go!"
         end

julia> for i in workers()
           host, pid = fetch(@spawnat i (gethostname(), getpid()))
           println("Hello from process $(pid) on host $(host)!")
       end
Hello from process 63666 on host <someLSFHOST>!
Hello from process 67479 on host <someLSFHOST>!
Hello from process 65839 on host <someLSFHOST>!
Hello from process 63693 on host <someLSFHOST>!
Hello from process 65864 on host <someLSFHOST>!

julia> rmprocs(lsfprocs)
Task (done) @0x00007fa42121e710

Meanwhile in a bash shell:

$ bpeek -f 553125[1]
<< output from stdout >>
julia_worker:PORT#IP
[ Info: using Pkg ok
[ Info: activate ok!
[ Info: Instantiate ok!
[ Info: Precompile ok!
[ Info: Worker 2 is ready to go!
 Activating environment at `~/juliaproj/ClusterManagersTest/Project.toml`
Precompiling project...

For some reason the stuff after julia_worker does not print until after I call rmprocs.

Also, omitting the process ids in @everywhere gives an annoying REPL printout error which persists and kinda messes up further printing. It does not seem to have anything to do with the task failing, as I get the success messages from each worker. It rather seems to come from the main process, possibly some strange collision between the eval and how Pkg does its logging. Could this be something which happens to you as well and makes it harder to debug?

I agree that having to do remotecall_eval (which @everywhere is basically an alias for) can be pretty cumbersome, making it hard to design user-friendly code which uses Distributed. I wish it were possible to pass a worker flag that says something like "just inherit whatever is loaded on the process which spawns the workers", but I'm sure there is a good reason why it doesn't exist given how convenient it seems it would be.
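For reference, a minimal sketch of that equivalence (assuming Distributed.remotecall_eval, which may not be exported in all versions, hence the qualified name):

using Distributed

# Roughly equivalent ways to define x = 1 on every process:
@everywhere x = 1
Distributed.remotecall_eval(Main, procs(), :(x = 1))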

About setting up the environment in general: for me, the trick that has solved most of my issues is to add exeflags="--project" when spawning workers instead of trying to do so via Pkg through @everywhere. YMMV of course.
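As a concrete sketch of that trick (assuming the script is launched from the project directory, since a bare --project is resolved from the worker's working directory upward):

using ClusterManagers

# Start workers with the project already active at launch, instead of
# calling Pkg.activate on each worker via @everywhere afterwards.
addprocs_lsf(5; exeflags="--project")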

juliohm commented 3 years ago

That is very helpful @DrChainsaw , thanks for the detailed suggestions. My question now is why the code above works fine with addprocs but fails with addprocs_lsf? The results suggest that we could improve addprocs_lsf somehow.

Also, you mentioned the exeflags option. It is available in addprocs but not in addprocs_lsf, correct? Can you elaborate on the suggestion? I personally prefer not to touch these flags and to let the script do the work for the user, as described in https://github.com/juliohm/julia-distributed-computing. I will try to update the example there after we figure out what is happening here.

DrChainsaw commented 3 years ago

The results suggest that we could improve addprocs_lsf somehow.

Maybe, but it might as well be an issue with Distributed. As you can see from the code, there is not a lot going on in lsf.jl. All it basically does is run the bsub and bpeek commands and parse their output; the rest is Distributed afaik. I have not dug deep into Distributed, but I would guess that a lot of things are done differently when the procs are on the same machine compared to when they are on remote machines. I don't think that Distributed communicates with procs on the same machine using TCP sockets, for instance; that would just be like walking over the bridge to fetch water.
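Very roughly, the flow described above might look like this (a sketch only, not the actual lsf.jl code; the flags and job id are illustrative, and the --worker handling is discussed just below):

# 1. bsub submits the workers, with --worker appended to the exeflags
run(`bsub -q myqueue julia --project --worker`)

# 2. bpeek polls each job's stdout until the worker announces itself
out = read(`bpeek 553125`, String)

# 3. Distributed parses the "julia_worker:PORT#IP" line to connect back
m = match(r"julia_worker:(\d+)#(.+)", out)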

It is available in addprocs but not in addprocs_lsf, correct?

addprocs_lsf forwards all keyword arguments it does not declare itself to addprocs, so exeflags is available. If you look at launch in lsf.jl you'll see that there is a bit of special handling to append the --worker flag to the exeflags.

I don't recall now what issue I had for which the only solution seemed to be to add --project to exeflags. Once it solved that issue I kinda just always add it, just to be sure, but for all I know things might work just as well without it. I found the suggestion in some Discourse thread which afaik did not reach any conclusion about it either :(

juliohm commented 3 years ago

The current state of affairs is quite unfortunate. I will try to dive into Distributed, but I am afraid things will take time to be fixed. Meanwhile I am experimenting with exeflags, but it doesn't seem to solve my issues here. I tried exeflags="--project" and exeflags="--project=$(Base.active_project())". Neither of them worked with my real job script.

bjarthur commented 3 years ago

is there a shared filesystem between the workers?

juliohm commented 3 years ago

Yes. I was told it is a GPFS. Is that a problem?


bjarthur commented 3 years ago

if there is, then why do you need to instantiate and precompile? doing so would cause race conditions
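A pattern that sidesteps the race, as a sketch (assuming the shared filesystem and the project layout from the first comment): instantiate and precompile once on the master before any workers exist, then start the workers with the project already active so they never touch Pkg concurrently.

using Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate(); Pkg.precompile()   # done once, before adding workers

using ClusterManagers, Distributed
addprocs_lsf(10; exeflags="--project=$(Base.active_project())")

# Workers start with the project already active; they only load code,
# they never instantiate or precompile in parallel.
@everywhere @info "Worker ready on $(gethostname())"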

juliohm commented 3 years ago

That is a good question. I am assuming it is good practice, since it is not documented anywhere that distributed processes shouldn't instantiate and precompile the environment of the master process. Also, I wonder what would happen when workers live on nodes with different hardware, like different GPUs. Shouldn't each worker compile for its host node?


juliohm commented 3 years ago

Meanwhile, can you please elaborate again on your strategies to debug code? I suspect that an error is occurring deep in a complex pipeline, and I can't see it in stdout with the default LSFManager options. I am only specifying the queue in bsub_flags currently. @DrChainsaw, you suggested -oo, but that seems to inhibit the script from working. @bjarthur, you suggested -Ne, but that relies on e-mail, which I think is not good for debugging.

DrChainsaw commented 3 years ago

I don't have a satisfactory solution for debugging really. I use bpeek and/or -oo (as it seems to work in my environment). Perhaps you could manually write output to a file or send it back to the master process (which ofc is just slightly better than being blind).
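One way to "send it back to the master process", as a hedged sketch (remote_try is a hypothetical helper, not part of ClusterManagers or Distributed):

using Distributed

# Run f on worker pid and return either its result or the full error text,
# so failures surface on the master instead of only in the LSF job output.
function remote_try(f, pid)
    fetch(@spawnat pid try
        (:ok, f())
    catch err
        (:error, sprint(showerror, err, catch_backtrace()))
    end)
end

# Example: remote_try(() -> error("boom"), first(workers()))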

juliohm commented 12 months ago

Closing as not relevant for the general user base.