JuliaParallel / ClusterManagers.jl

addprocs_lsf does not work in LSF cluster #142

Closed juliohm closed 4 years ago

juliohm commented 4 years ago

It is really unfortunate that the LSFManager isn't working out of the box.

After installing the master branch with ] add ClusterManagers#master, I submitted this trivial main.jl to the cluster:

# activate environment in master process
using Pkg; Pkg.activate(@__DIR__)
Pkg.instantiate(); Pkg.precompile()

# add 10 worker processes to pool
using ClusterManagers
addprocs_lsf(10, bsub_flags=`-q x86_6h -o log.txt`)

# ------------
# MAIN SCRIPT
# ------------

using Distributed

np = nprocs()

println("Hello from Julia")
println("Number of processes: $np")
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    println("Hello from process $(pid) on host $(host)!")
end

by running julia main.jl. The LSF manager launches multiple jobs, but they fail to do the work. Is there an easy fix for this? How can I debug further when the workers don't even print to stdout?

Also, it would be nice to inform users that they cannot pass -J jobname to the bsub_flags option.

julia_worker:9096#9.47.192.190
julia_worker:9897#9.47.192.91
julia_worker:9313#9.47.192.186
julia_worker:9951#9.47.192.74
julia_worker:9506#9.47.192.178
julia_worker:9456#9.47.192.69
julia_worker:9484#9.47.194.151
julia_worker:9267#9.47.194.121
julia_worker:9358#9.47.194.138
julia_worker:9182#9.47.194.175

------------------------------------------------------------
Sender: LSF System <rer@dccxc150>
Subject: Job 870910[5]: <julia-20472[1-10]> in cluster <dcc> Done

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxc150>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:54:46 2020
Results reported at Tue Oct  6 11:54:46 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   1.95 sec.
    Max Memory :                                 146 MB
    Average Memory :                             146.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               902.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   34 sec.
    Turnaround time :                            13 sec.

The output (if any) is above this job summary.

Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

------------------------------------------------------------
Sender: LSF System <rer@dccxc029>
Subject: Job 870910[6]: <julia-20472[1-10]> in cluster <dcc> Exited

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxc029>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:55:37 2020
Results reported at Tue Oct  6 11:55:37 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   1.82 sec.
    Max Memory :                                 145 MB
    Average Memory :                             145.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               903.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   90 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.

------------------------------------------------------------
Sender: LSF System <rer@dccxc051>
Subject: Job 870910[9]: <julia-20472[1-10]> in cluster <dcc> Exited

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxc051>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:55:37 2020
Results reported at Tue Oct  6 11:55:37 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   1.78 sec.
    Max Memory :                                 159 MB
    Average Memory :                             156.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               889.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   81 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.

------------------------------------------------------------
Sender: LSF System <rer@dccxc146>
Subject: Job 870910[3]: <julia-20472[1-10]> in cluster <dcc> Exited

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxc146>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:55:37 2020
Results reported at Tue Oct  6 11:55:37 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   1.76 sec.
    Max Memory :                                 158 MB
    Average Memory :                             158.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               890.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   63 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.

Master process (id 1) could not connect within 60.0 seconds.
exiting.

------------------------------------------------------------
Sender: LSF System <rer@dccxc034>
Subject: Job 870910[7]: <julia-20472[1-10]> in cluster <dcc> Exited

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxc034>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:55:37 2020
Results reported at Tue Oct  6 11:55:37 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   1.79 sec.
    Max Memory :                                 155 MB
    Average Memory :                             155.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               893.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   71 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.

------------------------------------------------------------
Sender: LSF System <rer@dccxc138>
Subject: Job 870910[4]: <julia-20472[1-10]> in cluster <dcc> Exited

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxc138>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:55:37 2020
Results reported at Tue Oct  6 11:55:37 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   1.81 sec.
    Max Memory :                                 152 MB
    Average Memory :                             152.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               896.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   77 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.

------------------------------------------------------------
Sender: LSF System <rer@dccxn031>
Subject: Job 870910[2]: <julia-20472[1-10]> in cluster <dcc> Exited

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxn031>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:55:37 2020
Results reported at Tue Oct  6 11:55:37 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   2.20 sec.
    Max Memory :                                 148 MB
    Average Memory :                             148.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               900.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   80 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.

------------------------------------------------------------
Sender: LSF System <rer@dccxn001>
Subject: Job 870910[1]: <julia-20472[1-10]> in cluster <dcc> Exited

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxn001>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:55:37 2020
Results reported at Tue Oct  6 11:55:37 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   2.41 sec.
    Max Memory :                                 148 MB
    Average Memory :                             148.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               900.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   82 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.

------------------------------------------------------------
Sender: LSF System <rer@dccxn018>
Subject: Job 870910[10]: <julia-20472[1-10]> in cluster <dcc> Exited

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxn018>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:55:37 2020
Results reported at Tue Oct  6 11:55:37 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   2.30 sec.
    Max Memory :                                 150 MB
    Average Memory :                             150.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               898.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   76 sec.
    Turnaround time :                            64 sec.

The output (if any) is above this job summary.

Master process (id 1) could not connect within 60.0 seconds.
exiting.

------------------------------------------------------------
Sender: LSF System <rer@dccxn055>
Subject: Job 870910[8]: <julia-20472[1-10]> in cluster <dcc> Exited

Job <julia-20472[1-10]> was submitted from host <dccxl001> by user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:33 2020
Job was executed on host(s) <dccxn055>, in queue <x86_6h>, as user <juliohm> in cluster <dcc> at Tue Oct  6 11:54:34 2020
</u/juliohm> was used as the home directory.
</u/juliohm/test> was used as the working directory.
Started at Tue Oct  6 11:54:34 2020
Terminated at Tue Oct  6 11:55:39 2020
Results reported at Tue Oct  6 11:55:39 2020

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/u/juliohm/julia-1.5.0/bin/julia --worker=HcEENBJM0JtBKcm8
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   4.49 sec.
    Max Memory :                                 148 MB
    Average Memory :                             148.00 MB
    Total Requested Memory :                     1048.00 MB
    Delta Memory :                               900.00 MB
    Max Swap :                                   -
    Max Processes :                              3
    Max Threads :                                23
    Run time :                                   78 sec.
    Turnaround time :                            66 sec.

The output (if any) is above this job summary.

bjarthur commented 4 years ago

Try it without the -o flag. The Julia master process reads the standard output of the workers to get the IP address and port they're listening on. With -o, the output of all workers gets concatenated into a single file instead. I use the -Ne flag instead.
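To make the mechanism concrete: each worker announces itself with a line such as julia_worker:9096#9.47.192.190 (the format visible in the log.txt contents above), and the master needs to read one such line per worker. The sketch below parses that line; `parse_worker_line` is a hypothetical helper written for illustration, not a function from ClusterManagers.jl or Distributed, and the pattern is an assumption based only on the lines shown in this issue.

```julia
# Hypothetical helper: extract (host, port) from a worker's announcement line.
# The "julia_worker:PORT#HOST" format is taken from the log.txt in this issue.
function parse_worker_line(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(.+)$", strip(line))
    m === nothing && return nothing      # not an announcement line
    port = parse(Int, m.captures[1])
    host = String(m.captures[2])
    return (host = host, port = port)
end

info = parse_worker_line("julia_worker:9096#9.47.192.190")
println(info.host, " ", info.port)

# With -o removed, the call from the original post would read:
#   addprocs_lsf(10, bsub_flags=`-q x86_6h`)
```

If these lines land in a file written by -o rather than on the stream the master is reading, the master has nothing to parse, which would match the "could not connect within 60.0 seconds" messages in the job reports.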

juliohm commented 4 years ago

Amazing. It is working finally! 🙏 So how can we instruct future users about these important details? Flags that are prohibited for it to work? Maybe the code should check the flags passed, and report something like "please do not use -J because we already pass it inside the manager, nor -o because we read stdout from workers"?
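A check like the one proposed here could be sketched as follows. This is only an illustration under stated assumptions: `check_bsub_flags` and `RESERVED_BSUB_FLAGS` are hypothetical names, nothing like them is confirmed to exist in ClusterManagers.jl, and the list of reserved flags reflects only what this thread reports (-J and -o). Whether -o is truly incompatible may depend on the site.

```julia
# Hypothetical pre-flight check, not part of ClusterManagers.jl.
# The reserved list reflects only what this thread reports: -J and -o.
const RESERVED_BSUB_FLAGS = ("-J", "-o")

function check_bsub_flags(flags::Cmd)
    # A Cmd stores its arguments as a vector of strings in `exec`.
    clashes = [f for f in flags.exec if f in RESERVED_BSUB_FLAGS]
    isempty(clashes) && return flags
    listed = join(clashes, ", ")
    throw(ArgumentError(
        "bsub_flags must not contain $listed: " *
        "the manager sets the job name itself and reads worker stdout"))
end

check_bsub_flags(`-q x86_6h`)                 # passes and returns the Cmd
# check_bsub_flags(`-q x86_6h -o log.txt`)    # would throw an ArgumentError
```

Failing early with a message like this would surface the problem on the master, before any jobs are submitted, instead of leaving the user to read ten LSF job reports.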

juliohm commented 4 years ago

Also, I think we need a new release of the package with this working version of the manager. The latest release is 50ish commits behind master.

DrChainsaw commented 4 years ago

Interesting. I can pass the -o flag without problems. I get both stdout from bpeek and in the file.

I had the exact same symptom initially though and for me the root issue was that ports were not open.

Would it be a workable solution to turn some of the hardcoded flags into defaults of the LSFManager instead? I guess at some point it becomes painful to support every way users might try to launch jobs, e.g. in the bpeek function.

juliohm commented 4 years ago

@DrChainsaw the master version uses bsub as the launch command. Do you mean that you tried the master version as well? I think the less we do for users, the better. I only learned from an error that I couldn't pass -J or -o. So IMO we should document which flags are not permitted due to internal implementation details, and avoid adding more.

juliohm commented 4 years ago

Also, after trying a bit more with an actual script that is more involved than Hello World, the resources seem to be allocated but the script is not executed. Julia exits without errors. This is very hard to debug when the error output is being filtered.

juliohm commented 4 years ago

Given that the Hello World example is working, I will try to investigate and refactor the LSFManager in future PRs.

Closing this issue for now.

DrChainsaw commented 4 years ago

@juliohm I'm using the commit from September 12th.

bsub is the normal launch command at my place, and it seems to work in your case as well, since the jobs are started.

Perhaps try adding that -o (or -oo to overwrite the file instead of appending) and see if it really stops LSF from printing the output when bpeeking. At my place it does work:

$ bsub -oo lsfout.txt 'echo hi; sleep 10'

$ bpeek -f
<< output from stdout >>
hi

I have used this to debug issues in my scripts as well by adding -oo outfile to the bsub_flags in addprocs_lsf and sprinkling the code with logging.

If it does indeed block output, consider just using bpeek -f jobnr[X] to see the output from the script.