Closed juliohm closed 4 years ago
Try it without the -o flag. The Julia master process reads the standard output of the workers to get the IP address and port they're listening on. With -o, they all get concatenated into a single file. I use the -Ne flag instead.
Amazing. It is working finally! 🙏 So how can we instruct future users about these important details? Flags that are prohibited for it to work? Maybe the code should check the flags passed, and report something like "please do not use -J because we already pass it inside the manager, nor -o because we read stdout from workers"?
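One way the check proposed above could look is sketched below. It is only an illustration: check_bsub_flags and RESERVED_BSUB_FLAGS are hypothetical names, not part of ClusterManagers, and the list of reserved flags only reflects what was reported in this thread (-J and -o). Whether -o really causes trouble seems to depend on the site.

```julia
# Hypothetical helper, not part of ClusterManagers.jl.
# Maps each flag reported as problematic in this thread to the reason given for it.
const RESERVED_BSUB_FLAGS = Dict(
    "-J" => "the manager already passes a job name internally",
    "-o" => "the master reads worker stdout to find each worker's address and port",
)

# Throw an ArgumentError if `bsub_flags` contains a reserved flag as a
# separate argument; otherwise return the flags unchanged.
# Note: attached forms such as `-Jname` are not detected by this sketch.
function check_bsub_flags(bsub_flags::Cmd)
    for arg in bsub_flags.exec
        if haskey(RESERVED_BSUB_FLAGS, arg)
            reason = RESERVED_BSUB_FLAGS[arg]
            throw(ArgumentError("please do not pass $(arg) in bsub_flags: $(reason)"))
        end
    end
    return bsub_flags
end
```

A manager could call such a function on the user's flags before building the bsub command, so the failure shows up as a clear error instead of workers that silently never connect.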
Also, I think we need a new release of the package with this working version of the manager. The latest release is 50ish commits behind master.
Interesting. I can pass the -o flag without problems. I get stdout both from bpeek and in the file.
I had the exact same symptom initially though and for me the root issue was that ports were not open.
Would it be a workable solution to move some of the hardcoded flags etc to being defaults in the LsfManager instead? I guess at some point it becomes painful to support all possible ways users can attempt to launch jobs then, e.g. in the bpeek function.
@DrChainsaw the master version uses bsub as the launch command. You mean that you tried on the master version as well? I think the less we do for the users the better here. I learned after an error that I couldn't pass -J nor -o. So IMO we should document which flags are not permitted due to internal implementation details and avoid adding more.
Also, after trying a bit more with an actual script that is more involved than Hello World, the resources seem to be allocated but the script is not executed. Julia exits without errors. Very hard to debug when the error output is being filtered.
Given that the Hello World example is working, I will try to investigate and refactor the LSFManager in future PRs.
Closing this issue for now.
@juliohm I'm using the commit from September 12th. bsub is the normal launch command at my place, and it seems to work in your case as well, since jobs are started.
Perhaps try adding that -o (or -oo to overwrite the file instead of appending) and see if it really makes LSF not print the output when bpeeking. At my place it does work:
$ bsub -oo lsfout.txt 'echo hi; sleep 10'
$ bpeek -f
<< output from stdout >>
hi
I have used this to debug issues in my scripts as well by adding -oo outfile to the bsub_flags in addprocs_lsf and sprinkling the code with logging.
If it does indeed block output, consider just using bpeek -f jobnr[X] to see the output from the script.
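As a rough sketch of the debugging approach described above: the helper names below (debug_bsub_flags, logmsg) are made up for illustration, and the keyword form of addprocs_lsf shown in the comments is an assumption about the installed ClusterManagers version, so check the signature you actually have.

```julia
# Hypothetical helper: build the extra bsub flags that send job output to a file.
# Interpolating a string into a Cmd keeps it as a single argument.
debug_bsub_flags(outfile::AbstractString) = `-oo $outfile`

# Hypothetical logging helper: tag each message with the process id and flush
# immediately, so the line is visible even if the job dies shortly afterwards.
function logmsg(io::IO, msg::AbstractString)
    println(io, "[pid ", getpid(), "] ", msg)
    flush(io)
end
logmsg(msg::AbstractString) = logmsg(stderr, msg)

# On a cluster, assuming ClusterManagers is installed and addprocs_lsf accepts
# bsub_flags as a keyword, the flags might be passed along these lines:
#   using Distributed, ClusterManagers
#   addprocs_lsf(4; bsub_flags=debug_bsub_flags("lsfout.txt"))
```

The script can then call logmsg at the points of interest, and the output can be read from the file or with bpeek -f as shown earlier.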
It is really unfortunate that the LSFManager isn't working out of the box.
After adding ] add ClusterManagers#master, I submitted a trivial main.jl in the cluster by running julia main.jl. The LSF manager launches multiple jobs, but they fail to do the work. Is there any easy fix for this? How can I debug further what is happening if the workers don't even print to stdout?
Also, it would be nice to inform users that they cannot pass -J jobname to the bsub_flags option.