JuliaParallel / ClusterManagers.jl


Make SLURM worker startup more robust and provide more feedback #200

Closed oschulz closed 5 months ago

oschulz commented 5 months ago

Builds on top of #199.

Before (but with fix in #199):

# ... wait (but what's going on?) ...

connecting to worker 1 out of 12
connecting to worker 2 out of 12
connecting to worker 3 out of 12
connecting to worker 4 out of 12
connecting to worker 5 out of 12
connecting to worker 6 out of 12
connecting to worker 7 out of 12
connecting to worker 8 out of 12
connecting to worker 9 out of 12
connecting to worker 10 out of 12
connecting to worker 11 out of 12
connecting to worker 12 out of 12

After:

[ Info: Starting SLURM job julia-26323452: `srun -J julia-26323452 -n 12 -D /homedir/some/dir --cpus-per-task=8 --mem-per-cpu=8G --cpu-bind=cores --mem-bind=local -o /homedir/slurm-julia-output/julia-26323452-12983479872-%4t.out /path/to/bin/julia --project=/homedir/.julia/environments/someenv --threads=8 --heap-size-hint=34359738368 --worker=qy8ZReqHiDfwjq6a`
[ Info: Worker 0 (after 0 s): No output file "/homedir/slurm-julia-output/julia-26323452-12983479872-0000.out" yet
[ Info: Worker 0 (after 1 s): Output file found, but no connection details yet
[ Info: Worker 0 (after 2 s): Output file found, but no connection details yet
[ Info: Worker 0 (after 4 s): Output file found, but no connection details yet
[ Info: Worker 0 (after 6 s): Output file found, but no connection details yet
[ Info: Worker 0 ready after 10 s on host 149.217.13.126, port 9101
[ Info: Worker 1 ready after 10 s on host 149.217.13.126, port 9102
[ Info: Worker 2 ready after 10 s on host 149.217.13.126, port 9103
[ Info: Worker 3 ready after 11 s on host 149.217.13.126, port 9104
[ Info: Worker 4 ready after 11 s on host 149.217.13.126, port 9105
[ Info: Worker 5 ready after 11 s on host 149.217.13.126, port 9106
[ Info: Worker 6 ready after 12 s on host 149.217.13.126, port 9107
[ Info: Worker 7 ready after 12 s on host 149.217.13.126, port 9108
[ Info: Worker 8 ready after 12 s on host 149.217.13.126, port 9109
[ Info: Worker 9 ready after 12 s on host 149.217.13.126, port 9110
[ Info: Worker 10 ready after 12 s on host 149.217.13.126, port 9111
[ Info: Worker 11 ready after 12 s on host 149.217.13.126, port 9112
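
The "After" log suggests a polling loop: the manager waits for each worker's output file to appear, then re-reads it at growing intervals until a connection line shows up. The following is a minimal sketch of that idea, not the code from this PR. The names `output_file`, `parse_connection`, and `wait_for_worker` are hypothetical, as are the `timeout` and `max_delay` parameters. It assumes the worker announces itself with a line of the form `julia_worker:PORT#HOST`, and that the `%4t` in the `srun -o` pattern expands to the zero-padded task number, as the `-0000.out` file name in the log indicates.

```julia
# Hypothetical helpers for illustration; not the ClusterManagers.jl implementation.

# Mimic the "%4t" expansion seen in the log: task 0 -> "<prefix>-0000.out".
output_file(prefix::AbstractString, task::Integer) =
    string(prefix, "-", lpad(task, 4, '0'), ".out")

# Extract host and port from a line such as "julia_worker:9101#149.217.13.126".
# Returns `nothing` when the line carries no connection details.
function parse_connection(line::AbstractString)
    m = match(r"julia_worker:(\d+)#(\S+)", line)
    m === nothing && return nothing
    return (host = String(m.captures[2]), port = parse(Int, m.captures[1]))
end

# Poll `outfile` until connection details appear, logging progress each round.
# The delay doubles after every attempt, capped at `max_delay` seconds.
function wait_for_worker(outfile::AbstractString, idx::Integer;
                         timeout::Real = 60.0, max_delay::Real = 4.0)
    t0 = time()
    delay = 1.0
    while true
        elapsed = round(Int, time() - t0)
        if !isfile(outfile)
            @info "Worker $idx (after $elapsed s): No output file \"$outfile\" yet"
        else
            for line in readlines(outfile)
                conn = parse_connection(line)
                if conn !== nothing
                    @info "Worker $idx ready after $elapsed s on host $(conn.host), port $(conn.port)"
                    return conn
                end
            end
            @info "Worker $idx (after $elapsed s): Output file found, but no connection details yet"
        end
        # Give up with an error instead of waiting forever on a failed job.
        time() - t0 > timeout && error("Worker $idx did not report connection details within $timeout s")
        sleep(delay)
        delay = min(2 * delay, max_delay)
    end
end
```

Calling `wait_for_worker(output_file(prefix, i), i)` for each task index `i` would produce per-worker messages in the style shown above. Failing with an error after `timeout` seconds is one possible design choice for surfacing a job that never starts.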

Moelf commented 5 months ago

LGTM

oschulz commented 5 months ago

Good to merge from my side (I'm looking into ElasticManager next, following advice from @JBlaschke).

kescobo commented 5 months ago

Should we cut a release on this, or do you have more to do before that?

oschulz commented 5 months ago

> Should we cut a release on this, or do you have more to do before that?

Thanks, yes. I have some people who need to use this, I think it's good for now.

kescobo commented 5 months ago

https://github.com/JuliaRegistries/General/pull/105240

oschulz commented 5 months ago

Thanks, @kescobo!