JuliaParallel / ClusterManagers.jl

-o argument in addprocs_slurm leads to an error #185

Open stasis0 opened 1 year ago

stasis0 commented 1 year ago

Hello everyone,

To add workers and schedule jobs on the cluster, I'm using the addprocs_slurm function from ClusterManagers:

slurm_cpus = 4
@async addprocs(SlurmManager(slurm_cpus), partition="all", t="00:10:0")

It works as intended:

Task (runnable) @0x00002b8be08c5cd0connecting to worker 1 out of 4

srun: job 13332841 queued and waiting for resources

julia> srun: job 13332841 has been allocated resources
connecting to worker 2 out of 4
connecting to worker 3 out of 4
connecting to worker 4 out of 4

However, with many workers, one output file per worker appears in the working directory. To log everything into a single file, I added the -o argument:

slurm_cpus = 4
@async addprocs(SlurmManager(slurm_cpus), partition="all", t="00:10:0", o="log.out")

It does create the log file, with the following contents:

julia_worker:9007#131.169.193.109
julia_worker:9006#131.169.193.109
julia_worker:9008#131.169.193.109
julia_worker:9009#131.169.193.109
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.

but no workers are added:

Task (runnable) @0x00002b8be01f7260connecting to worker 1 out of 4

srun: job 13332876 queued and waiting for resources

julia> srun: job 13332876 has been allocated resources
srun: error: max-wn009: tasks 0-3: Exited with exit code 1

I had a look at the source code. If I understand correctly, it sets its own values for -o and -D regardless of what I pass. Maybe that is what causes the trouble:

jobname = "julia-$(getpid())"
job_output_name = "$(jobname)-$(trunc(Int, Base.time() * 10))"
make_job_output_path(task_num) = joinpath(job_file_loc, "$(job_output_name)-$(task_num).out")
job_output_template = make_job_output_path("%4t")
srun_cmd = `srun -J $jobname -n $np -o "$(job_output_template)" -D $exehome $(srunargs) $exename $exeflags $(worker_arg())`
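To illustrate the suspicion, here is a minimal sketch of how a user-supplied `o` keyword could end up as a second `-o` flag on the command line. The helper `kwargs_to_flags` and the stand-in values are hypothetical and only mimic the shape of the snippet above; they are not the package's actual code. The sketch only builds the command object and never runs `srun`, so it works without Slurm.

```julia
# Hypothetical helper: turn keyword arguments into srun-style flags.
# It mirrors the general idea, not necessarily the package's exact logic.
function kwargs_to_flags(; kwargs...)
    flags = String[]
    for (k, v) in kwargs
        name = string(k)
        if length(name) == 1
            push!(flags, "-$name")      # short option, e.g. -t
            push!(flags, string(v))     # its value as a separate argument
        else
            # long option, e.g. --partition=all
            push!(flags, "--$(replace(name, "_" => "-"))=$(v)")
        end
    end
    return flags
end

# Assumed stand-ins for values that the package computes internally.
jobname = "julia-$(getpid())"
job_output_template = joinpath(pwd(), "$(jobname)-%4t.out")
np = 4

# The keyword arguments from the failing call.
srunargs = kwargs_to_flags(partition="all", t="00:10:0", o="log.out")

# Same shape as the srun_cmd line above, minus -D and the worker executable.
srun_cmd = `srun -J $jobname -n $np -o $job_output_template $srunargs`

# Count how many times -o appears among the command's arguments.
n_output_flags = count(==("-o"), srun_cmd.exec)
println(srun_cmd)
println("number of -o flags: ", n_output_flags)
```

With these inputs the command carries two `-o` flags: the built-in per-task template and `log.out`. The log above suggests that the user-supplied one takes effect. One possible explanation for the timeout (an assumption, not confirmed by the snippet) is that the manager waits for the per-task files it named itself in order to learn each worker's host and port, and those files are never written once the output is redirected to `log.out`.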