CliMA / ClimaCalibrate.jl

Calibration pipeline for ClimaAtmos.jl
Apache License 2.0

`sacct` cmd execution crashes a calibration pipeline if slurmdbd is down #115

Open nefrathenrici opened 2 weeks ago

nefrathenrici commented 2 weeks ago

Running `sacct` fails when the Slurm database daemon (`slurmdbd`) is down, and the uncaught error causes the pipeline to exit.

If `sacct` errors, we should catch the error and fall back to `squeue`. We should then warn the user, because without `sacct` we won't be able to determine whether a completed job succeeded.

Error:

```
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to host:head1:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
ERROR: LoadError: failed process: Process(`sacct --allocations -u esmbuild --starttime now-1hour -o Submit,Start -n`, ProcessExited(1)) [1]
Stacktrace:
 [1] pipeline_error
   @ ./process.jl:565 [inlined]
 [2] read(cmd::Cmd)
   @ Base ./process.jl:449
 [3] read
   @ ./process.jl:458 [inlined]
 [4] readchomp(x::Cmd)
   @ Base ./io.jl:974
 [5] top-level scope
   @ /central/scratch/esm/slurm-buildkite/climaatmos-ci/21206/climaatmos-ci/calibration/test/e2e_test.jl:108
in expression starting at /central/scratch/esm/slurm-buildkite/climaatmos-ci/21206/climaatmos-ci/calibration/test/e2e_test.jl:107
```
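
For discussion, here is a minimal sketch of the catch-and-fall-back behaviour proposed above. It is not ClimaCalibrate.jl code: the names `try_readchomp`, `parse_state`, and `job_state` are hypothetical, as is the `"UNKNOWN"` return value. The `sacct` and `squeue` invocations are one plausible way to query a single job's state and may differ from what the pipeline actually runs.

```julia
# Hypothetical helpers for illustration only; not part of ClimaCalibrate.jl.

# Run `cmd` and return its trimmed stdout, or `nothing` if the command
# exits with a nonzero status or cannot be spawned at all.
function try_readchomp(cmd::Cmd)
    try
        return readchomp(cmd)
    catch e
        # Nonzero exit raises ProcessFailedException (the error in this issue);
        # a missing executable raises Base.IOError. Anything else is rethrown.
        e isa Union{ProcessFailedException, Base.IOError} || rethrow()
        return nothing
    end
end

# Take the first whitespace-separated token of the output as the job state.
function parse_state(out::AbstractString)
    tokens = split(out)
    return isempty(tokens) ? "UNKNOWN" : String(first(tokens))
end

# Query a job's state with `sacct`, falling back to `squeue` if `sacct` fails.
# The commands are keyword arguments so they can be swapped out in tests.
function job_state(
    jobid;
    sacct = `sacct --allocations -j $jobid -o State -n`,
    squeue = `squeue -j $jobid -o %T -h`,
)
    out = try_readchomp(sacct)
    out === nothing || return parse_state(out)

    @warn "sacct failed (is slurmdbd down?); falling back to squeue. " *
          "Whether a completed job succeeded cannot be determined."
    out = try_readchomp(squeue)
    # A job that has left the queue yields an error or empty output from squeue,
    # so its outcome is reported as unknown rather than guessed.
    return out === nothing ? "UNKNOWN" : parse_state(out)
end
```

With this shape, a `slurmdbd` outage produces a warning instead of a `LoadError`, and the caller has to decide how to treat `"UNKNOWN"` (for example, by checking for the job's expected output files).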