environmental-forecasting / model-ensembler

Model Ensemble tool for batch workflows on HPCs
https://pypi.org/project/model-ensembler/
MIT License

Job submission detection error #31

Closed JimCircadian closed 2 years ago

JimCircadian commented 2 years ago

The Slurm accounting daemon doesn't always register the job quickly enough, which leads to a gnarly status. Make the handling a bit more defensive (and the output prettier) when the job has yet to be detected:

[23-03-22 15:01:58    :INFO    ] - Submitted job with ID 4448347
[23-03-22 15:01:58    :DEBUG   ] - Executing command sacct -XnP -j 4448347 -o jobname,state,start,end, cwd unset
[23-03-22 15:01:58    :DEBUG   ] - Executing command sbatch scripts/run_ensemble_member, cwd /data/hpcdata/users/jambyr/wavi/WAVIhpc/cases/10k_test-4
[23-03-22 15:01:58    :DEBUG   ] - Executing command sbatch scripts/run_ensemble_member, cwd /data/hpcdata/users/jambyr/wavi/WAVIhpc/cases/10k_test-5
[23-03-22 15:01:58    :DEBUG   ] - Command successful
[23-03-22 15:01:58    :INFO    ] - Submitted job with ID 4448348
[23-03-22 15:01:58    :DEBUG   ] - Executing command sacct -XnP -j 4448348 -o jobname,state,start,end, cwd unset
[23-03-22 15:01:58    :DEBUG   ] - Command successful
[23-03-22 15:01:58    :WARNING ] - Job 4448347 not registered yet, or error encountered
[23-03-22 15:01:58    :ERROR   ] - not enough values to unpack (expected 4, got 1)
Traceback (most recent call last):
  File "/data/hpcdata/users/jambyr/wavi/WAVIhpc/venv/lib/python3.7/site-packages/model_ensembler/batcher.py", line 249, in run_batch_item
    job = await cluster.find_id(job_id)
  File "/data/hpcdata/users/jambyr/wavi/WAVIhpc/venv/lib/python3.7/site-packages/model_ensembler/cluster/slurm.py", line 45, in find_id
    (name, state, started, finished) = output.split("|")
ValueError: not enough values to unpack (expected 4, got 1)
JimCircadian commented 2 years ago

This is caused by the error propagating out of find_id (see the traceback above) to the batcher itself, which handles it fine. Since the traceback is a bit ugly and the failure is Slurm-specific, it should be caught inside find_id.
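
A minimal sketch of what catching this inside the Slurm code could look like, assuming the parsing step is pulled into a helper. The `Job` tuple and `parse_sacct_output` are hypothetical names for illustration, not the project's actual API. Splitting empty `sacct` output on `|` yields a single empty string, which is what produces "expected 4, got 1", so the sketch checks the field count before unpacking and returns `None` when the job has not been registered yet.

```python
import logging
from collections import namedtuple
from typing import Optional

# Hypothetical record type; the project's real job representation may differ.
Job = namedtuple("Job", ["name", "state", "started", "finished"])


def parse_sacct_output(job_id: str, output: str) -> Optional[Job]:
    """Parse output of `sacct -XnP -j <id> -o jobname,state,start,end`.

    Returns None instead of raising when the accounting daemon has not
    registered the job yet (empty output) or the line is malformed.
    """
    # Drop blank lines; sacct prints nothing for a job it does not know about.
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        logging.warning("Job %s not registered yet", job_id)
        return None

    # With -P the fields are pipe-delimited; expect exactly the four requested.
    fields = lines[0].split("|")
    if len(fields) != 4:
        logging.warning("Unexpected sacct output for job %s: %r", job_id, lines[0])
        return None

    return Job(*fields)
```

A caller such as the batcher could then treat `None` as "not visible yet" and poll again later, rather than relying on a `ValueError` bubbling up from the unpacking.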