environmental-forecasting / model-ensembler

Model Ensemble tool for batch workflows on HPCs
https://pypi.org/project/model-ensembler/
MIT License

find_id in cluster.slurm improperly uses scontrol #20

Closed: JimCircadian closed this issue 2 years ago

JimCircadian commented 2 years ago

scontrol won't return jobs that have already left the queue. That becomes a problem if the status update marking a job as completed is missed, which is only likely to be serious when the status check timers are set too high. It still needs addressing, so that jobs that have left the queue are picked up and marked appropriately.
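One possible way to handle this, sketched here rather than taken from the model-ensembler code base, is to ask scontrol first and fall back to Slurm's accounting records via sacct when the job is no longer known to the controller. The function names `run_command` and `lookup_job_state` are hypothetical, and the fallback assumes job accounting is enabled on the cluster.

```python
import subprocess
from typing import Callable, List, Optional


def run_command(args: List[str]) -> str:
    """Run a command and return its stdout, or "" if it fails or is missing."""
    try:
        proc = subprocess.run(args, capture_output=True, text=True)
    except OSError:
        return ""
    return proc.stdout if proc.returncode == 0 else ""


def lookup_job_state(job_id: int,
                     run: Callable[[List[str]], str] = run_command) -> Optional[str]:
    """Hypothetical helper: return the job state, or None if nothing knows the job."""
    # 1. scontrol only reports jobs the controller still holds.
    out = run(["scontrol", "show", "job", str(job_id)])
    for token in out.split():
        key, _, value = token.partition("=")
        if key == "JobState" and value:
            return value

    # 2. Assumption: accounting is enabled, so sacct can report finished jobs.
    out = run(["sacct", "-j", str(job_id), "--format=State",
               "--noheader", "--parsable2"])
    for line in out.splitlines():
        state = line.strip()
        if state:
            # States such as "CANCELLED by 1234" carry a suffix; keep the first word.
            return state.split()[0]

    # Neither source knows the job; the caller decides how to mark the run.
    return None
```

Passing the command runner in as a parameter keeps the lookup testable without a Slurm installation; a caller could treat a `None` result as "job has left the queue, state unknown" instead of retrying indefinitely.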

JimCircadian commented 2 years ago

@CRosieWilliams identified this in WAVIhpc runs, so getting a fix rolled out ASAP


[15-03-22 10:22:14    :WARNING ] - Command returned err: None
[15-03-22 10:22:14    :ERROR   ] - Job status for run PIGTHW3km_sanity_checks_t100_runs-0 retrieval whilst slurm running, waiting and retrying
Traceback (most recent call last):
  File "/data/hpcdata/users/chll1/WAVI_Julia/WAVIhpc/venv/lib/python3.7/site-packages/model_ensembler/batcher.py", line 264, in run_batch_item
    job = await cluster.find_id(job_id)
  File "/data/hpcdata/users/chll1/WAVI_Julia/WAVIhpc/venv/lib/python3.7/site-packages/model_ensembler/cluster/slurm.py", line 46, in find_id
    if v.split("=")[0] == "JobName"][0],
IndexError: list index out of range

Or more specifically, one job can get stuck in that state and hold the whole thing up. If I see that, I quit and restart. Not sure if that's good practice, but it keeps it ticking over.
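The IndexError in the traceback above comes from indexing `[0]` into a list that is empty when the scontrol output contains no JobName field. A minimal sketch of a more defensive parse follows; `parse_scontrol_fields` and `job_name` are hypothetical names and this is not the actual fix applied in slurm.py.

```python
from typing import Dict, Optional


def parse_scontrol_fields(output: str) -> Dict[str, str]:
    """Split whitespace-separated Key=Value tokens into a dict.

    Caveat: values that themselves contain spaces are truncated at the
    first space, because the output is split on whitespace.
    """
    fields: Dict[str, str] = {}
    for token in output.split():
        key, sep, value = token.partition("=")
        if sep:
            # Keep the first occurrence of each key.
            fields.setdefault(key, value)
    return fields


def job_name(output: str) -> Optional[str]:
    """Return the JobName field, or None when scontrol returned nothing useful."""
    return parse_scontrol_fields(output).get("JobName")
```

With this shape, empty output yields `None` instead of raising, so the caller can tell "job not found" apart from a genuine failure.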