environmental-forecasting / model-ensembler

Model Ensemble tool for batch workflows on HPCs
https://pypi.org/project/model-ensembler/
MIT License

Random-seeming clear-out of run directory mid-run #29

Closed CRosieWilliams closed 2 years ago

CRosieWilliams commented 2 years ago

When running the ensembler with WAVI, at some pickups during the run the run directory is cleaned out and the model starts from t=0 again. I can't see why it does this, and it doesn't happen at the same point each time. It's not because of a NODE_FAIL (the node doesn't fail), and it's not because it can't find a checkpoint: at the point the run directory is emptied, it contains loads of checkpoints. It's a bit of a mystery because, by the nature of the problem, it deletes everything except the .out files, so it's hard to see what's happening.

CRosieWilliams commented 2 years ago

Oh, I think it could be happening because the last submission of the job didn't create a new checkpoint. I'm not sure why it can't pick up the checkpoint from the previous job, since that's still there. Of course, that could lead to it resubmitting this job forever, but that would be better than it cleaning out everything it's already done.

JimCircadian commented 2 years ago

This is an interesting one @CRosieWilliams: when you refer to the checkpoints, are they the permanent checkpoints PChkpt_*.jld2? These are the only ones processed by the WAVIhpc workflow code. Also, it will clear out the files if LAST_EXIT is missing: even if there are checkpoints, it relies on the last exit file to determine whether the previous run was, er, run.

Somehow we've hit an eventuality (again) where LAST_EXIT isn't produced. There has been a recent SLURM change on the cluster that could be sending the WAVI workflow code down a new path in which LAST_EXIT isn't produced (or the change might even have fixed it), but we'll need to determine how the job didn't finish properly.
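
For anyone else hitting this, the pre-run decision is roughly as follows. This is a simplified sketch rather than the actual WAVIhpc script: only LAST_EXIT and the PChkpt_*.jld2 checkpoints come from the workflow described above, and the other file names are illustrative.

#!/usr/bin/env bash
# Simplified sketch of the pre-run step: the presence of LAST_EXIT
# (written at the end of the previous submission) is what signals that
# the previous run completed; the checkpoints alone are not consulted
# for that decision.
if [ -f LAST_EXIT ]; then
    rm -v LAST_EXIT
    # Resume from the most recent permanent checkpoint, if one exists.
    latest=$(ls -1 PChkpt_*.jld2 2>/dev/null | tail -n 1)
    if [ -n "$latest" ]; then
        echo "job chain: pickup from permanent checkpoint"
    fi
else
    # No LAST_EXIT: treat the directory as stale and start from scratch,
    # even if permanent checkpoints are sitting there.
    echo "Cleaning run directory"
    rm -v driver.jl outfile*.mat PChkpt_*.jld2 2>/dev/null
fi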

CRosieWilliams commented 2 years ago

Yes I'm referring to the permanent checkpoints, exactly.

Ah yes, I think it probably is the missing LAST_EXIT that causes the problem, then. So it must sometimes produce LAST_EXIT and sometimes not... I think it could be down to the SLURM changes, yes.

What's the best way to go about fixing this? It's quite a major problem right now. Should I ask servicedesk?

JimCircadian commented 2 years ago

I'll give you a shout on Slack @CRosieWilliams; we'll need to identify the logic that causes the runs to exit without producing an exit code!

JimCircadian commented 2 years ago

This appears to be a case of configuration changes on the cluster upsetting the WAVIhpc signal handling / last exit code. Probably not an ensembler issue, but I'll keep this open until confirmed...

JimCircadian commented 2 years ago

This is being caused by the WAVIhpc handling not accounting for SIGKILL being issued by the scheduler, so it won't need to be accounted for in the ensembler.
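
To illustrate why that matters: if LAST_EXIT is written from a shell trap, a SIGKILL from the scheduler means the trap never fires, since SIGKILL cannot be caught. The below is a simplified sketch, not the actual WAVIhpc code, and the driver invocation is illustrative.

#!/usr/bin/env bash
# Record how the run finished so the next submission in the chain
# knows the previous run actually completed.
write_last_exit() {
    echo "$1" > LAST_EXIT
}

# Handled cases: normal exit, and SIGTERM (which SLURM sends first
# when a job reaches its time limit).
trap 'write_last_exit $?' EXIT
trap 'write_last_exit 143; exit 143' TERM

# SIGKILL, however, can never be trapped. If the scheduler escalates
# to SIGKILL, the script dies without writing LAST_EXIT, and the next
# submission's pre-run sees "no LAST_EXIT" and cleans the run directory.
julia driver.jl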

JimCircadian commented 2 years ago

For posterity, this manifested thus:

cat job.4389994.node004.{out,err}
Pre-run script found, running
removed ‘LAST_EXIT’
job chain: pickup from permanent checkpoint
niter0 = 70
slurmstepd: error: *** JOB 4389994 ON node004 CANCELLED AT 2022-03-14T22:10:02 DUE TO TIME LIMIT ***

head job.4390522.node004.out
Pre-run script found, running
Cleaning run directory
removed ‘driver.jl’
removed ‘outfile0000000010.mat’
removed ‘outfile0000000020.mat’
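
The scheduler's side can be confirmed from the accounting record for the first job, assuming sacct / job accounting is available on the cluster:

sacct -j 4389994 --format=JobID,JobName,State,ExitCode,Elapsed,Timelimit
# A TIMEOUT/CANCELLED state here confirms the scheduler killed the job,
# which is why no LAST_EXIT was left behind for the next submission.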

JimCircadian commented 2 years ago

I've closed this since we've isolated and fixed the issue as part of WAVIhpc development. Thankfully (for the ensembler) the issue has more to do with the workflow implementation and cluster behaviour than with anything the ensembler itself does...