Open Aariq opened 2 years ago
It also waits too long! When the job is done, `squeue` displays an empty table for a long time, which doesn't trigger `qsub_run_finished()`. For example, when the HPC run is done and the ED2 output exists, `qsub_run_finished()` keeps checking for a long time. Eventually, once `squeue` returns an error, `qsub_run_finished()` is triggered:
I wonder if there's a better way to do this using or inspired by batchtools
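One possible workaround for the empty-table window: query `sacct`, which still reports a final state (COMPLETED, FAILED, etc.) after `squeue` stops listing the job, instead of waiting for `squeue` to error out. A rough sketch, assuming passwordless ssh to the host (the function name is hypothetical, not PEcAn's actual API):

```r
# Hypothetical sketch: check the final job state with sacct, which keeps
# reporting finished jobs even after squeue shows an empty table.
qsub_job_done <- function(job_id, host = "puma") {
  state <- system2("ssh", c("-T", host, "sacct", "-j", job_id,
                            "-n",          # no header line
                            "-X",          # allocation rows only, not job steps
                            "-o", "State"),
                   stdout = TRUE)
  any(grepl("COMPLETED|FAILED|CANCELLED|TIMEOUT", state))
}
```

This would let the workflow detect completion as soon as SLURM records it, rather than polling until `squeue` errors.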
Ah, so here's what's actually going on. If modellauncher is being used, there's only one job ID, but the code that checks whether the job is finished currently sits in a for-loop that runs for every ensemble (even though they all share the same SLURM job). So if you have 100 ensembles, it's running the `ssh -T puma squeue --job 4246012` command 100 times every 10 seconds.
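A minimal sketch of one way to fix this (the helper names are hypothetical, and PEcAn's real loop structure may differ): collect the job IDs first, then poll each unique ID once per cycle and propagate the result to every ensemble that shares it:

```r
# Hypothetical sketch: one squeue call per unique SLURM job per cycle,
# instead of one call per ensemble. `check_qsub_job` stands in for the
# ssh + squeue check done by qsub_run_finished().
check_unique_jobs <- function(job_ids, check_qsub_job) {
  done <- logical(length(job_ids))
  for (id in unique(job_ids)) {
    finished <- check_qsub_job(id)     # single remote check for this job
    done[job_ids == id] <- finished    # applies to all ensembles sharing it
  }
  done
}
```

With modellauncher this collapses 100 remote calls per cycle down to one.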
This issue is stale because it has been open 365 days with no activity.
Bug Description
`qsub_run_finished()` has no delay, so it checks whether remote jobs are done about every 30s. This floods the console and probably annoys the HPC people. The delay could be put either in the `qsub_run_finished()` function (hard-coded or as an argument) or in `start_model_runs()` here: https://github.com/PecanProject/pecan/blob/4113859c3c88231234aa3f363e2cbe7d99b09b9c/base/workflow/R/start_model_runs.R#L267
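For the delay itself, one option is a simple `Sys.sleep()` between polling cycles; a sketch under the "argument" variant (the function and parameter names here are hypothetical, not PEcAn's actual signatures):

```r
# Hypothetical sketch: throttle the polling loop with a configurable
# delay (in seconds) between remote checks, and log a short message
# instead of flooding the console on every check.
wait_for_jobs <- function(job_ids, check_qsub_job, delay = 30) {
  repeat {
    if (all(vapply(unique(job_ids), check_qsub_job, logical(1)))) break
    PEcAn.logger::logger.info("job not finished, checking again in", delay, "s")
    Sys.sleep(delay)
  }
}
```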
To Reproduce
Run an ED2 model (for example) on an HPC
Expected behavior
Less logger output, maybe a message like "job not finished, checking again in s"
Screenshots