PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org

HPC jobs get checked too often #2991

Open Aariq opened 2 years ago

Aariq commented 2 years ago

Bug Description

qsub_run_finished() has no delay, so it checks whether remote jobs are done roughly every 30 s. This floods the console and probably annoys the HPC admins. The delay could be added either inside qsub_run_finished() itself (hard-coded or as an argument), or in start_model_runs() here:

https://github.com/PecanProject/pecan/blob/4113859c3c88231234aa3f363e2cbe7d99b09b9c/base/workflow/R/start_model_runs.R#L267
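A minimal sketch of the second option: throttle the polling loop with Sys.sleep() between status checks. This is not the actual start_model_runs() code; poll_until_finished, is_finished, and the delay argument are hypothetical names for illustration.

```r
# Hypothetical sketch: sleep between status checks instead of polling
# continuously. `is_finished` stands in for qsub_run_finished().
poll_until_finished <- function(is_finished, delay = 30, max_checks = 1000) {
  for (i in seq_len(max_checks)) {
    if (is_finished()) {
      return(invisible(TRUE))
    }
    message("job not finished, checking again in ", delay, " s")
    Sys.sleep(delay)
  }
  invisible(FALSE)
}
```

With delay as an argument, users on busy clusters could raise it well above 30 s without touching the package code.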

To Reproduce

Run an ED2 model (for example) on an HPC

Expected behavior

Less logger output, and perhaps a message like "job not finished, checking again in <n> s"

Screenshots

[Screenshot: repeated job-status checks flooding the console]

Aariq commented 2 years ago

It also waits too long!

When the job is done, squeue keeps displaying an empty table for a long time, which does not trigger qsub_run_finished()

For example, when the HPC run is done and the ED2 output exists:

[Screenshot: ED2 output files present on the HPC]

Then qsub_run_finished() keeps checking for a long time:

[Screenshot: qsub_run_finished() still polling after the job has completed]

Eventually, once squeue returns an error, qsub_run_finished() is finally triggered:

[Screenshot: squeue error output]

[Screenshot: qsub_run_finished() triggered after the squeue error]
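One possible fix, sketched below under assumptions: treat an empty squeue table (header only, no job rows) the same as a missing job, rather than waiting for squeue to error out. job_finished and squeue_output are hypothetical names; squeue_output stands in for the captured stdout of a call like ssh -T puma squeue --job <id>.

```r
# Hypothetical sketch: the job counts as finished as soon as its ID no
# longer appears in any data row of the squeue output, so an empty table
# is detected immediately instead of only after squeue errors.
job_finished <- function(squeue_output, job_id) {
  # Drop the header line; any remaining row mentioning the job ID means
  # the job is still queued or running.
  rows <- squeue_output[-1]
  !any(grepl(as.character(job_id), rows, fixed = TRUE))
}
```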

I wonder if there's a better way to do this, using (or inspired by) batchtools.

Aariq commented 2 years ago

Ah, so here's what's actually going on. If modellauncher is being used, there is only one job ID, but the code that checks whether the job is finished currently sits in a for-loop over every ensemble member (even though they all share the same SLURM job). So with 100 ensembles, it runs the ssh -T puma squeue --job 4246012 command 100 times every 10 seconds.
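A sketch of the deduplication this implies: poll each unique job ID once per cycle and map the result back onto the ensemble members. check_jobs_once and check_job are hypothetical names, not existing PEcAn functions; check_job stands in for the ssh/squeue call.

```r
# Hypothetical sketch: with modellauncher, many ensemble members share one
# SLURM job ID, so query each unique ID once per polling cycle instead of
# once per ensemble member.
check_jobs_once <- function(ensemble_job_ids, check_job) {
  unique_ids <- unique(ensemble_job_ids)
  # One remote query per distinct job ID.
  status <- vapply(unique_ids, check_job, logical(1))
  # Map the per-job status back onto every ensemble member.
  status[match(ensemble_job_ids, unique_ids)]
}
```

For 100 ensembles sharing one job, this turns 100 squeue calls per cycle into 1.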

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 365 days with no activity.