NREL / buildstockbatch

Jobs get canceled due to time limit #439

Closed yingli-NREL closed 4 weeks ago

yingli-NREL commented 8 months ago

Describe the bug: When running ResStock on Kestrel, some of my jobs failed with a time limit error. A few weeks ago, only a few jobs (around 5 of 50) failed. This week the problem has become more serious: in each batch, 2 of 3 or all 3 jobs failed. The error message in job.out-* is:

DEBUG:2024-03-04 16:35:42:buildstockbatch.base:Using OpenStudio version: 3.7.0 with SHA: d5269793f1
DEBUG:2024-03-04 16:35:42:__main__:Output directory = /kfs2/projects/redlineres/tcm/summer_phoenix_tcm1_0304
slurmstepd: error: *** JOB 2828658 ON x3001c0s33b0n0 CANCELLED AT 2024-03-04T16:45:29 DUE TO TIME LIMIT ***

Even among the jobs that finished successfully, some took about 10 minutes to complete the run:

DEBUG:2024-03-05 12:48:35:buildstockbatch.base:Using OpenStudio version: 3.7.0 with SHA: d5269793f1
DEBUG:2024-03-05 12:48:35:__main__:Output directory = /kfs2/projects/redlineres/tcm/winter_boston_tcm1_0305
DEBUG:2024-03-05 12:58:07:__main__:Trimming buildstock.csv
DEBUG:2024-03-05 12:58:07:__main__:Buildstock.csv trimmed to 168 rows.
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 104 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 out of 208 | elapsed:   16.2s remaining:  2.0min
[Parallel(n_jobs=-1)]: Done  49 out of 208 | elapsed:   17.9s remaining:   58.2s
[Parallel(n_jobs=-1)]: Done  73 out of 208 | elapsed:   20.8s remaining:   38.5s
[Parallel(n_jobs=-1)]: Done  97 out of 208 | elapsed:   23.7s remaining:   27.2s
[Parallel(n_jobs=-1)]: Done 121 out of 208 | elapsed:   28.4s remaining:   20.4s
[Parallel(n_jobs=-1)]: Done 145 out of 208 | elapsed:   30.7s remaining:   13.4s
[Parallel(n_jobs=-1)]: Done 169 out of 208 | elapsed:   32.5s remaining:    7.5s
[Parallel(n_jobs=-1)]: Done 193 out of 208 | elapsed:   34.0s remaining:    2.6s
[Parallel(n_jobs=-1)]: Done 208 out of 208 | elapsed:   37.5s finished
INFO:2024-03-05 12:58:45:__main__:Simulation time: 0.63 minutes
INFO:2024-03-05 12:58:45:__main__:Writing results to /kfs2/projects/redlineres/tcm/winter_boston_tcm1_0305/results/simulation_output/results_job1.json.gz
INFO:2024-03-05 12:58:45:__main__:Compressing simulation outputs to /kfs2/projects/redlineres/tcm/winter_boston_tcm1_0305/results/simulation_output/simulations_job1.tar.gz
INFO:2024-03-05 12:58:46:__main__:batch complete
INFO:2024-03-05 12:58:46:__main__:Cleaning up /tmp/scratch
DEBUG:2024-03-05 12:58:46:__main__:Removing /tmp/scratch/buildstock
DEBUG:2024-03-05 12:58:46:__main__:Removing /tmp/scratch/weather
DEBUG:2024-03-05 12:58:47:__main__:Removing /tmp/scratch/output
DEBUG:2024-03-05 12:58:47:__main__:Removing /tmp/scratch/housing_characteristics
DEBUG:2024-03-05 12:58:47:__main__:Removing /tmp/scratch/openstudio.simg

real    10m31.240s
user    39m14.246s
sys     11m14.233s
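
To see where the wall-clock time goes in a log like the one above, one option is to look for the longest pause between consecutive timestamped lines. The following is only a rough sketch: it assumes the `LEVEL:YYYY-MM-DD HH:MM:SS:logger:message` layout seen in the excerpt, and `largest_gap` is a hypothetical helper, not part of buildstockbatch.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above:
#   DEBUG:2024-03-05 12:48:35:__main__:Output directory = ...
LOG_LINE = re.compile(
    r"^(?:DEBUG|INFO|WARNING|ERROR|CRITICAL):"
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):"
    r"([^:]+):(.*)$"
)


def largest_gap(lines):
    """Return (gap_seconds, message_before, message_after) for the longest pause.

    Lines that do not match the assumed layout (joblib progress, slurmstepd
    output, `time` results) are skipped. Returns None if fewer than two
    timestamped lines are found.
    """
    stamped = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            stamped.append((when, match.group(3)))

    best = None
    for (t0, msg0), (t1, msg1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, msg0, msg1)
    return best


# A few lines copied from the successful run above.
sample = [
    "DEBUG:2024-03-05 12:48:35:buildstockbatch.base:Using OpenStudio version: 3.7.0 with SHA: d5269793f1",
    "DEBUG:2024-03-05 12:48:35:__main__:Output directory = /kfs2/projects/redlineres/tcm/winter_boston_tcm1_0305",
    "DEBUG:2024-03-05 12:58:07:__main__:Trimming buildstock.csv",
    "[Parallel(n_jobs=-1)]: Done 208 out of 208 | elapsed:   37.5s finished",
    "INFO:2024-03-05 12:58:45:__main__:Simulation time: 0.63 minutes",
]

gap_seconds, before, after = largest_gap(sample)
print(f"{gap_seconds / 60:.1f} min between {before!r} and {after!r}")
```

On these lines the longest pause is about 9.5 minutes, between the "Output directory" message and "Trimming buildstock.csv", while the simulations themselves account for well under a minute.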

Platform:

Workaround: Increase minutes_per_sim to a larger value. For example, I use minutes_per_sim=6 for a simulation with 600 models. If only a few jobs failed, rerun the failed jobs: https://buildstockbatch.readthedocs.io/en/stable/run_sims.html#re-running-failed-array-jobs
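
To illustrate why raising minutes_per_sim can help even when the simulations themselves are fast, here is a back-of-the-envelope sketch. It assumes the job's time limit scales as the number of simulation rounds per worker times minutes_per_sim; I have not verified that this is how buildstockbatch derives the Slurm time limit. `estimated_walltime_minutes` and `fits_in_budget` are hypothetical helpers, and minutes_per_sim=5 is a made-up comparison value. The other numbers come from the log above (208 simulations, 104 workers, roughly 9.5 minutes before "Trimming buildstock.csv", 0.63 minutes of simulation time).

```python
import math


def estimated_walltime_minutes(n_sims, n_workers, minutes_per_sim):
    """Assumed budget: rounds of simulations per worker times minutes_per_sim."""
    rounds = math.ceil(n_sims / n_workers)
    return math.ceil(rounds * minutes_per_sim)


def fits_in_budget(n_sims, n_workers, minutes_per_sim, startup_minutes, sim_minutes):
    """True if fixed startup overhead plus simulation time stays within the budget."""
    budget = estimated_walltime_minutes(n_sims, n_workers, minutes_per_sim)
    return startup_minutes + sim_minutes <= budget


# Values read from the successful run's log; minutes_per_sim values are examples.
for minutes_per_sim in (5, 6):
    budget = estimated_walltime_minutes(208, 104, minutes_per_sim)
    ok = fits_in_budget(208, 104, minutes_per_sim, startup_minutes=9.5, sim_minutes=0.63)
    print(f"minutes_per_sim={minutes_per_sim}: budget={budget} min, fits={ok}")
```

Under this assumption, a fixed startup delay of about 9.5 minutes uses up nearly the whole budget regardless of how quickly the simulations run, so a larger minutes_per_sim only buys headroom and does not address the delay itself.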

afontani commented 8 months ago

Discussing in the development meeting:

yingli-NREL commented 8 months ago

For jobs that finished successfully, it's pretty common for "Trimming buildstock.csv" to take around 8-10 minutes (14 of 21 jobs).

nmerket commented 8 months ago

> For jobs that finished successfully, it's pretty common for "Trimming buildstock.csv" to take around 8-10 minutes (14 of 21 jobs).

Yeah, that shouldn't take that long. Something weird is going on there.

nmerket commented 7 months ago

This might be fixed in #438.

rajeee commented 4 weeks ago

This is most likely related to the series of issues we dealt with on Kestrel before the latest system time. Closing this for now.