ReactionMechanismGenerator / AutoTST

AutoTST: A framework to perform automated transition state theory calculations
Other
32 stars 16 forks source link

Made submit and check_complete methods more robust #57

Closed nateharms closed 4 years ago

nateharms commented 4 years ago

There was an issue where our rate of checking if a job was complete was too frequent and SLURM would frequently return errors looking like the following:

slurm_load_jobs error: Socket timed out on send/recv operation

or

sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

After speaking with some people at Research Computing at Northeastern, we were told that our code was running squeue and sbatch too frequently and that we can check every minute (at minimum). This sets the time.sleep() to 90s so we avoid this issue. We also added changes to make sure that jobs are submitting by waiting for the queue to get smaller so our jobs don't overwhelm it.

codecov[bot] commented 4 years ago

Codecov Report

Merging #57 into master will increase coverage by 0.18%. The diff coverage is 68.13%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   62.73%   62.91%   +0.18%     
==========================================
  Files          27       27              
  Lines        4564     4635      +71     
==========================================
+ Hits         2863     2916      +53     
- Misses       1701     1719      +18
Impacted Files Coverage Δ
autotst/job/job_test.py 98.63% <100%> (+0.16%) :arrow_up:
autotst/job/job.py 41.99% <62.82%> (+3.09%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update dcebf50...792f17d. Read the comment docs.

nateharms commented 4 years ago

Okay, I'm thinking this is okay to merge this in even though the patch wasn't successful. We only changed 15 seconds to 90 seconds so nothing should have changed. Any complaints?

nateharms commented 4 years ago

Okay, I'm gonna do a rebase to fix the history and then add some tests to job.py to account for the decrease in coverage, but then this should be good to go.

nateharms commented 4 years ago

Okay, this is looking like all the checks pass and the coverage is good. I’m going to do a rebase to fix up the git history. But it should be ready for review soon.

nateharms commented 4 years ago

Hi there @davidfarinajr and @rwest, can one of y'all give this a review when you get a chance? Thanks!

rwest commented 4 years ago

It looks like more changed than is described in the pull request description. Problem with rebasing?

nateharms commented 4 years ago

@rwest sorry about that. This PR was originally just about a sleep issue but we broadened the scope of the PR after speaking with @davidfarinajr in person. The PR description now reflects these changes.