WIPACrepo / lta

Long Term Archive
MIT License

Request minimum and maximum time for slurm jobs at NERSC #258

Closed blinkdog closed 1 year ago

blinkdog commented 1 year ago

NERSC is 7 hours behind UTC:

icecubed@perlmutter:login07:~/lta/slurm-logs> date --utc
Fri 26 May 2023 10:38:12 PM UTC
icecubed@perlmutter:login07:~/lta/slurm-logs> date
Fri 26 May 2023 03:38:24 PM PDT

When running a nersc-mover (the LTA component that puts zip bundles on tape in HPSS), we started at 14:31 (local time) and the job was killed at 21:41 UTC, i.e. 14:41 local time:

2023-05-26 14:31:14,323 [MainThread] INFO  (component.py:93) - nersc_mover 'login28-pipe0-nersc-mover' is configured:
...
slurmstepd: error: *** JOB 9608169 ON login28 CANCELLED AT 2023-05-26T21:41:22 DUE TO TIME LIMIT ***

This means our job was cancelled after only 10 minutes. That is not going to work for processes that can take a long time, like checksumming a 100 GB bundle or copying 100 GB into the HPSS tape system.

Worse, an abrupt kill gives the LTA no chance to clean up. A claimed bundle isn't returned to the queue of unclaimed work; it sits claimed (so no other component will work on it) but never gets processed. We don't detect this kind of error until someone asks, 'Hey, why hasn't there been any progress on bundle X?'

So this PR introduces the --time and --time-min flags on our call to sbatch. By default it requests a minimum run time of 6 hours (--time-min) and a maximum run time of 12 hours (--time). While it's unlikely that a single bundle would require 6 hours, a single component could live for a long time if it continuously finds work to do.
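For concreteness, here is a minimal sketch of what the submission looks like with the new flags. The subprocess wrapper and the job script name are placeholders for illustration, not copied from the PR; only the --time-min and --time values reflect the defaults described above.

import subprocess

# Assumed wrapper around sbatch; "nersc-mover.sbatch" is a placeholder job
# script name, and the real LTA code may build this command differently.
subprocess.run(
    [
        "sbatch",
        "--time-min=06:00:00",  # minimum run time: 6 hours
        "--time=12:00:00",      # maximum (hard) run time: 12 hours
        "nersc-mover.sbatch",   # placeholder job script
    ],
    check=True,
)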

It may be necessary to make components aware of how long they've been alive, with a configurable lifetime, so a component stops claiming new work when it gets too close to the limit (see the sketch below). We'll see how it goes, and whether this is necessary in practice.
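If we do go that route, the shape of it might be something like this. The environment variable name, the 11 hour default, and the loop are assumptions for illustration, not existing LTA code.

import os
import time

# Hypothetical configurable lifetime (name and default are assumptions):
# stop claiming new work after 11 hours, safely inside the 12 hour --time limit.
LIFETIME_SECONDS = int(os.environ.get("LTA_COMPONENT_LIFETIME_SECONDS", str(11 * 60 * 60)))

START = time.monotonic()

def still_have_time() -> bool:
    """True while the component is within its configured lifetime."""
    return (time.monotonic() - START) < LIFETIME_SECONDS

# Check the lifetime before claiming each bundle, so the component never
# holds a claim it can't finish before Slurm kills the job.
while still_have_time():
    ...  # claim and process the next bundle (existing component logic)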