LLNL / ATS

ATS - Automated Testing System - is an open-source, Python-based tool for automating the running of tests of an application across a broad range of high performance computers.
BSD 3-Clause "New" or "Revised" License

Update --cutoff to over-ride per test job limits #104

Closed dawson6 closed 1 year ago

dawson6 commented 1 year ago

[Yesterday 4:56 PM] Burmark, Jason ats question, in the Ares tests we sometimes set a specific time for a problem that overrides the default time. I would like to be able to set a default time that would override the specific times if the default time was larger.

[Yesterday 4:59 PM] Burmark, Jason I'm running into an issue where I run tests with special debugging stuff enabled that slows them down. It's easy to increase the default time, but if a test has a specific time it's not easy to override that.

[7:25 AM] Dawson, Shawn A. Alrighty, this may be a good project for Zakharchanka, Mikhail.

I ran some tests and confirmed the order of operations for the --timelimit option.

1) timelimit on the test line over-rides 2) --timelimit on the command line, which over-rides 3) the default time limit.

[7:26 AM] Dawson, Shawn A. My understanding is that you want a command line option which will over-ride the 'per test' time limit. So if the per test setting is timelimit=30 (the time seems to just take a bare number at this time, which is in minutes), you could over-ride it on the command line to be 1 minute or whatever.

[7:27 AM] Dawson, Shawn A. I think that used to be the purpose of the --cutoff option (perhaps not being used at this time). The help states this for --cutoff and --timelimit:

[7:27 AM] Dawson, Shawn A. -t TIMELIMIT, --timelimit TIMELIMIT Set the TIMEOUT default time limit on each test. This may be over-ridden for specific tests. Jobs will TIMEOUT at this time. The value may be given as a digit followed by an s, m, or h to give the time in seconds, minutes (the default), or hours.

[7:27 AM] Dawson, Shawn A. --cutoff CUTOFF Set the HALTED halt time limit on each test. Over-rides job timelimit. All jobs will be HALTED at this time. The value may be given as a digit followed by an s, m, or h to give the time in seconds, minutes (the default), or hours. This value if given causes jobs to fail with status HALTED if they run this long and have not already timed out or finished.

[7:29 AM] Dawson, Shawn A. Seems like the --cutoff option (we could rename that perhaps, not sure if any projects are actually using it) could be repurposed to over-ride the time limits specified for each specific job.

[7:29 AM] Dawson, Shawn A. Does that sound about right, Burmark, Jason and Zakharchanka, Mikhail?

dawson6 commented 1 year ago

BTW I tested the current behavior with the mc project, what Jason was seeing.

ATS:(level=20, np=10, etc.)

I ran ats with the option --timelimit=5s

and observed that it timed out in 5 seconds.

Then I updated the ATS line to be:

ATS:(level=20, np=10, timelimit=1, etc.)

That did not timeout in 5 seconds, but it did at the 60 second (1 minute) mark.

The per-test option of 'timelimit=1' over-rode the command line option of '--timelimit=5s'.

Can we get the command line option --cutoff=5s to over-ride that per-test option, so that the test is killed at 5 s?
I believe this may leave tests in the HALTED rather than TIMEOUT state, but that may be just fine.

dawson6 commented 1 year ago

Testing the MR associated with this. Will comment on what I see here

Looked at the code: this MR revives the old 'cutoff' option to set the 'timelimit' option in configuration.py. Will comment more here with testing results. It further updates the setting of the 'timelimit' in tests.py such that if the cutoff time is specified on the command line, it will over-ride the 'timelimit' option on the per-test line. The logic in 'tests.py' looks like it will apply the time limits in this order of precedence.

1) --cutoff command line argument is highest priority
2) per test 'timelimit' is next priority
3) --timelimit command line argument is next priority
4) the ATS default timelimit of 29m is used if no other setting is specified.
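The order of precedence above can be sketched as a small function. This is an illustration only, not the actual logic in tests.py: the function name, the argument names, and the use of seconds as the unit are all assumptions made for the example.

```python
from typing import Optional

# ATS default of 29 minutes, as stated in the comment above (in seconds).
DEFAULT_TIMELIMIT_S = 29 * 60


def effective_timelimit(
    cutoff_s: Optional[int] = None,
    per_test_s: Optional[int] = None,
    cli_timelimit_s: Optional[int] = None,
) -> int:
    """Return the time limit (seconds) that wins under the proposed precedence.

    Hypothetical helper: the first setting that was actually given wins,
    checked in the order --cutoff, per-test timelimit, --timelimit.
    """
    for candidate in (cutoff_s, per_test_s, cli_timelimit_s):
        if candidate is not None:
            return candidate
    return DEFAULT_TIMELIMIT_S


# The mc example from earlier in this thread: per-test timelimit=1 (minute)
# with --timelimit=5s on the command line still runs for 60 seconds ...
print(effective_timelimit(per_test_s=60, cli_timelimit_s=5))
# ... while --cutoff=5s would stop the same test at 5 seconds.
print(effective_timelimit(cutoff_s=5, per_test_s=60))
```

Note that under this ordering --cutoff can raise a per-test limit as well as lower it, which is what the original request asked for.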

dawson6 commented 1 year ago

Reading the original ATS pdf documentation I see the following description of --cutoff.
We had disabled that over the years, but as we are reviving it, I think we should keep --cutoff as the name of the option. We should also find and update the documentation and the "ats --help" text, changing the references to HALTED to reflect that this update will put these tests in the TIMEOUT category as well.

--cutoff cutofftime This invokes a special mode in which no test is allowed to run longer than cutofftime, regardless of its actual timelimit option. Jobs that reach this threshold are treated as failures in the sense that any jobs depending upon them are not run; but they are given status HALTED rather than TIMEDOUT. The forms for giving the time are the same as for --timelimit.
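For reference, the time forms shared by --cutoff and --timelimit (per the help text quoted earlier: a number followed by s, m, or h, with minutes as the default) could be parsed roughly as follows. This is a sketch, not the parser ATS uses; the function name and the choice to return whole seconds are assumptions.

```python
import re

# Seconds per unit suffix; a missing suffix means minutes, per the help text.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_time_limit(value) -> int:
    """Convert a value such as '5s', '2m', '1h', '30', or 1 to whole seconds.

    Hypothetical helper for illustration; accepts strings or bare integers.
    """
    match = re.fullmatch(r"\s*(\d+)\s*([smh]?)\s*", str(value).lower())
    if match is None:
        raise ValueError(f"unrecognized time limit: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_SECONDS[unit or "m"]


print(parse_time_limit("5s"))  # seconds suffix
print(parse_time_limit(1))     # bare number, read as minutes
```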

dawson6 commented 1 year ago

I think that having 'HALTED' and TIMEDOUT as separate categories is not necessary and is more confusing than actually helpful.

The reference to HALTED with respect to 'cutoff' is in the file docs/source/ats.rst as well as in configuration.py, in the parser.add_option call which adds 'cutoff'. Both of these files need to be updated to reflect this change, which will put tests stopped by 'cutoff' into the TIMEDOUT category.

dawson6 commented 1 year ago

Testing with the Kripke test on toss3

### Run 'atslite1' using generated test.ats, which has no per-test timelimits on the test lines

atslite1 --cutoff "10s" gives the following, which seems to have cut off 7 tests that exceeded the 10s limit and then skipped the 7 follow-up tests.

FAILED:  0
TIMEOUT:  7 kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_1), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_11), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_39), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_105), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_115), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_129), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_139)
PASSED:   206
SKIPPED:  7
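
The matching TIMEOUT and SKIPPED counts are consistent with each stopped test having one follow-up test that is then not run. A minimal sketch of that propagation is below; the data layout, the function name, and the "PENDING" status are placeholders for illustration and are not taken from the ATS source.

```python
from collections import deque
from typing import Dict, List


def skip_dependents(
    statuses: Dict[str, str],
    dependents: Dict[str, List[str]],
    stopped_job: str,
) -> None:
    """Mark every job downstream of stopped_job as SKIPPED, in place."""
    queue = deque(dependents.get(stopped_job, []))
    while queue:
        job = queue.popleft()
        if statuses.get(job) == "SKIPPED":
            continue  # already handled via another path
        statuses[job] = "SKIPPED"
        # Jobs that depend on a skipped job cannot run either.
        queue.extend(dependents.get(job, []))


# Hypothetical example: 'run' hit the time limit, 'check' depends on it.
statuses = {"run": "TIMEOUT", "check": "PENDING", "unrelated": "PASSED"}
skip_dependents(statuses, {"run": ["check"]}, "run")
print(statuses)
```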
dawson6 commented 1 year ago

Run atslite1 --timelimit "10s"

results in the following, which seems reasonable as well. Whether 5 or 7 tests fail to complete in 10s may be dependent on the actual run time which may vary.


FAILED:  0
TIMEOUT:  5 kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_1), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_11), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_105), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_115), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_129)
PASSED:   210
SKIPPED:  5
dawson6 commented 1 year ago

atslite1 --timelimit "2s"

Could be reasonable.

FAILED:  0
TIMEOUT:  24 kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_1), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_3), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_5), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_7), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_39), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_91), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_105), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_115), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_119), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_123), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_129), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_135), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_139), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_153), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_159), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_163), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_171), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_177), 
kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_183), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_187), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_195), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_201), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_211), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_219)
PASSED:   172
SKIPPED:  24
dawson6 commented 1 year ago

atslite1 --cutoff "2s"

FAILED:  0
TIMEOUT:  21 kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_1), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_3), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_5), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_7), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_39), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_97), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_105), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_115), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_123), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_129), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_135), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_139), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_153), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_163), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_171), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_177), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_187), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_195), 
kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_201), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_211), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_219)
PASSED:   178
SKIPPED:  21
dawson6 commented 1 year ago

For these next tests, I am setting the following on each test case in the Kripke test.ats file: timelimit='2s'

atslite1 (i.e., no other setting of timelimit)

This seems reasonable.

FAILED:  0
TIMEOUT:  21 kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_1), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_3), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_5), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_7), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_39), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_91), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_105), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_115), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_123), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_129), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_135), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_139), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_153), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_163), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_177), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_183), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_187), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_195), 
kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_201), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_211), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_219)
PASSED:   178
SKIPPED:  21
dawson6 commented 1 year ago

Now, let's add --cutoff to see if we can over-ride the '2s' per test limit specified in the deck.

atslite1 test.ats --cutoff "30s"

FAILED:  0
PASSED:   220

yep, that looks good

dawson6 commented 1 year ago

Now let's test the --timelimit command line option, which should not over-ride the per test limits. We should still get 21 or so test timeouts.

atslite1 test.ats --timelimit "30s"


FAILED:  0
TIMEOUT:  22 kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_1), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_3), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_5), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_7), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_39), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_91), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_105), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_115), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_123), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_129), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_135), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_139), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_153), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_163), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_171), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_177), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_183), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_187), 
kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_195), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_201), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_211), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_219)
PASSED:   176
SKIPPED:  22

ok that looks good

dawson6 commented 1 year ago

Now, let's up the per test time limits specified in the test.ats file to 30s, which is long enough to let any test pass.

If I run

atslite1 test.ats --timelimit "2s"

all tests should still pass, as the per test setting of 30s has higher priority than the --timelimit command line option. Here is what I get

FAILED:  0
PASSED:   220

so that checks out

dawson6 commented 1 year ago

Now let's test with this:

atslite1 test.ats --cutoff "2s"

Which should result in 21 or so timeouts as that command line cutoff option will override the 30s on the test line itself.


FAILED:  0
TIMEOUT:  23 kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_1), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_3), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_5), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_7), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_39), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_91), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_105), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_115), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_119), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_123), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_129), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_135), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_139), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_153), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_163), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_171), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_177), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_183), 
kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_187), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_195), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_201), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_211), kripke(/usr/WS2/dawson/Git-ATS-GitHub-dawson/test/Kripke/kripke-v1.2.5-20e9ea9/build/bin/kripke.exe_219)
PASSED:   174
SKIPPED:  23
dawson6 commented 1 year ago

OK, those look good. Now, I need to test that we see the same thing when using a 'flux' or an 'lsf' install as well, but I think toss3 looks correct.

dawson6 commented 1 year ago

Comments from 'Teams' on this issue and the use of the -t option with flux, which leads to job failures rather than timeouts.

[7:21 AM] Dawson, Shawn A. So after letting it sink in overnight, and running with the three schedulers with verbose mode to see what they are doing, FLUX is the only one which actually gives the scheduler the time limit. We pass -t to flux, but there is no equivalent option being used (or available possibly) with slurm and lsf.

So I think in this case, when most folks are not yet on FLUX, we should comment out the setting of the "-t" argument passed to flux and let ATS kill the job rather than flux reporting a failure. Part of this is that it is not obvious from the error log that the flux job died due to timeout; rather, it just says something about credentials not being valid, which will be confusing to the user. It doesn't indicate that the job exceeded the time limit given to it, but flux yanks the credentials (whatever that means in flux speak) and the job dies. So for a user it is confusing as to why the job died.

I think it is better, at least at this point, to let ATS cancel the job and then list it as a TIMEOUT, which is more meaningful.

So let's try commenting out these lines in the fluxscheduled.py file and test out flux runs to see if they honor the --timelimit and --cutoff as expected.

    max_time = self.timelimit
    ret.append(f"-t{max_time}")

[7:23 AM] Dawson, Shawn A. Part of this is to keep things consistent between the three schedulers. When we deprecate slurm and lsf and only use flux in the future, we can revisit this to see if we want to use the flux time limits. <https://teams.microsoft.com/l/message/19:53a10cb3fbc745cfa70aa0c38010493b@thread.skype/1678890111138?tenantId=a722dec9-ae4e-4ae3-9d75-fd66e2680a63&groupId=07b41973-fc19-4954-88da-63aae39c8ca1&parentMessageId=1678722132238&teamName=ATS Testing System&channelName=General&createdTime=1678890111138&allowXTenantAccess=false>

dawson6 commented 1 year ago

@MishaZakharchanka

In testing with flux, I am seeing too many timeouts and I know the reason. This dropped out of my head when we were discussing this yesterday, but I recall now why we were passing the -t(time) limit option to flux whereas we were not with slurm and lsf.

Basically, flux is a better scheduler and we want to give flux all the jobs up front ideally, and let flux just handle them.

This is different from slurm or lsf, where ATS throttles the tests it gives to the scheduler. With slurm and lsf, ATS tracks the number of cores and nodes in use, and when they hit the max it stops and does not submit more until others have finished. Thus ATS can use the 'time submitted' as equivalent to the start time of the job and cancel jobs based on the cutoff or timelimits.
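
As a rough illustration of why throttling makes the submit time usable as the start time, here is a minimal sketch. The class and method names are invented for the example and do not correspond to the ATS source; only cores are tracked, and an injectable clock is used so the behavior can be checked without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    name: str
    cores: int
    limit_s: float
    submitted_at: float = 0.0


class ThrottledRunner:
    """Hypothetical stand-in for ATS-side bookkeeping under slurm or lsf."""

    def __init__(self, max_cores: int, clock: Callable[[], float] = time.monotonic):
        self.max_cores = max_cores
        self.clock = clock
        self.running: List[Job] = []

    def try_submit(self, job: Job) -> bool:
        # Only submit when the cores are free, so the job starts right away
        # and the submit time is a fair stand-in for the start time.
        in_use = sum(j.cores for j in self.running)
        if in_use + job.cores > self.max_cores:
            return False
        job.submitted_at = self.clock()
        self.running.append(job)
        return True

    def reap_timeouts(self) -> List[Job]:
        # Jobs whose elapsed time since submission reached their limit.
        now = self.clock()
        still_running, expired = [], []
        for job in self.running:
            target = expired if now - job.submitted_at >= job.limit_s else still_running
            target.append(job)
        self.running = still_running
        return expired
```

If every job is handed to the scheduler up front instead, `submitted_at` no longer approximates the start time, which is the problem described for flux in the next paragraphs.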

But ideally ATS should not be doing this, as that record keeping should be handled by the scheduler, and we want to remove all ATS book-keeping of the number of nodes or cores in use and just let the scheduler (flux) handle it. This means that, with flux, we cannot use the time we submitted the test as the actual start time, as the test may start much later than the submittal.

Hence we do need to rely on the -t(time) option being properly passed to flux so that flux ends the job when the timelimit is reached.

With current flux, this does mean that the jobs will be marked as FAILED, not TIMEOUT, as that is what ATS sees: the job fails when flux pulls the credentials at the time limit.

So I think we will just have to live with this for now. That is, when using flux and the timelimit is reached, the tests will go into the FAILED category, not the TIMEOUT category. We can do a follow-up request to try and differentiate timeout failures under flux from general failures, but for now we will just live with this and update the documentation noting this behavior under flux.

So, sorry for the yo-yo, but I guess we do need to put back in the -t(time) option in flux and be careful that when using flux we do not have ATS apply the timelimit (either from the deck, or from --cutoff or --timelimit) for if we do, we will be killing jobs too early.
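
The split described above might look something like the following. This is a sketch under assumptions, not the code in fluxscheduled.py: the function name and return shape are invented, and the "-t<seconds>s" form assumes the limit is held in seconds and that flux accepts a duration with an "s" suffix.

```python
from typing import List, Tuple


def launch_plan(scheduler: str, limit_s: int) -> Tuple[List[str], bool]:
    """Return (extra scheduler arguments, whether ATS enforces the limit itself).

    Hypothetical helper. 'limit_s' is the effective limit after --cutoff,
    the per-test timelimit, and --timelimit have been resolved.
    """
    if scheduler == "flux":
        # Flux may start the job long after submission, so the limit has to
        # travel with the job and ATS must not also time it from submission.
        return [f"-t{limit_s}s"], False
    # slurm / lsf: ATS throttles submission and times the job itself.
    return [], True


print(launch_plan("flux", 30))
print(launch_plan("slurm", 30))
```

With this split, a test that hits its limit under flux would still show up as FAILED rather than TIMEOUT, as noted above.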

We can discuss this further today.