Semi-lazy judging for jury submissions / submissions with expected result

meisterT commented 11 months ago

(Suggested by @RagnarGrootKoerkamp)

When setting up a contest, jury submissions in the problem ZIP files are imported with an expected result (coming either from the annotation or from the folder convention). The jury submissions are then judged, and one can run the judging verifier to see whether the expected results match the actual results.

Typically the jury then goes to the statistics page to check how far the accepted submissions are below the time limit and how far the time limit submissions are above it.

By default, lazy judging is enabled and all incorrect verdicts have equal judging priority, so we abort judging as soon as a run hits the time limit. This can be misleading. Imagine the following situation: the time limit is 5s, and on the first secret test case the TLE submission times out at 5.001s, so barely above the time limit. The remaining test cases are not judged, so the statistics overview page suggests that this submission is close to being accepted, although there might be a later test case where the submission is clearly above the time limit, e.g. takes 20s.

Now, one could argue that the jury should just disable lazy judging, or that we should do that automatically for them if there is an expected result. However, this approach has the drawback that if a solution is clearly above the time limit (say on the 2nd secret test case) and there are many test cases (which happens quite often nowadays), it would take "forever".

So you would likely want something semi-lazy in between the current laziness and judging everything (at least for submissions with an expected result). The idea would be to keep on judging even if we hit the "soft timelimit" for one test case until we hit the "hard timelimit" (including overshoot) at least once. This would give an acceptable compromise between more info and amount of time it takes to judge all jury submissions.
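To make the idea concrete, here is a minimal sketch. It assumes the per-test-case runtimes are already known (a real judge would measure them one by one) and that only timing matters. The names `judge_semi_lazy`, `soft_limit` and `hard_limit` are hypothetical and not part of DOMjudge.

```python
def judge_semi_lazy(runtimes, soft_limit, hard_limit):
    """Judge test cases in order, stopping only after a hard timeout.

    Hypothetical helper: a run that exceeds hard_limit is recorded as
    hard_limit, because the process would be killed at that point.
    """
    judged = []
    for runtime in runtimes:
        judged.append(min(runtime, hard_limit))
        if runtime >= hard_limit:
            # Hard timelimit hit at least once: no more information to gain.
            break
        # A run between soft_limit and hard_limit does not stop judging.
    verdict = "TLE" if any(t > soft_limit for t in judged) else "AC"
    return verdict, judged
```

With a 5s time limit and an assumed 10s hard limit, `judge_semi_lazy([1.2, 5.001, 3.0, 20.0, 2.0], 5.0, 10.0)` judges four test cases instead of two, and the recorded maximum runtime is 10s rather than 5.001s.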

This is likely non-trivial to implement. One might first consider introducing a new verdict for the "gray zone" above the time limit (and perhaps even another new verdict for the gray zone below the time limit) with a lower priority than the rest of the incorrect verdicts. Then one could remap from these new verdicts to accepted/TLE as appropriate. However, since we (at least currently) remap each judging_run result as it comes in (and we consider this important if verdicts have a non-uniform priority), we currently lose this information.
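As a rough illustration of the gray-zone verdicts, the sketch below classifies each run, combines the runs by priority, and remaps only at the very end so that the raw result is kept. Everything here is an assumption made for the example: the verdict names, the priorities, and a symmetric `margin` factor of 2 around the time limit.

```python
# Hypothetical verdict names and priorities; the highest priority wins.
PRIORITY = {"correct": 0, "gray-below": 1, "gray-above": 2, "timelimit": 3}
# Final remapping of the gray-zone verdicts to the existing ones.
REMAP = {"gray-below": "correct", "gray-above": "timelimit"}


def classify_run(runtime, time_limit, margin=2.0):
    """Classify a single run by its runtime relative to the time limit."""
    if runtime > time_limit * margin:
        return "timelimit"   # clearly too slow
    if runtime > time_limit:
        return "gray-above"  # too slow, but within the margin
    if runtime > time_limit / margin:
        return "gray-below"  # fast enough, but within the margin
    return "correct"         # clearly fast enough


def final_verdict(runtimes, time_limit, margin=2.0):
    """Return (raw, remapped); the raw verdict keeps the gray-zone info."""
    raw = max(
        (classify_run(r, time_limit, margin) for r in runtimes),
        key=PRIORITY.__getitem__,
        default="correct",
    )
    return raw, REMAP.get(raw, raw)
```

For a 5s time limit, `final_verdict([1.0, 5.001, 3.0], 5.0)` gives `("gray-above", "timelimit")`: the remapped verdict is the usual TLE, while the raw one still shows that the submission never clearly exceeded the limit.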

Ideas welcome on how to implement this elegantly!

RagnarGrootKoerkamp commented 11 months ago

So for context, we're currently also having a related discussion regarding the problem format, especially about how to precisely specify which submissions are allowed / not allowed to be in this grey zone.

One idea Thore had is to introduce AC- and TLE- results indicating a submission ends up having a max runtime in the grey zone.

In BAPCtools I currently track a separate bool `timeout_expired` for TLE results and simply don't abort judging when `timeout_expired` is false. (But this doesn't check the margin for AC submissions yet.)

thorehusfeldt commented 11 months ago

Let me just add that AC!, AC, TLE, and TLE! would be alternative names for AC, AC-, TLE-, TLE that may be easier to swallow for some people.

eldering commented 2 months ago

What about the following: if a submission has expected verdict(s) associated with it, then we automatically switch to "semi-lazy" mode if lazy mode was enabled. In this semi-lazy mode we continue judging test cases until at least one of the required verdicts is seen. Moreover, if the verdict is TLE, then we only consider that seen when it is a TLE that hit the hard timelimit, as @meisterT suggested already.
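A small sketch of that stopping rule, purely for illustration and not the actual `JudgehostController` logic. The `Run` class, the function name, and the verdict strings are assumptions. It also assumes `expected` contains only incorrect verdicts, since an expected "correct" needs all test cases to be judged anyway.

```python
from dataclasses import dataclass


@dataclass
class Run:
    verdict: str    # illustrative names, e.g. "correct", "timelimit"
    runtime: float  # seconds


def runs_to_judge(runs, expected, hard_limit):
    """Return how many runs semi-lazy mode judges before it stops."""
    for count, run in enumerate(runs, start=1):
        if run.verdict == "correct" or run.verdict not in expected:
            continue  # not one of the required verdicts: keep judging
        if run.verdict == "timelimit" and run.runtime < hard_limit:
            continue  # soft TLE only: does not count as "seen" yet
        return count  # a required verdict has been seen: stop here
    return len(runs)  # never seen: every test case gets judged
```

For example, with runs of 1.0s (correct), 5.001s (timelimit), 3.0s (correct), 10.0s (timelimit) and 2.0s (correct), an expected verdict of `{"timelimit"}` and a hard limit of 10s, judging stops after the fourth run instead of the second.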

I don't think this is very difficult to implement, as it would all be around https://github.com/DOMjudge/domjudge/blob/main/webapp/src/Controller/API/JudgehostController.php#L1001

DOMjudge / domjudge

Semi-lazy judging for jury submissions / submissions with expected result #2266