Closed thorehusfeldt closed 9 months ago
@thorehusfeldt Could you elaborate on and give examples of the syntax of the .yaml
file where these are specified?
This seems to only talk about the "expectations" themselves, and not (at least not very explicitly) how or what they are applied to.
Specifically:
```
#expectation: {
  permitted_verdicts?: [...#verdict] // only these verdicts may appear
  required_verdicts?:  [...#verdict] // at least one of these verdicts must appear
  permitted_timings?:  [...#timing]  // only these timings may appear
  required_timings?:   [...#timing]  // at least one of these timings must appear
}
```
This is treating the time check exactly like the verdict check, i.e. literally splitting the old (v0.2?) verdicts into new (v0.3) (verdict, timing) pairs. E.g. `AC-` is now (`AC`, `fast enough`) and `TLE` is (`TLE`, `too slow with margin`).
I don't think this is the right approach (and I would claim that it is the same approach as v0.2, since it's just a trivial split). I don't think we should be setting explicit requirements on the timing states in this fashion. Could you give some examples of where this would be either needed, or at least very useful?
We were also talking about this regarding DOMjudge specifically. @meisterT wrote his thoughts after our discussion at: https://github.com/DOMjudge/domjudge/issues/2266

Interestingly, he also comes up with this idea of new verdicts for when submissions are inside the `safety_margin`.
There are quite a few things going on here. We want to specify:

1. What do the `submissions/{accepted,wrong_answer,time_limit_exceeded,run_time_error}` directories mean exactly in the current spec? Do implementations agree?
2. What do we want these directories to mean, if different? Are we OK breaking backwards compatibility?
3. What should the syntax of `submissions/expectations.yaml` be?

Let me try to go over this.

**What does `submissions/*` mean in the current situation?**

**Spec:** On example submissions and problem timing, the spec gives:
| directory | permitted results | required results | used for timing |
|---|---|---|---|
| accepted | AC | AC | yes |
| wrong_answer | AC, WA | WA | no |
| time_limit_exceeded | AC, WA, TLE | TLE | yes |
| run_time_error | AC, WA, TLE, RTE | RTE | no |
Timing is based on the submissions in `accepted` (see the "used for timing" column): their runtimes must not fall in `[time_limit / ac_to_time_limit, time_limit * time_limit_to_tle)`. In the table below, that corresponds to disallowing `AC-` and `TLE-` as required verdict in that row. Here is the same table using Thore's `AC-`/`TLE-` notation:
| directory | permitted | required |
|---|---|---|
| accepted | AC, ~~AC-~~ | AC (n.b. not AC-) |
| wrong_answer | AC, AC-, WA | WA |
| time_limit_exceeded | AC, AC-, WA, TLE-, TLE | TLE (n.b. not TLE-) |
| run_time_error | AC, AC-, WA, TLE-, TLE, RTE | RTE |
**BAPCtools** implements problem timing/safety margins, and gives `RTE` and `TLE` the same 'priority'. The spec above requires that `TLE` submissions are run on all cases to ensure they do not `RTE`, but that would 'take forever' in practice.

**Problemtools** TODO: does it check that `time_limit_exceeded` submissions never give `RTE`???

**DOMjudge** treats the non-`accepted` directories wrongly for submissions that give more than one non-`AC` result.

**What do we want `submissions/*` to mean?**

Observations:

- DOMjudge does not work well with 'mixed' results (i.e. multiple non-`AC` results).
- Because of this, BAPC/NWERC jury use `submissions/rejected` when more than one non-`AC` result may occur.

Ragnar prefers being more strict:
Make `time_limit_exceeded` and `run_time_error` not allow other results, i.e.:

| directory | permitted results | required results | used for timing |
|---|---|---|---|
| time_limit_exceeded | AC, ~~WA~~, TLE | TLE | yes |
| run_time_error | AC, ~~WA~~, ~~TLE~~, RTE | RTE | no |
Should `wrong_answer/` and `run_time_error/` be used for time limits? (Equivalent: is it OK for `wrong_answer` and `run_time_error` submissions to be inside the safety margin?) For simplicity Ragnar prefers the current status of not using them for time limits.
Let's say we would use `wrong_answer/*` and `run_time_error/*` for time limit determination: would we then take the maximum over passing testcases, or over all testcases? I would say we should include all cases. (Note that I think this would require additional `WA-` and `RTE-` verdicts to specify?)
Thore already suggested `does_not_terminate` and `not_accepted` (what we in NWERC call `rejected` so far). We also use `mixed`, which is for submissions that may or may not time out or crash (but do not `WA`).
| directory | permitted results | required results | used for timing |
|---|---|---|---|
| does_not_terminate | AC, TLE, RTE | TLE, RTE | no |
| not_accepted (rejected?) | AC, WA, TLE, RTE | WA, TLE, RTE | no |
| mixed | AC, TLE, RTE | AC, TLE, RTE | no |
(Q1) question: do we allow regex and/or glob in the submission path and testcase path? The below assumes glob only.
**used_for_timing**

My proposal is to add a `used_for_timing: bool` field. Not sure whether `true` or `false` should be the default. I used `false` as a default for now.
- `true`: the runtime must not be in `[time_limit / ac_to_time_limit, time_limit * time_limit_to_tle)`. This can either be verified for a given time limit, or be used as additional constraints (inequalities) that the to-be-determined time limit must satisfy.
- `false`: no additional restriction on the time limit.

(Q2) question: What do you think of `used_for_timing: bool`? Is it clear and intuitive enough? Is it expressive enough? E.g. it does not allow setting a lower multiplier for some unreasonably slow (but still should-be-accepted) submissions.
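As a sketch of what the `true` case means operationally (the margin parameter names and default values here are assumptions for illustration, not settled values):

```python
# Sketch of the used_for_timing: true check: the maximum runtime over all
# testcases must fall outside the half-open window
# [time_limit / ac_to_time_limit, time_limit * time_limit_to_tle).
# Margin defaults below are illustrative only.

def outside_margins(times, time_limit, ac_to_time_limit=2.0, time_limit_to_tle=1.5):
    t = max(times)
    return t < time_limit / ac_to_time_limit or t >= time_limit * time_limit_to_tle

# With time_limit=2.0 the forbidden window is [1.0, 3.0):
assert outside_margins([0.4, 0.9], 2.0)      # comfortably fast
assert not outside_margins([0.4, 1.2], 2.0)  # inside the safety window
assert outside_margins([3.5], 2.0)           # clearly TLE, also acceptable
```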
**Example YAML** This is only to show syntax. Specific values can be discussed. The implied default `submissions/expectations.yaml` is something like:
```yaml
# Or just `accepted`? or `accepted/`?
accepted/*:
  permitted: ["AC"]
  required: []
  used_for_timing: true
wrong_answer/*:
  permitted: ["AC", "WA"]
  required: ["WA"]
time_limit_exceeded/*:
  permitted: ["AC", "TLE"]
  required: ["TLE"]
  used_for_timing: true
run_time_error/*:
  permitted: ["AC", "RTE"]
  required: ["RTE"]
does_not_terminate/*:
  permitted: ["AC", "TLE", "RTE"]
  required: ["TLE", "RTE"]
rejected/*:
  permitted: ["AC", "WA", "TLE", "RTE"] # NOTE: This could be omitted.
  required: ["WA", "TLE", "RTE"]
mixed/*:
  permitted: ["AC", "TLE", "RTE"]
  required: ["AC", "TLE", "RTE"]
```
or using abbreviations the below would be equivalent:

```yaml
accepted/*: accepted
wrong_answer/*: wrong answer
time_limit_exceeded/*: time limit exceeded
run_time_error/*: runtime excepted # TODO: Naming of this
does_not_terminate/*: does not terminate
rejected/*: rejected
mixed/*: mixed
```
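A tool could expand these abbreviations into the explicit form above. A hypothetical sketch (only a subset of the abbreviations is shown, and the dict layout is mine):

```python
# Sketch: expanding abbreviation strings into full expectation dicts,
# mirroring the explicit default YAML above. Mapping contents are assumptions.

ABBREVIATIONS = {
    "accepted":            {"permitted": ["AC"], "required": []},
    "wrong answer":        {"permitted": ["AC", "WA"], "required": ["WA"]},
    "time limit exceeded": {"permitted": ["AC", "TLE"], "required": ["TLE"]},
    "does not terminate":  {"permitted": ["AC", "TLE", "RTE"], "required": ["TLE", "RTE"]},
    "rejected":            {"permitted": ["AC", "WA", "TLE", "RTE"], "required": ["WA", "TLE", "RTE"]},
}

def expand(expectation):
    """Replace an abbreviation string by its full dict; pass dicts through unchanged."""
    if isinstance(expectation, str):
        return ABBREVIATIONS[expectation]
    return expectation

assert expand("wrong answer")["required"] == ["WA"]
assert expand({"permitted": ["AC"]}) == {"permitted": ["AC"]}
```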
Further, it is possible to specify verdicts for testgroups and make assertions on the score of a submission and a `judgemessage` that must appear for at least one testcase:
```yaml
# AC that is not used for timing purposes and may be close to the timelimit.
sluggish_ac/*:
  permitted: ["AC"]
  required: []
  # used_for_timing: false # (default value)

wrong_answer/caught_by_samples.py:
  sample:
    required: ["WA", "TLE", "RTE"]

# Or equivalently:
wrong_answer/caught_by_samples.py:
  sample: rejected

# Or more precisely
wrong_answer/wa_on_samples.py:
  sample: wrong answer
# Note that this also still checks that the submission is WA (and not RTE/TLE)
# on all other cases because of its directory.

# Maybe we want to also have 'hidden' edge cases only exposed in secret data:
wrong_answer/wa_on_secret.py:
  sample: accepted

# TODO: Exclude a single submission from time limit determination:
# (How do we do overriding/inheriting when multiple globs/regexes match a submission?)
accepted/kinda_slow.kt:
  permitted: ["AC"]
  required: []
  used_for_timing: false
```
**YAML integration** for @niemela, even though this is not the current focus. (We're still at *what* to specify, not *how* to specify it.)

The simplest YAML file you could have would be one that merely specifies the expected behaviour of submissions under `/submissions`. It would look like this:
```yaml
accepted: accepted
wrong_answer: wrong answer
time_limit_exceeded: time limit exceeded
runtime_exception: run time exception
```
One could say that this `expectations.yaml` file is implied when no `submissions/expectations.yaml` file is present. In particular, a problem development system can completely ignore all YAML, as long as it treats the submissions in `/submissions` as specified above, because those expectations are part of the spec. Of course, an easy way to implement a problem development system that is consistent with the specification would be to implement the logic of the whole expectations framework as specified above; this gives compatibility for free and full generality.
Aside from pinning down the spec, the definitions above also allow bespoke directories such as the following for “all the sluggish java submissions”
```yaml
other_ac:
  permitted: ["AC"]
```
or more fine-grained specifications, such as Thore promising that his `WA` greedy submission does pass on the samples:

```yaml
wrong_answer/thore.cpp:
  sample:
    permitted_verdicts: ["AC"]
```
You can go arbitrarily crazy with this and specify the behaviour of individual submissions on individual testcases (or groups):
```yaml
other_ac/Thore.java:
  secret/huge/04-connected-random:
    score: [36]
```
The matching rules (do we prefix-match, glob, regex, wildcard?) are currently in flux and downstream from us finally agreeing on how to specify `#expectations`.
@RagnarGrootKoerkamp: thanks for the table. I think the second table should not have `AC-` under `permitted`. The meaning of `accepted` is that all verdicts are `AC` (fast enough with margin).
@RagnarGrootKoerkamp: I played around with `used_for_timing` or something like that myself (`apply_with_margins`). I am worried about the non-monotone/asymmetric aspect, so to me it feels as if we invite a can of hurt downstream by identifying

1. _all_ timings must satisfy the timing constraint _with a margin_
2. _some_ timing must satisfy _another_ timing constraint with a _different_ margin

as a single concept. If you're kinda confident that no terrible things will happen, I can live with the single boolean field.
> 2. What do we want these directories to mean, if different? Are we OK breaking backwards compatibility?

Yes. But we should only do it if it has a significant benefit.

> - I'm ignoring the edge case of zero testcases for `accepted`.

What is that edge case? Zero `accepted` is not allowed. (Or is that the edge case?)
> - DOMjudge does not work well with 'mixed' results (i.e. multiple non-`AC` results).
> - Because of this, BAPC/NWERC jury use `submissions/rejected` when more than one non-`AC` result may occur.

Another reason to use this is that you just don't care how it fails, only that it fails. I think it makes sense to have such a directory.
> Make `time_limit_exceeded` and `run_time_error` not allow other results

I think I agree. Especially if we also have `rejected`.
> Thore already suggested `does_not_terminate` and `not_accepted` (what we in NWERC call `rejected` so far). We also use `mixed`, which is for submissions that may or may not time out or crash (but do not `WA`).

If I understand the canonical use of `does_not_terminate` correctly (i.e. "this is a brute force solution that is always correct, but often too slow") then I think the name is very bad. The defining property is not that it does not terminate; it's that if it finishes before we kill it, or it crashes, it is correct.
The CLICS name for `not_accepted` is `rejected`. When we have established terminology I think we should use it.

I don't love the name `mixed` either. Why is it needed? Why would `rejected` not be enough?
> (Q1) question: do we allow regex and/or glob in the submission path and testcase path? The below assumes glob only.

Glob only sounds good to me. Is there an exact definition/specification that we could use? I couldn't find one immediately.
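For reference, one readily available glob definition is Python's `fnmatch` module; this is only an illustration of its semantics (note the caveat that `*` matches across `/`), not a settled choice for the spec:

```python
# Illustration of fnmatch-style glob semantics: *, ?, and [...] are supported,
# but '*' is NOT '/'-aware, unlike typical shell globbing.
from fnmatch import fnmatchcase

assert fnmatchcase("accepted/sol1.cc", "accepted/*")
assert fnmatchcase("wrong_answer/thore.cpp", "wrong_answer/thore.*")
# Caveat: '*' happily crosses directory separators here:
assert fnmatchcase("secret/group1/case1", "secret/*")
```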
> (Q2) question: What do you think of `used_for_timing: bool`? Is it clear and intuitive enough? Is it expressive enough? E.g. it does not allow setting a lower multiplier for some unreasonably slow (but still should-be-accepted) submissions.

I think that `used_for_timing` covers everything we need in practice. In theory it would be better to instead be allowed to optionally override the `ac_margin` or `tle_margin` on a per-submission basis. I'm torn.
I'm not really sure if it makes sense to have `used_for_timing` on non-`AC` or `TLE` verdicts.
> or using abbreviations the below would be equivalent:

I'm not convinced that the abbreviations are needed.
> ```yaml
> # TODO: Exclude a single submission from time limit determination:
> # (How do we do overriding/inheriting when multiple globs/regexes match a submission?)
> accepted/kinda_slow.kt:
>   permitted: ["AC"]
>   required: []
>   used_for_timing: false
> ```
Up until we introduced `used_for_timing`, my thinking was that every permitted/required/score creates a rule, and rules can't be removed. E.g. this would always fail:

```yaml
accepted/*:
  permitted: ["WA"]
```

Because `permitted: ["AC"]` is already a rule for accepted, and now it's required that all verdicts must be in `["AC"]` but also in `["WA"]`.
With `used_for_timing` that would not be useful. That setting would have to override, and I would suggest it does so in yaml-file order, i.e. rules are applied from top to bottom. This means that the example a bit above could be written as:

```yaml
accepted/kinda_slow.kt:
  used_for_timing: false
```
> The simplest YAML file you could have would be one that merely specifies the expected behaviour of submissions under `/submissions`. It would look like this:

The simplest file would be the empty file, which would then not have to exist.
I would suggest that the semantics should be that the default directory behavior is assumed to be added to the top of the file. I.e. the `accepted/kinda_slow.kt` example just above would be a complete file that has all the default behavior, except that `accepted/kinda_slow.kt` is not used for timing.
> I am worried about the non-monotone/asymmetric aspect, so to me it feels as if we invite a can of hurt downstream by identifying
>
> 1. _all_ timings must satisfy the timing constraint _with a margin_
> 2. _some_ timing must satisfy _another_ timing constraint with a _different_ margin
>
> as a single concept. If you're kinda confident that no terrible things will happen, I can live with the single boolean field.
@thorehusfeldt I don't understand what you're saying here, so I can't be confident that it won't happen. Could you explain more?
> > - I'm ignoring the edge case of zero testcases for `accepted`.
> >
> > What is that edge case? Zero `accepted` is not allowed. (Or is that the edge case?)

As I understand it, the edge case where there are no testcases within a test group.
> > - I'm ignoring the edge case of zero testcases for `accepted`.
> >
> > What is that edge case? Zero `accepted` is not allowed. (Or is that the edge case?)
>
> As I understand it, the edge case where there are no testcases within a test group.

Ah, I realize now that I read "submission" and not "testcases". This makes much more sense. Thanks :).
That said... is that an edge case? Every "permitted" rule will always succeed on an empty set (and every non-empty "required" rule will always fail). So any accepted submission will always succeed on the empty set of testcases. That feels right, right? (Useless of course, but right.)
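This empty-set behaviour follows directly from reading `permitted` as a universal quantifier and `required` as an existential one:

```python
# "permitted" is a for-all check and "required" a there-exists check, so on
# zero testcases the former vacuously passes and the latter fails.

results = []  # no testcases were run
permitted = {"AC"}
required = {"AC"}

assert all(r in permitted for r in results)     # vacuously true
assert not any(r in required for r in results)  # an existential over nothing fails
```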
Great progress. Let me try to pin down `used_for_timing` precisely.

First, let me change the name to `with_margins`. This is preferable because it indicates a constraint rather than a procedure. (It is also slightly less open to misunderstandings. All submissions have "something to do with timing", so `used_for_timing` really means `optionally_used_for_determining_the_time_limit`.)
In particular, there could be traditions where the `time_limit` is set explicitly in `problem.yaml` and then the role of `with_margins` is to check that all the `AC` submissions pass their timing constraint with margin. (Instead of setting "the timing".)
With that out of the way (recall that an expectation is something a submission has on a set of `#result`s; call this set `R`):

The expectation `#expectation.with_margin` means that

- every `result` in `R` with verdict `AC` satisfies the stricter constraint `result.time < time_limit / ac_margin`.
- if `R` contains a result with verdict `TLE`, then some `result` in `R` with verdict `TLE` satisfies the stricter constraint `result.time >= time_limit * tle_margin`.
This is the best I can do; pay particular attention to the "if" part of item 2. (If we don't have that, all hell breaks loose, as far as I can tell.) Please improve.
—
The alternative is to separate the concepts "all `AC` must run with margin" and "there is at least one `TLE` that is rejected with margin", as in version 0.3 above (or simplifications).
> (Useless of course, but right).

These situations happen all the time, for instance when you run the `-d` flag in `problemtools`. That allows you to restrict data to, say, "every in-file that contains the substring `huge`", and such restrictions routinely lead to empty testgroups during development. (For instance, a `-d`-filtered `sample` is often empty.) `problemtools` crashes in such situations, as reported in this issue: https://github.com/Kattis/problemtools/issues/189 (and subsequently fixed).
> The expectation `#expectation.with_margin` means that
>
> - every `result` in `R` with verdict `AC` satisfies the stricter constraint `result.time < time_limit / ac_margin`.
> - if `R` contains a result with verdict `TLE`, then some `result` in `R` with verdict `TLE` satisfies the stricter constraint `result.time >= time_limit * tle_margin`.
I don't think the definition should be over whether `R` contains `AC`, but rather whether there is a rule containing `AC`. Specifically, a submission in `time_limit_exceeded` would probably have some `AC` verdicts (the samples, if nothing else), and with this definition all such verdicts are required to satisfy the stricter lower constraint. That is not what we want.
> > (Useless of course, but right).
>
> These situations happen all the time, for instance when you run the `-d` flag in `problemtools`. That allows you to restrict data to, say, "every in-file that contains the substring `huge`", and such restrictions routinely lead to empty testgroups during development. (For instance, a `-d`-filtered `sample` is often empty.) `problemtools` crashes in such situations, as reported in Kattis/problemtools#189 (and subsequently fixed).

I think this means that you agree that it is correct?
> with this definition all such verdicts are required to satisfy the stricter lower constraint. That is not what we want.

You correctly observe that a symmetric concept like "we want to use this for time-limit-y stuff" is ill suited to express two intentions ("we want `AC` to mean this-and-that for *all* `AC` results" and "we want `TLE` to mean this-and-that for *some* `TLE` results"). Hence my squeamishness in adopting this idea.

However, I try to be constructive and help by at least attempting to write down what `used_for_timing` could mean. It would be more helpful if one of you (who reject my proposals for more fine-grained `AC` verdicts, or for `#timing`) would clearly specify the intended semantics.
> As I understand it, the edge case where there are no testcases within a test group.

For `accepted` submissions, we say `permitted: ["AC"]`, but if you also put `required: ["AC"]` that fails when there are 0 testcases. So we should probably stick with `required: []` for this case.
> Fredrik: I don't love the name `mixed` either. Why is it needed? Why would `rejected` not be enough?

`rejected` is for submissions that must fail in some way. `mixed` is for submissions that may or may not fail, e.g. depending on hardware speed. I think we also occasionally use this currently for submissions that AC in practice, but should not be used for time limit determination.
> Fredrik: Q1: Glob only sounds good to me.

That's also my weak preference.
> Fredrik: Q2: I think that `used_for_timing` covers everything we need in practice.

Yes, this is also my feeling.
> Fredrik: With `used_for_timing` that (appending rules) would not be useful. That setting would have to override, and I would suggest it does so in yaml-file order. I.e. rules are applied from top to bottom.

I agree this is nice, but sadly YAML dictionaries are explicitly unordered :/ We'll have to think about this more.
> Thore: First, let me change the name to `with_margins`.

I think that works for me. Give me a day to get used to it :)
> Thore: It would be more helpful if one of you would clearly specify the intended semantics.

Ok, so the way I initially think about `with_margins: true` is that the maximum runtime over all current testcases (i.e. with any result) must not be in `[time_limit / ac_to_time_limit, time_limit * time_limit_to_tle)`. ("Current" is explained later.)
Note:

- This only makes a statement on the maximum runtime over all testcases. It says nothing on a per-testcase/per-result basis.
- It is independent of `permitted` and/or `required`.
- It is not a per-result `permitted`-style check. It's more similar to `required` in that it requires all results to determine whether the check passes.

Examples:

- If `permitted: ["AC"]`, then we get two checks: that each testcase is `AC` (i.e. `<= time_limit`), and that the max over all testcases is `< time_limit / ac_to_time_limit` (or `>= time_limit * time_limit_to_tle`, but assuming all are `AC` that's impossible).
- If `permitted: ["AC", "WA"]`, we do require that the `WA` cases finish in `< time_limit / ac_to_time_limit`.
- If `permitted: ["AC", "TLE"]`, we just test that it's far enough away from the time limit, and either side is OK. (I don't think this is useful in practice, but it has a well defined meaning nevertheless.)

Let's think about how this interacts with `with_margins` for testgroups. Take for example:
```yaml
partially_accepted/ragnar.py:
  sample:
    permitted: ["AC"]
    with_margins: true
  secret:
    required: ["TLE"]
    with_margins: true
```
This should (to me at least) mean that on the samples the max running time is below the safety window, and on the secret cases the max running time is above the safety window. So what we really want is that wherever `with_margins: true` appears, the max running time over the testcases to which it is scoped is outside the safety margins.

(This doesn't have to be part of the spec, which only talks about running/validating the full set of testcases. But still it may be useful to think about how we would handle this case.)
Suppose we run only a subset `T` of testcases.

- `permitted` constraints are easy to handle, since they are on a per-testcase/per-result basis.
- `required` and `with_margins: true` say something about a set of testcases `S`. Some cases:
  - If `S` and `T` are disjoint, skip this constraint.
  - If `S` is a subset of `T`, we can evaluate/apply the constraint as usual.
  - Otherwise, only some cases of `S` are in `T`:
    - `required: [X]`: if `X` already appears, all is good. But if not, you can't give a hard error, since `X` may occur in an element of `S` that is not in `T`. A `log` or `warning` may still be appropriate though.
    - `with_margins: true`: similarly, you can't give a hard error since not all cases are tested. But you can `log`/`warn` when not satisfied by the current subset of `S` that is in `T`.
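These cases can be sketched for the `required` constraint as a three-valued check (the function shape and return values are mine, purely illustrative):

```python
# Sketch: evaluating a required-verdict rule scoped to testcase set S when only
# a subset T of all testcases was actually run.

def evaluate_required(S, T, verdicts, X):
    """verdicts maps testcase -> verdict for cases in T.
    Returns 'pass', 'fail', or 'unknown'."""
    ran = S & T
    if not ran:
        return "unknown"              # disjoint: skip the constraint
    if any(verdicts[c] == X for c in ran):
        return "pass"                 # X already appeared
    if S <= T:
        return "fail"                 # all of S was run and X never appeared
    return "unknown"                  # X may still occur on the unrun part of S

v = {"a": "AC", "b": "TLE"}
assert evaluate_required({"a", "b"}, {"a", "b"}, v, "TLE") == "pass"
assert evaluate_required({"a"}, {"a", "b"}, v, "TLE") == "fail"
assert evaluate_required({"a", "c"}, {"a", "b"}, v, "TLE") == "unknown"
```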
> As I understand it, the edge case where there are no testcases within a test group.
>
> For `accepted` submissions, we say `permitted: ["AC"]`, but if you also put `required: ["AC"]` that fails when there are 0 testcases. So we should probably stick with `required: []` for this case.
Agreed. (But also, leaving out `required` should mean `required: []`, just like leaving out `permitted` should mean `permitted: [<all of them>]`.)
> `mixed` is for submissions that may or may not fail, e.g. depending on hardware speed. I think we also occasionally use this currently for submissions that AC in practice, but should not be used for time limit determination.

Right, you defined it as:

```yaml
mixed/*:
  permitted: ["AC", "TLE", "RTE"]
  required: ["AC", "TLE", "RTE"]
```
This seems to me to be the same use case (but ever so slightly different definition) as what @thorehusfeldt calls `does_not_terminate`, and what I would like to call something like `correct_but_slow` (name is certainly WIP).
> I agree this is nice, but sadly YAML dictionaries are explicitly unordered :/ We'll have to think about this more.

Right, I knew that... :(. That's very annoying though.
> - This only makes a statement on the maximum runtime over all testcases. It says nothing on a per-testcase/per-result basis.

FTR, I think this is the right approach for `used_for_timing`.
> `with_margins` for testgroups:

Do we actually need to allow `used_for_timing` a.k.a. `with_margins` on testgroups as well? Do you have some concrete examples for when that granularity would be very useful?
> This [`mixed`] seems to me to be the same use case as `does_not_terminate`.

They are quite similar indeed. The only distinction is that `mixed` has `required: [AC, TLE, RTE]`, so the submission may also end up being `AC` everywhere, whereas `does_not_terminate` has `required: [TLE, RTE]`, meaning it must fail.
But I agree that `mixed` isn't very specific and could possibly be specified by the user on a case-by-case basis.
> `does_not_terminate`, and what I would like to call something like `correct_but_slow`

Yes, I was thinking of something similar.
> `with_margins` on testgroups

If you have 3 testgroups, A, B, and C, I can imagine some submission that should pass A (with margin), fail B (with margin), and fail C (also with margin). The running time could explode on C, but you would still want that on B it also fails with margin.
I'm not sure if this 100% makes sense with scoring problems though. There you may want to ensure that the score for group B is 0, and that it times out for each testcase in group B? One way we could do that is to distinguish `secret/group_b` (which applies to the entire group) from `secret/group_b/*` (which applies individually to each testcase in the group). Surely this could work, but I'm not sure it's a good idea.
In the `AC-`, `TLE-` proposal you could do

```yaml
submission.py:
  secret/group_b:
    permitted: [TLE] # Note: not TLE-
```

whereas now this becomes

```yaml
submission.py:
  secret/group_b/*:
    permitted: [TLE]
    required: [TLE]
    with_margins: true
```
Here is a case where `with_margins` is useful for testgroups. Think of an IOI-style question, where `data/secret/group1` is what-IOI-calls-a-subtask. Then I want to specify

```yaml
mixed/brute-force.py:
  sample: accepted
  secret/group1: accepted
  secret/group2: time limit exceeded
```

Here it is crucial that `time limit exceeded` means (using proposal 0.3 for clarity):

```yaml
required_verdicts: ["TLE"]
required_timings: ["too slow with margin"]
```
How about this?

```yaml
# file glob specifying some set of files. (1)
<file glob>:
  # the calculated verdict for the submission (2)
  verdict: [<verdicts>]
  # score of the submission (3)
  score: <number or range>
  # permitted and required testcase verdicts (4)
  permitted: [<verdicts>]
  required: [<verdicts>]
  # these submissions are used for timing (5) (6)
  used_for_timing: <bool>
  # testcase glob specifying some set of testcases or groups. (7)
  <testcase glob>:
    # score on the testcase or group (8)
    score: <number or range>
    # permitted and required testcase verdicts (9)
    permitted: [<verdicts>]
    required: [<verdicts>]
    # the result on this testcase or group of this submission is used for timing (10)
    used_for_timing: <bool>
  # multiple sets of testcases or groups can be specified
  <testcase glob>:
    permitted: [<verdicts>]
    required: [<verdicts>]
    score: <number or range>
    used_for_timing: <bool>
```
Verdicts are "AC", "WA", "RTE", "TLE".

(1): Since a program is "always represented by a single file or directory", we can't (or shouldn't?) allow submissions that are not exactly one directory level down. I.e. `accepted/sol1.cc` is ok, but `accepted/fast/sol1.cc` and `sol1.cc` are not.
(2): Set of verdicts. The verdict of the submission must be one of them. Note that this does not directly say anything about verdicts of testcases, and allows judging to be short-circuited if that's how the problem is specified.
(3): Score for the submission. If a number, it must be the exact score given; if a range, it must be within the range (i.e. >= the lower end and <= the upper end). Note that a non-0 score implies that the verdict is "AC". How should matching on score behave when the verdict is not "AC"? Should the score always be 0 in that case, should it be allowed to be 0, or should it be "null"?
(4): All verdicts of testcases must be in the permitted set; some verdict of a testcase must be in the required set. Default for permitted is `["AC", "WA", "RTE", "TLE"]`. Maybe "restricted" is a better name than "permitted"? Not sure. In earlier examples we have used `required: []` to mean "nothing is required", but the definition we have used so far actually means that `required: []` is an automatic fail. To say "nothing is required" we would have to say `required: ["AC", "WA", "RTE", "TLE"]`. Right?
(5): That a submission is used for timing means that it must be possible to set a time limit such that the time for the submission (i.e. the maximum of the times for its testcases) is not within any of the margins from the time limit. Note that if the time limit is explicitly set, then that is the only time limit that can be used. Defaults to false. I think this is the same as what @RagnarGrootKoerkamp said?
(6): Alternatively, we could have some kind of `time_limit_margin` that overrides the problem-wide setting (in `problem.limits`) for these files. The logic would then be "the same" as under (5). Setting the `time_limit_margin` to 0 would have the same effect as setting `used_for_timing` to false. We need two margins, so `time_limit_margin` should be a tuple?
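As a sketch of the feasibility reading in (5): finding a time limit such that every timing submission's maximum runtime falls outside the margins. The margin names, defaults, and brute-force candidate search below are all illustrative, not proposed spec behaviour:

```python
# Sketch: a time limit is feasible if every used_for_timing submission's max
# runtime lies outside [time_limit / ac_margin, time_limit * tle_margin).
# Margin defaults and candidate list are made up for illustration.

def feasible_limits(timing_maxes, candidates, ac_margin=2.0, tle_margin=1.5):
    def ok(tl):
        return all(t < tl / ac_margin or t >= tl * tle_margin for t in timing_maxes)
    return [tl for tl in candidates if ok(tl)]

# One AC submission peaking at 0.9s and one TLE submission peaking at 6.0s:
limits = feasible_limits([0.9, 6.0], [1.0, 2.0, 3.0, 4.0])
assert 2.0 in limits       # 0.9 < 1.0 and 6.0 >= 3.0: both outside the margins
assert 1.0 not in limits   # 0.9 falls inside [0.5, 1.5)
```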
(7): If these are just any glob, then `permitted` or `score` would technically be allowed as testcase globs. That said, any useful glob would have to start with `sample`, `secret`, or `*`. Should we require that?
(8): Score for the testcase/group(s) specified by the glob.
(9): Similar to (4), but applies separately to every testcase/group specified by the glob.
(10): Similar to (5) and (6) but applied to a single testcase/group.
Note that (5), (6) vs (10) implies that the following are different:

```yaml
accepted/sol1.cc:
  used_for_timing: true
```

```yaml
accepted/sol1.cc:
  */*:
    used_for_timing: true
```
@niemela nice writeup!
(1): I'm OK with allowing none and arbitrary nesting.
(2): I think Thore explicitly steered away from assertions on the final verdict, and given `permitted` and `required`, I think we don't need additional assertions on `verdict` and we can just leave this out.
(3): I'd say non-AC should simply have `score = 0`.
(4): Good point re. `required: []` being wrong. Absence of `required:` (and possibly `required: null`) should be used instead.
(5): yes sgtm
(6): My feeling is that for now it's better not to go with explicit overridable margins yet, since one can always adjust the problem-wide margin if things are wrong. One thing that we could do using more granular margins is e.g. setting a 5x `ac_to_time_limit` margin for C++ and a 1.1x `ac_to_time_limit` margin for Python. But I kinda think just setting a fixed margin over the slowest AC is flexible enough. (Having 5x over C++ I would only do if I didn't have a Python submission in the first place, and the large buffer is just to compensate for the lack of slow AC submissions. It's better to actually write those slow submissions and then set a lower margin. If Python becomes faster in the future, then the time limit goes down, which is good.)
(7) Yes, it should start with `sample` or `secret`. I suppose allowing it to start with a literal `*` also works.
(8,9,10) Hmmmm, the grouping/globbing is quite tricky:

- `secret` and `secret/` mean: treat the directory as a single group.
- `secret/*` means: treat each file individually.
- Does `secret/*` apply to `secret/group1/*`? To the entire group or to each testcase inside it?
- What about `secret/*-large*`?

I think this creates some extra complexity. We should think about whether we really need this. If we do want it, it's probably better to store this alongside the existing fields rather than overloading the glob field.
Thore's original distinction here between `required_timings` and `permitted_timings` makes sense: we want a "for all: `with_margins`" and a "there exists: `with_margins`". So maybe we can have `with_margins: true` for the max runtime of the selected cases and an `each_with_margins: true` to apply to each selected testcase individually. Oh, or we can even do `with_margins: "each" | true`.
(11): Looking at Thore's last post above, I do think abbreviations are quite nice, especially when used with testgroups. That raises the question: how to specify an abbreviation for the full submission, in combination with additional constraints on testgroups? You can't do both `submission.py: wrong answer` and `submission.py: sample: accepted` at the same time. Maybe we need an "`abbreviation`" field for this?
(11): Looking at Thore's last post above, I do think abbreviations are quite nice, especially when used with testgroups. That raises the question: How to specify an abbreviation for the full submission, in combination with additional contraints on testgroups? You can't do both submission.py: wrong answer and submission.py: sample: accepted at the same time.
You can; my proposal is supposed to exactly make this possible. For a realistic example, consider
```yaml
thore.*: wrong answer
*.py:
  sample: accepted
```
The submission `thore.py` satisfies both rules, so both rules apply. In particular, the set of all testcase results must include `WA`, and the set of `sample` testcase results can only include `AC`. This is well defined (because all “my” rules have obvious compositionalities.)
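For concreteness, here is a minimal Python sketch of that composition rule, assuming fnmatch-style glob semantics (an assumption on my part; the rule bodies are abbreviated to labels since only the matching behaviour is illustrated):

```python
from fnmatch import fnmatch

# Hypothetical rule set keyed by file glob; bodies abbreviated to labels.
rules = {
    "thore.*": "wrong answer",
    "*.py": "sample: accepted",
}

def applicable(submission):
    # A submission must satisfy every rule whose glob matches its name.
    return [body for glob, body in rules.items() if fnmatch(submission, glob)]
```

So `applicable("thore.py")` yields both rules, and the submission must satisfy their conjunction, while `applicable("other.py")` yields only the `*.py` rule.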
(8,9,10) Hmmmm, the grouping/globbing is quite tricky:
Only tricky for scoring problems, and then it gets really tricky indeed. But only some of us care about scoring problems, so I’ve decided to put very little emphasis on that.
That a submission is used for timing means that it must be possible to set a time limit such that the time for the submission (i.e. the maximum of the times for testcases) is not within any of the margins from the time limit. Note that if the time limit is explicitly set, then that is the only time limit that can be set. Defaults to false. I think this is the same as what @RagnarGrootKoerkamp said?
Very cool. Does this really work (if so: great)? Let me try to rephrase. Again, `R` is a set of results.

If `with_margins` is true then `max(r.time)` over all `r in R` is either `< timelimit/ac_margin` or `>= timelimit*tle_margin`.

Is that what we mean?
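A small Python sketch of that condition, with the names `ac_margin` and `tle_margin` assumed from the discussion above:

```python
def satisfies_with_margins(times, time_limit, ac_margin, tle_margin):
    # The max runtime over the selected results must be safely below the
    # time limit or safely above it -- never inside either margin.
    t = max(times)
    return t < time_limit / ac_margin or t >= time_limit * tle_margin
```

With `time_limit=2.0`, `ac_margin=2`, and `tle_margin=1.5`, the forbidden zone is `1.0 <= t < 3.0`.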
For a realistic example, consider
```yaml
thore.*: wrong answer
*.py:
  sample: accepted
```
What if we want to apply both of these rules only to thore.py
? In that case some kind of workaround like
```yaml
thore.py: wrong answer
thore.py*:
  sample: accepted
```
may be needed. (But I'm happy to accept this as a niche case that's not very pretty.)
If `with_margins` is true then `max(r.time)` over all `r in R` is either `< timelimit/ac_margin` or `>= timelimit*tle_margin`. Is that what we mean?
Yes :)
We (@Tagl @RagnarGrootKoerkamp @thorehusfeldt and myself) had a fairly long and somewhat productive (IMO) call to discuss the details. Here are some things I noted from that meeting. Not in order of importance, only numbered for reference:
1)
We agree that submissions must always be one directory level down, since otherwise we can't really tell a directory full of submissions apart from a directory that is a submission. I.e. these are all acceptable submissions: `accepted/sol1.py`, `accepted/sol2/{header.h, main.c}`, and these are not: `sol3.cc`, `accepted/very_fast/sol4.cc`.
2)
The rules for combining `permitted` and `required` are clear. Each one is a rule, and they all have to be fulfilled. (This is equivalent to saying that `permitted` combines to the intersection, but I don't think there is a similar set theory rule for `required`?)
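A quick Python sketch of the intersection observation, assuming `permitted` means "every testcase verdict must be in the set":

```python
def ok(verdicts, permitted):
    # `permitted` means: every testcase verdict must be in the set.
    return all(v in permitted for v in verdicts)

# Applying two permitted rules separately is the same as applying their
# intersection once.
P1, P2 = {"AC", "WA"}, {"AC", "TLE"}
for verdicts in (["AC"], ["AC", "WA"], ["AC", "TLE"], ["WA", "TLE"]):
    assert (ok(verdicts, P1) and ok(verdicts, P2)) == ok(verdicts, P1 & P2)
```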
3)
The rules for combining settings of `used_for_timing` are as of yet unclear. Combining them the same way as (2) would not be useful, since doing that with two instances of the same value does nothing, and different values would always fail. Clearly there needs to be some overriding. Doing it in file order seems reasonable but is against YAML's definition of maps (they are explicitly unordered). Toward the end of the call (so we have not thought this through yet) it was suggested that, given (1) (i.e. that all submissions are of the form `<directory>/<submission>`), we could say that fileglobs that target submissions always override fileglobs that target directories, and that targeting either with contradicting values is an error.
Examples:
```yaml
accepted:
  used_for_timing: true
accepted/sol1.cc:
  used_for_timing: false
# this is ok, everything in accepted except accepted/sol1.cc is used for timing.
---
*/*.py:
  used_for_timing: false
accepted/really_fast.py:
  used_for_timing: true
# not ok, accepted/really_fast.py is given contradicting values.
```
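A Python sketch of that resolution rule. The "glob containing a `/` targets a submission" heuristic is my own assumption, relying on point (1) above:

```python
from fnmatch import fnmatch

def resolve_used_for_timing(submission, rules):
    # rules: list of (glob, value); submission is "<directory>/<submission>".
    # Globs containing "/" target submissions and override directory globs;
    # contradicting values at the same level are an error.
    directory = submission.split("/")[0]
    sub_vals = {v for g, v in rules if "/" in g and fnmatch(submission, g)}
    dir_vals = {v for g, v in rules if "/" not in g and fnmatch(directory, g)}
    for vals in (sub_vals, dir_vals):
        if len(vals) > 1:
            raise ValueError(f"contradicting values for {submission}")
        if vals:
            return vals.pop()
    return False  # assumed default

rules = [("accepted", True), ("accepted/sol1.cc", False)]
assert resolve_used_for_timing("accepted/sol1.cc", rules) is False
assert resolve_used_for_timing("accepted/sol2.cc", rules) is True
```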
4)
We agreed that setting a tuple of time_limit_margin
is more powerful, kinda nice, but probably overkill. used_for_timing
is simpler and more directly what one would want to do in most cases anyway.
5)
`permitted` and `required`, when set on files, always refer to the leaves of the test data group tree structure, i.e. all of the testcases. This implies that `permitted` (when not set to all values) requires a validation to run submissions on all testcases; there can be no short circuit. `required`, OTOH, can be short circuited as soon as a required testcase verdict has been found. Using either `permitted` or `required` could force the running of testcases on a submission that would not be run during "normal" judging (if it is short circuited). A (pair of) longer, more correct, but... probably too long names would be `permitted_testcase_verdicts` and `required_testcase_verdicts`.
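That asymmetry can be sketched in Python (assumed semantics):

```python
def check_permitted(verdicts, permitted):
    # Needs the verdict of every testcase: no short circuit possible.
    return all(v in permitted for v in verdicts)

def check_required(verdict_stream, required):
    # Can stop as soon as one required verdict has been seen.
    for v in verdict_stream:
        if v in required:
            return True
    return False
```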
6)
`verdict`, when set on files, refers to the top level verdict of the submission. This implies it can be short circuited in the same way that "normal" judging can.
7)
We disagreed quite a bit on what the "testcase globs" could refer to: just testcases, or also test groups. I.e. does `secret/group1` refer to the test data group of that name, or all the test cases in that group?
8) We disagreed quite a bit on what the semantics of rules on "testcase globs" would be. Applying the rules individually on all test cases (or groups?) would be consistent with how the filename globs work, but applying the rules once to the set of all test cases matched would be more powerful.
9) Taking 7 & 8 together, maybe we should have separate ways of targeting testcases and test groups? (I think that's more complication than it's worth).
I talked some more with @Tagl; he came up with this idea for making the testcase matching just different enough that we could use the more powerful semantics (discussed under 8 above).
Instead of (something like) this:
```yaml
<file glob>:
  verdict: [<verdicts>]
  score: <number or range>
  permitted: [<verdicts>]
  required: [<verdicts>]
  used_for_timing: <bool>
  <testcase glob>:
    score: <number or range>
    permitted: [<verdicts>]
    required: [<verdicts>]
    used_for_timing: <bool>
  <testcase glob>:
    permitted: [<verdicts>]
    required: [<verdicts>]
    score: <number or range>
    used_for_timing: <bool>
```
we could do:
```yaml
<file glob>:
  verdict: [<verdicts>]
  score: <number or range>
  permitted: [<verdicts>]
  required: [<verdicts>]
  used_for_timing: <bool>
  testcases:
    - matches: <testcase glob>
      score: <number or range>
      permitted: [<verdicts>]
      required: [<verdicts>]
      used_for_timing: <bool>
    - matches: <testcase glob>
      permitted: [<verdicts>]
      required: [<verdicts>]
      score: <number or range>
      used_for_timing: <bool>
```
Thoughts?
We (@Tagl @RagnarGrootKoerkamp @thorehusfeldt and myself) had a fairly long and somewhat productive (IMO) call to discuss the details. Here are some things I noted from that meeting. Not in order of importance, only numbered for reference:
1. Shouldn't `accepted/sol2/{header.h, main.c}` always be written as `accepted/sol2/` instead (with the last `/` optional)?
2. For `required` it's the union, right? `required = S` with a set of testcase verdicts `T` means: `∀ s in S: ∃ t in T: s == t`, so if you have two required sets `S_1` and `S_2`, then the logical and of the two expressions is equivalent to `∀ s in (S_1 ∪ S_2): ∃ t in T: s == t`, I think.
5. This should say `permitted` can not be short circuited, but I guess that's just a typo.
6. Not sure the verdict on submission level belongs here. Did we make verdict aggregation to submission level part of the problem format specification?
7. (also point 8, I guess) I'm not sure I understand, but isn't it easy to let `secret/group1` refer to the whole group and `secret/group1/*` to each test case individually?

A lot of progress was made!
- Shouldn't `accepted/sol2/{header.h, main.c}` always be written as `accepted/sol2/` instead (with the last `/` optional)?
Yes, it should absolutely be written (in the YAML file) as `accepted/sol2`. I listed them as `accepted/sol2/{header.h, main.c}` to make it clear that I was referring to a submission consisting of multiple files in a directory.
2. For `required` it's the union, right? `required = S` with a set of testcase verdicts `T` means: `∀ s in S: ∃ t in T: s == t`, so if you have two required sets `S_1` and `S_2`, then the logical and of the two expressions is equivalent to `∀ s in (S_1 ∪ S_2): ∃ t in T: s == t`, I think.
No, I don't think so? `required` means that at least one of the verdicts given needs to be the result of some test case, not that all of the verdicts given need to be. At least that's what Thore's specification (see the original message in this thread) says. @thorehusfeldt can you clarify?
5. This should say `permitted` can not be short circuited, but I guess that's just a typo.
Yes, thanks for catching that. I have fixed the typo in the comment above.
6. Not sure the verdict on submission level belongs here. Did we make verdict aggregation to submission level part of the problem format specification?
Well, this was apparently a point of a lot of divergence. I would argue that "verdict aggregation to submission level" was always part of the format, and furthermore that it absolutely, positively must be. If we don't even agree on what the correct judgement is, it's not much of a standard, is it? OTOH, there is a weaker form of agreement we could aim for, where we must agree on accepted vs rejected, but are allowed to possibly reject differently (because that matters a lot less).
7. (also point 8, I guess) I'm not sure I understand, but isn't it easy to let `secret/group1` refer to the whole group and `secret/group1/*` to each test case individually?
I think so, yes. If I understand the counterpoint correctly, the testcase rules are supposed to match on sets of testcases, and never groups. With that view, `secret/group1` becomes "the testcases of group1", which is the same set of testcases as `secret/group1/*`.
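Under that view, a hypothetical matcher could simply normalize a group path to its testcases, for instance:

```python
from fnmatch import fnmatch

# Hypothetical flat list of testcase paths.
testcases = ["secret/group1/001", "secret/group1/002", "secret/group2/001"]

def matched(glob):
    # If the glob names a group rather than any testcase, normalize it
    # to <glob>/* so it denotes the group's set of testcases.
    if not any(fnmatch(tc, glob) for tc in testcases):
        glob = glob.rstrip("/") + "/*"
    return {tc for tc in testcases if fnmatch(tc, glob)}
```

With this normalization, `matched("secret/group1")` and `matched("secret/group1/*")` denote the same set.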
A lot of progress was made!
Yes. :+1:
The most dramatic change that came out of the meeting yesterday – nullifying three months of exchanges in various channels, the consensus of the August meeting in Lund, and all my concrete, thorough, and complete proposals since August – is that we discovered that there is no consensus about the following claim:
expectations are constraints on sets of testcase-verdicts, judgemessages and testcase scores.
Instead, expectations are now something much richer and also express constraints on aggregated verdicts. (These are themselves defined in terms of aggregation rules for sequences of testcase-verdicts.)
Thus, a `testcase-result` now also needs to include the name of the testcase (or some other total order of the `test-case verdicts`) in order to define `#expectations`. This would make it much easier to connect it with `testdata.yaml` and allow `expectations.score` to be clearly defined. So I’m moderately optimistic.
However, these changes were suggested 38 minutes before the meeting yesterday, and I haven’t had as much time (compared to the three months we’ve discussed proposals 0.1, 0.2, and 0.3) to think them through.
I will try to pin down what I think we mean by this and re-emerge with expectations 0.4.
required means that at least one of the verdicts given needs to be the result of some test case,
How can that be hard to find out? I quote myself from expectations 0.3 at the top of this page:
```cue
#expectation: {
	permitted_verdicts?: [...#verdict] // only these verdicts may appear
	required_verdicts?: [...#verdict]  // at least one of these verdicts must appear
	....
}
```
BTW, these fields will now probably be renamed to required_testcase_verdicts
and permitted_testcase_verdicts
, in order to distinguish them from the new permitted_aggregated_verdict
, which has very different semantics.
How does this proposal relate to https://github.com/Kattis/problem-package-format/issues/112?
It is the third or fourth iteration.
Does this mean the other issue is obsolete and can be closed? Or does this add on top of the other version?
expectations are constraints on sets of testcase-verdicts, judgemessages and testcase scores.
Instead, expectations are now something much richer and also express constraints on aggregated verdicts. (These are themselves defined in terms of aggregation rules for sequences of testcase-verdicts.)
This claim confuses me. Aggregated verdicts are themselves "constraints on sets of testcase-verdicts", so constraints on testcase-verdicts and aggregated verdicts are still constraints on sets of testcase-verdicts?
Or is the key insight here the difference between sets of testcase-verdicts versus sequences of testcase-verdicts?
@thorehusfeldt:
How can that be hard to find out? I quote myself from expecations 0.3 at the top of this page: [...]
Yes, the section you quote is what I based my understanding of.
The source of the confusion is the suggestion that there is some set-theoretic combination of `required`. Is there?
You say:
This is well defined (because all “my” rules have obvious compositionalities.)
How do multiple `required` rules compose? I can't find the description of that?
Does this mean the other issue is obsolete and can be closed?
Yes, I have closed it.
Or does this add on top of the other version?
In the sense that all our discussions build on top of all our earlier discussions and insights, it kinda does "add on top". That said, closing a ticket does not delete it, we can still read and refer to it. (And even reopen if we feel the need).
Aggregated verdicts are themselves "constraints on sets of testcase-verdicts"
They are not; the important word here is “set”.
Aggregated verdicts are constraints on totally ordered sequences of testcase-verdicts. (In particular, they are defined in terms of which non-AC
verdict comes first.)
And aggregate scores are something even richer: they are constraints on trees of verdicts, modified by the aggregation rules of testdata.yaml
.
Sets are not totally ordered sequences are not partial orders. Note for instance the very first paragraph in my own introduction from yesterday:
We want to specify the expected behaviour of submissions on collections of testcases. (Collections could be singletons, sets, lists (so they’re ordered), or trees, optionally with testdata.yaml applying to internal nodes. Not important here.) [https://github.com/thorehusfeldt/problem-package-format/wiki]
This was literally my opening sentence yesterday, and I highlighted the choice of the word “collection” (to encompass all three cases) verbally. You may be underestimating my ambition for precision. :-)
They are not; the important word here is “set”. Aggregated verdicts are constraints on totally ordered sequences of testcase-verdicts. (In particular, they are defined in terms of which non-`AC` verdict comes first.)
I would maybe say that aggregated verdicts are constraints on sets of (testcase, verdict) pairs. Testcases themselves are, of course, ordered.
This was literally my opening sentence yesterday, and I highlighted the choice of the word “collection” (to encompass all three cases) verbally. You may be underestimating my ambition for precision. :-)
So you did say that you wanted to "specify the expected behaviour on [...] lists [and] trees", but now that we do, that is "nullifying three months of exchanges..."? I'm actually very confused.
How do multiple `required` rules compose? I can't find the description of that?
A submission must satisfy all the expectations that apply to it. (For instance, all the expectations that match – by whatever rule we can agree on – its filename. Or are specified in the source code. Or hardcoded into the specification. Or… it’s not important.)
So (by whatever rule) assume submission `S` is expected to satisfy expectations `E` and `F`. Then your expectation-checker could, for instance, validate `S` against `E` and then validate `S` against `F`, and only accept if both validations were successful.
This would be equivalent to collecting all `required` fields in `E` and all `required` fields in `F` and computing their union; as well as collecting all `permitted` fields in `E` and all `permitted` fields in `F` and computing their intersection. It’s not important which of these two strategies the checker uses. They give the same result (which, I argue, is also the intuitive semantics.)
This would be equivalent to collecting all `required` fields in `E` and all `required` fields in `F` and computing their union; as well as collecting all `permitted` fields in `E` and all `permitted` fields in `F` and computing their intersection. It’s not important which of these two strategies the checker uses. They give the same result (which, I argue, is also the intuitive semantics.)
No, I believe that's incorrect.
Let's say expectation E is `required(WA, AC)` and F is `required(TLE, RTE)`. Fulfilling both of these means that "at least one of [WA or AC] must appear" and "at least one of [TLE or RTE] must appear". In this example this requires a minimum of two different things to appear. The union of `WA, AC` and `TLE, RTE` is `WA, AC, TLE, RTE`, and fulfilling `required(WA, AC, TLE, RTE)` means "at least one of [WA, AC, TLE, or RTE] must appear". In this example this requires only one thing to appear, so clearly these two things are not the same.
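The counterexample in executable form, with `required` read as "at least one of these verdicts must appear":

```python
def satisfies_required(verdicts, required):
    # "at least one of these verdicts must appear"
    return any(v in required for v in verdicts)

E = {"WA", "AC"}
F = {"TLE", "RTE"}
verdicts = ["AC", "AC"]  # only accepted results appear

both = satisfies_required(verdicts, E) and satisfies_required(verdicts, F)
union = satisfies_required(verdicts, E | F)
assert union and not both  # the union accepts what the conjunction rejects
```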
What am I missing?
You're right, I'm wrong. Thank you.
My first semantics is the intention: A submission must satisfy all expectations that apply to it.
Closing this. Discussion continues in #137 .
This is version 0.6 of the expectations proposal.
Goals
Primary Design Goals
- Precisely specify the meaning of “verdict” and “expectation”
- Define the semantics for setting time limit within “safety” margins from judge submissions
- Provide clear semantics for expectations of author submissions placed in subdirectories of `/submissions/*`
Secondary Design Goals
- Allow bespoke expectation profiles beyond "accepted", "wrong answer", etc.
- Expectations for judge messages (“on some testcase, the submission produces `guess out of bounds`”) so that we can specify full test coverage for custom output validators
- Expectations for scores (“this submission gets at most 65 points on all testcases”)
- Allow expectations to be specified for test groups like `sample` or `secret/group1` or even `secret/03-petersen_graph`
- Provide a schema for expectations specified in a separate `expectations.yaml` file

Constraints
Out of scope:
Givens
Relevant metadata is given in `problem.yaml`, specified as follows:

For readability, the present document introduces aliases to the metadata specified in `#problem`:

Proposal: Verdicts and timings
Result
For precision, we define the result of running a submission on a testcase for a single-pass problem.
Decisions made here, this is RfC2.
- We use here the term “result” (not “outcome” or “verdict”).
- `time!` is not optional. Every result has a `time`, even those that are aborted or crashed. We promise that time is never compared to anything above `tle_margin * time_limit`, so any value above that is the same.
- Careful distinction between successful termination (of the process) and successful output validation; note the required field `output_validated!`: every submission that `terminated_successfully` has its output validated. Note that this means “the output validator accepted” (it returned `42`).
- Scores are nonnegative.
- The judge message is called `message`. Alternative: `judge_message`.
- Nothing else is part of the result. (Memory use, team message, programming language.) Maybe it should.

Is this good enough for multipass problems as well, including interactive problems? (Honest question.) What needs to be understood is which execution times are cumulative, for instance. This is out of scope for me, but worth thinking about.
Verdict
The `#verdict` is a shorthand of the `#result`. There are four verdicts:

`CE`, `MLE`, or `JE` are not verdicts. (Making them verdicts requires a richer `#result`. No author submissions are ever expected to produce `CE` or `JE`, and `MLE` is very hard to specify.)

What verdicts mean
We can specify the semantics of each `#verdict` quite readably in CUE:

Timing

Like `#verdict`, the `#timing` is a discrete summary of the result:

This discretises the value of `result.time` in terms of the three boundaries provided by `problem.limit` in the obvious fashion; we can specify this precisely in CUE to be clear about which inequalities are strict:

Proposal: Expectations
The role of expectations is to

- during problem development, ensure expected behaviour of author submissions on testdata as the problem definition, testdata, and author submissions are changed or added
- define which author submissions are used to set the time limit
Expectations defined
An expectation (for a submission) is defined for a set of results, primarily in terms of verdicts, timings, messages, and score ranges:
A set of results satisfies an expectation if

- for each result, the result's verdict/timing is contained in `permitted_verdicts`/`timings`
- at least one of the verdicts/timings in `required_verdicts`/`timings` appears among the verdicts of the set of results
- the `message` appears (as a substring) in at least one `result.message` among the verdicts of the set of results
- every `result.score` is inside `expectation.range`
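These rules could be sketched as follows (field names are assumed to mirror the prose; timings are checked analogously to verdicts and are omitted for brevity; `results` is a list of dicts with `verdict`, `message`, and `score` keys):

```python
def satisfies(results, exp):
    # Each present field of the expectation is an independent constraint.
    if "permitted_verdicts" in exp and \
            not all(r["verdict"] in exp["permitted_verdicts"] for r in results):
        return False
    if "required_verdicts" in exp and \
            not any(r["verdict"] in exp["required_verdicts"] for r in results):
        return False
    if "message" in exp and \
            not any(exp["message"] in r["message"] for r in results):
        return False
    if "range" in exp:
        lo, hi = exp["range"]
        if not all(lo <= r["score"] <= hi for r in results):
            return False
    return True
```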
Common abbreviations
Typically a problem author will not use the fine-grained `#expectations` struct defined above, but instead use a common abbreviation:

The above definitions are meant to define the expectations for submissions in `/submissions/accepted` etc. Here are two additional, useful abbreviations: