Harping on about the bespoke syntax, the suggestion is orthogonal to the rest of the proposal if we just view it as a shorthand.
mixed/th.py: not wrong
is just a shorthand for
mixed/th.py:
  allowed: [AC, TLE, RTE]
and
mixed/th.py: time out
is a shorthand for
mixed/th.py:
  allowed: [AC, TLE]
  required: TLE
This sounds useful to me (and easy to implement and describe).
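To see how mechanical the expansion would be, here is a minimal Python sketch (the table holds just the two examples above; all names are hypothetical, not part of the proposal):

# Each shorthand is sugar for an explicit map of "allowed"/"required" verdicts.
SHORTHANDS = {
    "not wrong": {"allowed": ["AC", "TLE", "RTE"]},
    "time out": {"allowed": ["AC", "TLE"], "required": "TLE"},
}

def expand(expectation):
    # Replace a shorthand string by its explicit map; pass maps through unchanged.
    if isinstance(expectation, str):
        return SHORTHANDS[expectation]
    return expectation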
For scoring, I definitely like the idea of full, or more generally, ranges supporting percentages.
I think the cases where I would use something other than 100% and 0% are rare, but I would use it, for example, in tasks which score heuristic algorithms. The algorithms could even be non-deterministic, which makes exact score expectations unreliable.
For judgemessage.txt we would have to support regex or something like it, not just pure string matching. For example, the contents may include formatted messages describing the index which caused the wrong answer verdict.
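A sketch of what such a check could look like in Python (the function and its parameters are illustrative, not part of the proposal):

import re

def judge_message_matches(expected, judgemessage, use_regex=False):
    # Plain matching is a case-insensitive substring test; the regex variant
    # would cover formatted messages like "wrong answer at index 17".
    if use_regex:
        return re.search(expected, judgemessage, re.IGNORECASE) is not None
    return expected.lower() in judgemessage.lower()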
Sidenote: in the section "Required and allowed", should accepted be allowed?

After some more feedback, here's another iteration (0.3.0).
First, some examples:
# Simple examples for some common cases
a.py: accepted # AC on all cases
b.py: wrong answer # at least one WA, otherwise AC
c.py: time limit exceeded # at least one TLE, otherwise AC
d.py: runtime exception # at least one RTE, otherwise AC
e.py: does not terminate # at least one RTE or TLE, but no WA
f.py: not accepted # at least one RTE, TLE, or WA
g.py: full score # gets max_score
# Other common cases can be easily added to the specification,
# tell me if one comes to mind.
#
# Abbreviations are just shorthands for richer maps
# of "required" and "allowed" keys.
#
# For instance, these are the same:
th.py: accepted
---
th.py:
  allowed: AC
  required: AC
---
# A submission failed by the output validator on some testcase
# These are the same:
wrong.py: wrong answer
---
wrong.py:
  allowed: [WA, AC]
  required: WA
---
wrong.py:
  allowed: # alternative yaml syntax for list of strings
    - WA
    - AC
  required: WA
---
# Specify that the submission fails, but passes the samples.
# These are the same, using the same abbreviations as
# above for "accepted" and "wrong answer"
wrong.py:
  sample: accepted
  secret: wrong answer
---
wrong.py:
  sample:
    allowed: AC
    required: AC
  secret:
    allowed: [AC, WA]
    required: WA
# Specification for subgroups works "all the way down",
# though it's seldom needed:
funky.cpp:
  secret:
    subgroups:
      huge_instances:
        subgroups:
          disconnected_graph:
            allowed: [RTE, TLE]
            required: TLE
---
# Same as:
funky.cpp:
  secret:
    subgroups:
      huge_instances/disconnected_graph:
        allowed: [RTE, TLE]
        required: TLE
# Can also specify a required judgemessage, not only verdicts
linear_search.py:
  judge_message: "too many rounds" # matches judgemessage.txt as substring, case-insensitive
#########
# Scoring
#########
# simplest case:
th.py: full score
# Partial solutions can be given in various ways:
partial.py: [50, 60]
---
partial.py: 60
---
partial.py:
  score: 60
---
partial.py:
  score: [50, 60]
---
partial.py:
  fractional_score: [.5, .6] # fraction of full score
---
# For subtasks, you probably want to specify the
# outcome per subgroup. You need to be more verbose:
partial.py:
  secret:
    subgroups:
      subtask1: full score
      subtask2: 0
      subtask3: 0
---
# Can be even more verbose about scores
partial.py:
  secret:
    subgroups:
      subtask1: full score
      subtask2:
        score: 13 # absolute score on this group is exactly 13
      subtask3: # between 10% and 40% of (full score for subtask 3)
        fractional_score: [.1, .4]
---
# Can still specify testcases
bruteforce.py:
  secret:
    subgroups:
      subtask1: full score    # subtask 1 has small instances
      subtask2:
        score: 0              # No points for this subtask
        required: TLE         # ... because some testcase timed out
        allowed: [AC, TLE]    # ... rather than any WAs
---
# The common abbreviations work here as well, you probably want to write this instead:
bruteforce.py:
  secret:
    subgroups:
      subtask1: full score            # could write "accepted" as well in this case
      subtask2: time limit exceeded   # this is more informative than "0"
And here’s the schema:
#registry
#registry: close({ [string]: #root })
#test_case_verdict: "AC" | "WA" | "RTE" | "TLE"
#root: #common | #range | {
  #expectations
  sample?: #subgroup
  secret?: #subgroup
}
#subgroup: #common | #range | {
  #expectations
  subgroups: close({ [string]: #subgroup })
}
#expectations: {
  allowed?: // only these verdicts may appear
    #test_case_verdict | [...#test_case_verdict] | *["AC", "WA", "RTE", "TLE"]
  required?: // at least one of these verdicts must appear
    #test_case_verdict | [...#test_case_verdict]
  judge_message?: string // this judgemessage must appear
  score?: #range
  fractional_score?: #fractional_range
}
#common: "accepted" |        // { allowed: AC; required: AC }
  "wrong answer" |           // { allowed: [AC, WA]; required: WA }
  "time limit exceeded" |    // { allowed: [AC, TLE]; required: TLE }
  "runtime exception" |      // { allowed: [AC, RTE]; required: RTE }
  "does not terminate" |     // { allowed: [AC, TLE, RTE]; required: [RTE, TLE] }
  "not accepted" |           // { required: [RTE, TLE, WA] }
  "full score"               // { fractional_score: 1.0 }
#range: number | [number, number]
#fractional_range: #fraction | [#fraction, #fraction]
#fraction: float & >=0.0 & <=1.0
I don't have any strong opinions on the scoring part, but the verdict specification sounds fine to me. I do think it's important to have the list of common expectations clearly documented, and I'd strongly prefer that we can then use directories named after these to organize the submissions in; I realize that this is not really part of this proposal, but good to keep in mind.
Thank you, Jaap!
Clear documentation is exactly what I’m after.
The organisation into directories is orthogonal to the expectations proposal, but consistent with it.
In particular, I would like to allow globbing in submission pathnames. This means that the directory definitions in the legacy specification can be expressed as the following expectations.yaml file:
# Legacy specification expectations implicit in directory structure
accepted/: accepted
wrong_answer/: wrong answer
time_limit_exceeded/:
  allowed: [AC, WA, TLE]
  required: TLE
runtime_error/:
  required: RTE
(And as far as I understand, these expectations are not consistently implemented, nor do we agree that they are what we want. My proposal should help us be much more clear about that.)
I'd be happy to include "default expectations for unlisted submissions", but I'm not sure it's part of the expectations proposal. (I'm willing to make it so, but as we observed, current judge systems already pursue different traditions about what the directories mean. My main ambition is to facilitate the specification of expectations, rather than mandate requirements on implementations.)
Note that the globbing idea also allows Thore to promise that all his submissions pass the samples:
**/*-th.py:
  sample: accepted
Similarly, a head of jury could even require that everybody’s WA submissions pass all samples:
wrong_answer/*:
  sample: accepted
But globbing is science-fiction for now.
Version 0.4.0
Changes since 0.3.0: flattened the hierarchy of subgroups to be just strings. Reason: simpler to read, write, and specify. Far less verbose in 99.9% of cases. Also added regular expressions to allow for auto-numbered groups and testcases.
First, some examples:
# Simple examples for some common cases
a.py: accepted # AC on all cases
b.py: wrong answer # at least one WA, otherwise AC
c.py: time limit exceeded # at least one TLE, otherwise AC
d.py: runtime exception # at least one RTE, otherwise AC
e.py: does not terminate # at least one RTE or TLE, but no WA
f.py: not accepted # at least one RTE, TLE, or WA
g.py: full score # gets max_score
# submissions are identified by prefix:
wrong_answer/: wrong answer # expectations "wrong answer" apply to "wrong_answer/th.py" etc.
# Abbreviations are just shorthands for richer maps
# of "required" and "allowed" keys.
#
# For instance, these are the same:
th.py: accepted
---
th.py:
  allowed: AC
  required: AC
---
# A submission failed by the output validator on some testcase
# These are the same:
wrong.py: wrong answer
---
wrong.py:
  allowed: [WA, AC]
  required: WA
---
wrong.py:
  allowed: # alternative yaml syntax for list of strings
    - WA
    - AC
  required: WA
---
# Specify that the submission fails, but passes the samples.
# These are the same, using the same abbreviations as
# above for "accepted" and "wrong answer"
wrong.py:
  sample: accepted
  secret: wrong answer
---
wrong.py:
  sample:
    allowed: AC
    required: AC
  secret:
    allowed: [AC, WA]
    required: WA
# Constraints apply to testcases in the entire subtree of cases that match the string:
funky.cpp:
  allowed: [AC, WA, RTE]
  secret:
    allowed: [AC, RTE, TLE] # TLE is forbidden at ancestor, so this makes no sense
  secret/small: accepted    # more restrictive than ancestor: this is fine
# Specification for subgroups works "all the way down to the testcase",
# though it's seldom needed:
funky.cpp:
  secret/huge_instances/disconnected_graph:
    allowed: [RTE, TLE]
# Can also specify a required judgemessage, not only verdicts
linear_search.py:
  judge_message: "too many rounds" # matches judgemessage.txt as substring, case-insensitive
# Allow digit regex to catch auto-numbered groups: `\d+`
submission.py:
  secret/\d+-group/: accepted # matches 02-group
#########
# Scoring
#########
# simplest case:
th.py: full score
# Partial solutions can be given in various ways:
partial.py: [50, 60]
---
partial.py: 60
---
partial.py:
  score: 60
---
partial.py:
  score: [50, 60]
---
partial.py:
  fractional_score: [.5, .6] # fraction of full score
---
# For subtasks, you probably want to specify the
# outcome per subgroup. You need to be more verbose:
partial.py:
  secret/subtask1: full score
  secret/subtask2: 0
  secret/subtask3: 0
---
# Can be even more verbose about scores
partial.py:
  secret/subtask1: full score
  secret/subtask2:
    score: 13 # absolute score on this group is exactly 13
  secret/subtask3: # between 10% and 40% of (full score for subtask 3)
    fractional_score: [.1, .4]
---
# Can still specify testcases
bruteforce.py:
  secret/subtask1: full score # subtask 1 has small instances
  secret/subtask2:
    score: 0           # No points for this subtask
    required: TLE      # ... because some testcase timed out
    allowed: [AC, TLE] # ... rather than any WAs
---
# The common abbreviations work here as well, you probably want to write this instead:
bruteforce.py:
  secret/subtask1: full score           # could write "accepted" as well in this case
  secret/subtask2: time limit exceeded  # this is more informative than "0"
And here’s the schema:
#registry
#registry: close({ [string]: #root })
#test_case_verdict: "AC" | "WA" | "RTE" | "TLE"
#root: #common | #range | {
  #expectations
  [=~"^(sample|secret)"]: #expectations
}
#expectations: {
  #common
  #range
  allowed?: // only these verdicts may appear
    #test_case_verdict | [...#test_case_verdict]
  required?: // at least one of these verdicts must appear
    #test_case_verdict | [...#test_case_verdict]
  judge_message?: string // this judgemessage must appear
  score?: #range
  fractional_score?: #fractional_range
}
#common: "accepted" |        // { allowed: AC; required: AC }
  "wrong answer" |           // { allowed: [AC, WA]; required: WA }
  "time limit exceeded" |    // { allowed: [AC, TLE]; required: TLE }
  "runtime exception" |      // { allowed: [AC, RTE]; required: RTE }
  "does not terminate" |     // { allowed: [AC, TLE, RTE]; required: [RTE, TLE] }
  "not accepted" |           // { required: [RTE, TLE, WA] }
  "full score"               // { fractional_score: 1.0 }
#range: number | [number, number]
#fractional_range: #fraction | [#fraction, #fraction]
#fraction: float & >=0.0 & <=1.0
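To make the allowed/required semantics concrete: every observed verdict must be allowed, and at least one required verdict must actually occur. A minimal Python sketch of that check (not the draft implementation's actual API):

def satisfies(verdicts, allowed=None, required=None):
    # allowed defaults to all four verdicts, matching the schema's default.
    allowed = set(allowed) if allowed else {"AC", "WA", "RTE", "TLE"}
    if not set(verdicts) <= allowed:          # a forbidden verdict appeared
        return False
    if required and not set(required) & set(verdicts):
        return False                          # no required verdict appeared
    return True

# "wrong answer" expands to { allowed: [AC, WA], required: WA }:
assert satisfies(["AC", "WA", "AC"], allowed=["AC", "WA"], required=["WA"])
assert not satisfies(["AC", "AC"], allowed=["AC", "WA"], required=["WA"])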
Draft implementation of 0.4.0 is now underway at https://github.com/thorehusfeldt/BAPCtools/blob/feat/expectations/bin/expectations.py , with quite thorough comments and python doctest. Eventually, I hope to merge this into https://github.com/RagnarGrootKoerkamp/BAPCtools …
@thorehusfeldt I have a few questions...
Could you explain what you mean by "the verdict of a test group is no longer a meaningful concept."?
Is the meaning of allowed that all verdicts must be in the specified set, and required that at least some verdict must be in the specified set? (I'm pretty sure the answer is yes)
What happens if a test data directory is named "allowed", "required", "score", "judge_message", or "fractional_score"?
Are these the same?:
partial.py:
  secret/subtask2:
    score: 13
---
partial.py:
  secret/subtask2: 13
Are globs allowed for both file name paths and test data paths?
Can you specify a top level expectation, and also a group expectation on the same file? What does that look like?
Could you elaborate a bit on why fractional_score is worth inclusion? It seems to me that you could always do without, and it's not even that much simpler, but I'm sure I'm missing some use case.
You say that this is orthogonal to the current directory structure, but what is your thinking about how that should work together? I can see 3 obvious options:
Thank you so much for your comments, @niemela ; good questions all. I’ll respond in detail in a moment here, but many of the issues are addressed in the documentation of the pull request at BAPC-tools: https://github.com/thorehusfeldt/BAPCtools/blob/feat/expectations/doc/expectations.md
1. With the changes to "how grading works" suggested mainly by @jsannemo , there is now no longer "a verdict for a test group", in particular for scoring problems. Scoring problems have a "score for a testgroup". (In legacy, there is such a concept, and September 2023 problemtools implements it.) Tools like DOMjudge apparently never had this concept; only the entire set of test data receives a "final verdict" (called the "outcome"). I signal no opinion here on how this should be, I merely observe that "the verdict of a test group" is not a socially useful concept.
2. Yes, exactly. Note that the framework does not offer a way of saying "all of TLE, RTE need to appear". It's trivial to add, but I don't see a use case for it, so I haven't added it. That's exactly the kind of decision I want feedback on.
3. This can never become a problem. Test data directory paths start with sample or secret, and therefore the test data pattern must start with those words. Ensured in the schema here:
[=~"^(sample|secret)"]: #expectations
So you could have a test data directory whose full path name is <problempath>/data/secret/required/ if you really wanted to do something silly, but you still need to refer to it as, say, secret/required or a prefix of that; never as required.
4. Yes. But note that the syntax for scoring problems is at best temporary right now (because the specification for scoring problems is very much in flux).
5. Great question! I distinguish between submission patterns and test data patterns. Here are some submission patterns:
accepted
accepted/
accepted/th.py
mixed/fredrik.cpp
mixed/greedy-
They match as prefixes, in Python's str.startswith.
Here are some typical test data patterns:
sample:              # match data/sample/*
secret/huge:         # match both the testcase secret/huge-instance-934 and the group secret/huge/023-graph/
secret/group1/:      # match the subtask data/secret/group1
secret/\d+-monotone: # match secret/034-monotone
secret/(?!greedy):   # match all testcases not starting with secret/greedy
Test case matches are as regexen from the start of the string, using Python's re.match.
In 99% of the cases for pass/fail problems, you'll only ever specify one-string submission patterns like accepted or maybe accepted/, and you'll only ever specify all testcases, maybe with particular attention to sample in some traditions. For scoring problems, you'd use secret/group1/ (or maybe secret/\d+-greedy/) a lot.
6. Yes, you can. The specification has no opinion on where the expectations are specified, but you can do it in expectations.yaml:
wrong_answer/: wrong answer     # this is the default
wrong_answer/th:
  sample: accepted              # Thore promises that all his submissions get accepted on sample
wrong_answer/th-greedy:         # Also promises that th-greedy.py...
  secret/\d+greedy-killer: [WA] # ... gets exactly WA on that specific test case
The registry assembles all the matching prefixes; in my draft implementation this is the method Registry.for_path(submission_path): https://github.com/thorehusfeldt/BAPCtools/blob/b4d4df92e9b7bce2e1ccadff872168e6694a4b58/bin/expectations.py#L294
In the pydoc comments you can see a complete and concrete example for how that could work in a concrete implementation. (A simplified sketch of the prefix assembly follows after this list.)
7. I just try to be consistent with the direction that the 2023-07 draft goes in, mainly following @jsannemo 's ideas. Scoring problems are in flux right now; my current implementation at BAPCtools (which only does pass-fail anyway) does not touch this issue. These conversations are better had after the problem package specification converges.
8. All of your options work; I have no strong opinions either way and don't want it to get in the way of specifying the expectations framework.
I would assume that a tool would at least warn about any unexpected behaviour, such as an author violating the meaning of those submission directories that are specified in the problem package specification. Or inconsistencies between "what the yaml says", "what the @EXPECTED_RESULTS@ say", and "which directory the submission is in". If I were to decide, I'd go with an expectations.yaml file with minimal contents (exactly four lines, one for each directory in the problem package specification). That's very easy and clean to specify and implement. But: no strong opinion; my ambition is to allow us to specify expectations, not to decide how various traditions or communities or tools should behave.
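(Back to point 6: the promised sketch of the prefix assembly. This is a simplification, not the actual Registry class:)

def expectations_for(submission_path, registry):
    # Collect every entry whose pattern is a prefix of the submission path.
    # E.g. "wrong_answer/th-greedy.py" picks up the entries for
    # "wrong_answer/", "wrong_answer/th", and "wrong_answer/th-greedy".
    return [spec for pattern, spec in registry.items()
            if submission_path.startswith(pattern)]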
(@thorehusfeldt Thanks for the numbering, I should have done that.)
I disagree somewhat with your claims/reasoning, but now I understand what you mean, and the disagreement is not relevant for this question. So, good enough.
Good. I also don't see a natural use case for "require all of these to appear". That said, the way I have understood the expectations is that they are completely "additive", so wouldn't the following (if it were legal YAML) actually mean "require both of WA and AC"?:
wrong.py:
  allowed: WA
  allowed: AC
That said.... this would be legal (but very ugly), and mean the same thing:
wrong.p:
  allowed: WA
wrong.py:
  allowed: AC
Right?
Ah, that makes sense, and solves the problem.
Ok. I don't think it's that much in flux anymore, and we really should nail this down as well.
So, if I understand you correctly; submission patterns are only (and always) prefixes, test data patterns are always prefixes, but also regexps?
Follow-up question: why do we feel that test data patterns need to be more powerful? Why is it not enough to have them also be prefixes only?
Ok. My current feeling is that fractional_score is not that useful. I would lean towards leaving it out for now.
Ok. I believe your stated (slight) preference is my option 3. My slight preference would be option 2.
But fractional score reduces work in case you reassign subtask scores. That's why I would use fractional score over score.
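For example, a hypothetical helper showing why the fractional form survives a rescoring of the subtask:

def absolute_range(fractional_range, subtask_max_score):
    # Convert a fractional_score range into absolute points. If the subtask's
    # score is later reassigned in testdata.yaml, the fractional expectation
    # stays valid without any edits.
    lo, hi = fractional_range
    return [lo * subtask_max_score, hi * subtask_max_score]

assert absolute_range([0.1, 0.4], 30) == [3.0, 12.0]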
I went through @niemela 's questions on the phone with him, but let me repeat point 5 here:
I want to be able to match the testcase secret/053-all-points-on-a-line as well as the testgroup secret/1-tiny/, even after these two things have changed their numbering. This is relevant for those generation tools (like BAPCtools generators or Kodsport testdata tools) that automatically generate numbers for testcases or testdata groups. The problem author thinks of these as "the testcase I called all-points-on-a-line", or "the testgroup with tiny inputs", not testcase 053 etc.
One could argue for a much more restrictive subset of regexen (such as \d+ and -?, but not much else) but it's harder to specify and implement. As a bonus, I get to refer to "all test cases not starting with foo" as secret/(?!foo), for which I can see use cases, like preparing a faulty submission that should work always, except for testcases with…
To summarise, the two different ways of matching are very easily expressed in Python:
submissionpath.startswith(submissionpattern)
re.match(testdatapattern, testdatapath)
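Runnable, under those Python semantics (helper names are illustrative; note that re.match anchors at the start of the string, so test data patterns behave as regex prefixes):

import re

def matches_submission(pattern, path):
    return path.startswith(pattern)              # plain prefix

def matches_testdata(pattern, path):
    return re.match(pattern, path) is not None   # regex, anchored at the start

assert matches_submission("accepted/", "accepted/ragnar.cpp")
assert matches_testdata(r"secret/\d+-monotone", "secret/034-monotone")
assert not matches_testdata(r"secret/(?!greedy)", "secret/greedy-killer")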
I'm not really a fan of the inconsistent matching rules for submissions (prefix) vs testcases (regex).
1. Personally I think (an equivalent of) simple *-globs would be necessary and sufficient for me in both cases. This allows
accepted/*-cubic.cpp
secret/*-slow
2. I have mixed feelings about matching against prefixes for submissions:
* You definitely want to specify directories like accepted/. [Should this end in a /?]
* Specifying accepted/ragnar feels very implicit/incomplete. Doing accepted/ragnar* feels nicer and clearer to me.
3. So for submissions we match against the full file (/program) name, while for testcases we skip the .in. (I'm OK with that.)
Thank you, Ragnar! I have no strong feelings about “how to match” and am more than willing to have this go in any direction.
Regexen give me negative lookahead like secret/(?!huge), which I would find super useful. (See my screenshot at BAPCtools pull request 307 for a concrete case.) They are both more expressive (I can target things that I can't get at with globs) but also more restrictive (\d+ is less permissive than *). I remain mildly in favour of them. On the other hand, a strong argument for globs is that their semantics are easy, and that we can syntax-check them. If you dislike accepted/ragnar then you are free to write accepted/ragnar.* to make your intention clear to readers who share your aesthetics. But note that in 90% of the cases this is not an issue, whereas I expect to see many accepted and maybe accepted/ (nice and clean) that would have to be accepted/* (clear but busy) if we go with globbing.
There is also the subtlety of submission paths having a file extension (accepted/th.py is not accepted/th.cpp) but test case paths not having it. (A test case is not its input file; in particular there are interactive test cases that don't even have input files.) So a test case is identified by secret/huge/054-random.
One more thing: unlike many other decisions, globbing vs. regexen are not compatible. This is in contrast to deciding that submission patterns match by prefix but later extending the framework to instead match by regex when we get some experience and feel there's a need for it. That extension would be backwards compatible: accepted/ would still match accepted/ragnar.cpp. But if we go for globbing now and later decide for regexen, then this would not be backwards compatible; accepted/* would no longer match accepted/ragnar.cpp. So glob vs. regex is more important to get right from the start than prefix vs. regex.
> I'm not really a fan of the inconsistent matching rules for submissions (prefix) vs testcases (regex).
But it's prefix for submissions and prefix with regex for test cases. Right?
> 1. Personally I think (an equivalent of) simple `*`-globs would be necessary and sufficient for me in both cases. This allows
I agree that it would be sufficient for both. I don't see how it's necessary for submissions.
> * `accepted/*-cubic.cpp`
I absolutely don't think this is needed. You very seldom have more than a handful of submissions anyway.
I'm not even sure I think it would be a good thing. What rule were you imagining to apply to this hypothetical set of files? Allowing it to TLE? That wouldn't work with the assumed rule on accepted.
> * `secret/*-slow`
Yes, this is the example that convinced me that something more than only prefix matching is needed for test cases.
> 2. I have mixed feelings about matching against prefixes for submissions:
> * You definitely want to specify directories like `accepted/`. [Should this end in a `/`?]
In the current suggestion, as I understand it, either would work.
> * Specifying `accepted/ragnar` feels very implicit/incomplete. Doing `accepted/ragnar*` feels nicer and clearer to me.
So you would suggest *-globs on the full string, and no prefix matching, right? I think that would work.
> 3. So for submissions we match against the full file (/program) name, while for testcases we skip the `.in`. (I'm OK with that.)
Indeed, accepted/*-cubic.cpp isn't very much needed. I think it's nice to have but won't insist on it.
I think the full string vs. prefix-only question may depend a bit on whether we go with regex or globs. The glob secret/slow* matching full strings is fine to me, but the regex secret/slow.* is a bit ugly, and also confusing since it is different from secret/slow..* :thinking:
Thank you all for giving this serious thought!
Wrt globbing, keep in mind common cases, like
accepted: accepted
wrong_answer: wrong answer
...
compared to
accepted/*: accepted
wrong_answer/*: wrong answer
...
or even
wrong_answer:
  sample: accepted
compared to
wrong_answer/*:
  sample/*: accepted
My ambition is to keep the common cases clean and unencumbered, while making advanced situations possible. Python's re.match(pattern, string) does exactly that: it's simultaneously light-weight and powerful. Globbing is slightly worse on both counts.
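One way to see the asymmetry: Python can mechanically translate a glob into an equivalent regex (via the standard fnmatch module), but not the other way around, so the prefix/regex route keeps more doors open. A small sketch:

import fnmatch
import re

# Any glob compiles down to a regex...
glob_as_regex = fnmatch.translate("wrong_answer/*")
assert re.match(glob_as_regex, "wrong_answer/th.py")

# ...but the glob "accepted/*" and the regex "accepted/*" mean different
# things (as a regex it reads "accepted" followed by zero or more slashes),
# so switching interpretations later would silently change existing patterns.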
Closing since there are newer proposals of this being discussed.
Background: I presented an expectations proposal in Lund in July and received a lot of useful feedback. Moreover, the definition of test groups, grading, and scoring have evolved a lot, so that much of the framework is now obsolete. In particular,
I’ll try to summarise a new proposal here, hopefully in line with the working group’s preferences.
To solidify terminology, testgroup here means a directory in data (including data itself). data is a testgroup, so is data/sample, the directory data/secret/huge_instances/overflow/ in a pass-fail task, and the subtask data/secret/group1 in an IOI-style problem.
Expectations are for testcases, not testgroups
The conceptually largest change is that expectations are specified for (sets of) testcases, not for testgroups. In particular, an expectation given for path applies to all testcases below it, i.e., path/**.
For instance, in a pass-fail problem, you can write
Required and allowed
As far as I can tell, we need to be able to specify both required and allowed testcase verdicts. The above syntax seems less verbose than the alternative. The difference becomes particularly striking in IOI-style problems with subtasks. (Try it.)
I have no strong feelings about the names of keys, but required and allowed seem clear to me. The semantics is that if $V$ is the set of verdicts for the testcases below the specified testgroup, $R$ the required verdicts, and $A$ the allowed ones, then $R\cap V\neq\emptyset$ and $V\subseteq A$. I played around with none, any, all, but it didn't become clearer or shorter. Suggestions are welcome (but try them out first by actually writing the resulting YAML expressions). [Update: better syntax, see two posts down]
Useful shorthands: the short form is a shorthand for the expanded map of required and allowed, which is the most important use case. Also, string is a shorthand for the singleton [string].
Full schema
The schema is something like this, if this makes sense to you
With subtasks
The most important use case for me is to specify expected behaviour on subtasks. This becomes less natural than in my original proposal (where the concept “testgroup verdict” existed.)
Now we’re at:
This is quite verbose, but I can’t find a way to make it shorter. Feel free to try.
Scoring
Currently I’m at
full is important because I don't want to have to remember the values in testdata.yaml when the score for subtask 1 changes; the value full communicates more to the reader than 23.
Q1: Should we instead have a fraction here, such as score: 1.0 meaning full and score: [.2, .45] meaning "this gets between 20% and 45% of the full value for this subtask"? This sounds more useful to me.
Judgemessages
I want to allow judgemessages as well, which doesn't change the schema (just add | string to #verdict):
I think this will make it much easier to construct custom validators (because you can check for full code coverage in your validator.)
Toplevel group name
Consider
This is (as far as I can tell) the best way of specifying “this is a WA submission that passes on sample”. But the role of this example is to highlight the fact that the toplevel directory doesn’t have a good name.
Q2: what should be done about this?
""
or maybe"."
are perfectly fine names fordata
when you actually need them. (Which is seldom, mostly it follows from the descendant verdicts anyway so you’re just being sloppy.)data/
to all testgroup names, so it’sdata/sample
etc. from now ondata
meansdata
andsample
meansdata/sample
. If authors have bothdata/secret/foo
anddata/secret/baz/foo
then they have themselves to blameBespoke Verdict Syntax Would Get Rid of Lists and Required / Allowed
An alternative would be to not have the required and allowed keys and instead bake the expected behaviour into the terminology. After all, there is only a constant number of pairs $R$ and $A$ with $R\subseteq A$ that can ever appear, since $|A|\leq 4$. For instance, accepted means "must get exactly AC on all test cases", timeout means "AC and TLE are allowed, and TLE is required", and not_wrong means that WA is disallowed (everything else is OK). I guess there are at best 10 different actually-existing cases that ever need to be defined.
This would allow some very useful shorthands.
Q3: Is this sufficiently tempting to try to come up with a list of those cases, and think about good names? Please comment.
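To put a number on "only a constant number of cases": enumerating all pairs $(R, A)$ with nonempty $R\subseteq A\subseteq\{AC, WA, TLE, RTE\}$ gives 65 combinations, so naming only the handful that occur in practice is attractive. A quick Python sketch:

from itertools import combinations

VERDICTS = ("AC", "WA", "TLE", "RTE")

def subsets(items):
    for r in range(len(items) + 1):
        yield from combinations(items, r)

# All pairs (required, allowed) with a nonempty required subset of allowed.
pairs = [(R, A) for A in subsets(VERDICTS) for R in subsets(A) if R]
print(len(pairs))  # 65 in total; only ~10 are "actually existing" cases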