Kattis / problem-package-format

Kattis problem package format specification
https://www.kattis.com/problem-package-format

Verdict for scoring problems #110

Closed · thorehusfeldt closed this 5 months ago

thorehusfeldt commented 1 year ago

The current formulation is this:

For scoring problems, the concept of verdict is less important. In fact, some judging systems may choose to not even present a final verdict for a submission, but instead only present, for example, the score and the verdicts on the test cases the submission failed.

Do we mean this? (In particular, do we mean “score … where the submission failed” (I think not), and do we really mean “test cases(!) [where] the submission failed”?)

It could be that what is meant is

… show the total score, as well as the first rejected verdict in each subgroup of secret.

But I honestly don’t know.

This still leaves undefined what should be done about data/secret/group3/edge_cases/.

I am fine with a rule such as

for a problem of type scoring, for each subgroup g of data/secret, the test cases in g/** are judged as for a pass-fail problem (they are subject to a time limit, and their verdict is either AC or the verdict of the lexicographically first rejected test case in g/**). For each accepted g, the score of g is [and so on]

Is this what we mean? (This is of course a significant simplification and restriction of the current specification.)
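For concreteness, a minimal sketch of that rule (all names here are hypothetical; this is one possible reading, not spec text):

```python
# One possible reading of the proposed rule. "cases" is a list of
# (path, verdict) pairs for all test cases in g/**; verdicts are
# "AC", "WA", "TLE", or "RTE".

def subgroup_verdict(cases):
    for path, verdict in sorted(cases):   # lexicographic order by path
        if verdict != "AC":
            return verdict                # first rejected case decides
    return "AC"
```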

Is there a discussion about this that I’m not privy to?

jsannemo commented 1 year ago

There was some discussion on the issue/PR/slack.

I think I missed an Oxford comma in that sentence:

for example, the score, and the verdicts on the test cases the submission failed.

should be the proper reading.

Is this what we mean? (This is of course a significant simplification and restriction of the current specification.)

The point was to kind of move away from dictating specific verdicts for scoring problems. For example, a common judging model -- the one used in CMS, probably the most widely used IOI-style judge -- is to judge all test cases and show you a list of all test cases with their verdicts and time usage. Before graders were removed, this wasn't something that could easily be left up to judging systems or contest administrators; grader semantics required judging to be performed in a specific order, with verdicts computed in a fixed way. But now there's very little in the verification of the problem itself that forces a specific verdict, except by coincidence for accepted and wrong_answer submissions.

Note that the only place we mandate anything about how exactly verdicts are computed is for pass-fail under Grading. This isn't actually used anywhere in the format though, and in practice judging systems can (and probably will) do what they want. I personally lean slightly towards removing that sentence too, but don't feel strongly enough about it to do anything to that end.

thorehusfeldt commented 1 year ago

But certainly a judge for type:scoring is expected to judge every subgroup of data/secret?

Or do you find that it’s up to the judge to abort after a rejection on secret/group1 even though secret/group2 exists (and all test cases in secret/group2/** would have been accepted)?

Honest question; I don’t understand how drastically underspecified you want to leave this.

(If so, I disagree.)

thorehusfeldt commented 1 year ago

… and the verdicts on the test cases the submission failed.

You really mean test cases here, not test groups? (Or, in fact, subtasks?)

(I remain terminologically challenged, but I believe subtask roughly means “child test group of /secret”.)

jsannemo commented 1 year ago

But certainly a judge for type:scoring is expected to judge every subgroup of data/secret?

Yes, that is how the score of data/secret is computed.

Honest question; I don’t understand how drastically underspecified you want to leave this.

Could you explain what exactly you think is underspecified? The intent was to keep the semantics of the score exactly the same, but drop requirements on what the "verdict" is. Do you think that how the score of a group is computed is underspecified too?

You really mean test cases here, not test groups ? (Or, in fact, subtasks ?)

Yes, that is indeed the luxury we were afforded at IOI. Once you think about it, there's nothing that says a contest should only be allowed to show you a single failed case -- it's just a choice as arbitrary as showing you all failed cases.

jsannemo commented 1 year ago

Specifically, I refer to:

If sum, the score is the sum of the subresult scores. If min, the score is the minimum of the subresult scores.

It's not very clear what "subresult score" means, though (in particular, that this is the score of the groups inside a group, and not only of the test cases!). That section could probably use some of your more stringent writing.
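For what it's worth, a small sketch of the reading that seems intended (the dict layout is an assumption for illustration, not spec syntax): a group's score aggregates the scores of both its test cases and its child groups, recursively.

```python
# Sketch of the declarative reading of sum/min aggregation.
# group = {"aggregation": "sum" or "min",
#          "case_scores": [...], "subgroups": [...]}   (assumed layout)

def group_score(group):
    subresults = list(group["case_scores"])
    subresults += [group_score(g) for g in group["subgroups"]]
    return sum(subresults) if group["aggregation"] == "sum" else min(subresults)

# e.g. min over two child groups that each sum their cases:
# group_score({"aggregation": "min", "case_scores": [],
#              "subgroups": [
#                  {"aggregation": "sum", "case_scores": [10, 15], "subgroups": []},
#                  {"aggregation": "sum", "case_scores": [20], "subgroups": []},
#              ]})  == 20
```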

thorehusfeldt commented 1 year ago

By the way, from the perspective of expectations I very much would like it to be well-defined which verdicts various subgroups achieve in a correct implementation of the specification, also for problems of type:scoring.

I want to be able to say that greedy.py gets full points on test group 1 but fails on test group 2 with WA rather than TLE (while brute-force.py fails precisely with TLE). And I would find that particularly useful for scoring problems. I can’t express that if “test groups in scoring problems don’t have verdicts”. [1]

And such a framework would strictly improve problem quality. If we leave these things unspecified, it’s harder to write and use precise test cases with expected failure behaviour. So I would favour keeping well-defined semantics of verdicts for test groups (and I’m fine with “first-error”).

[1] I do understand that my expectations framework actually could solve this by being fine-grained-by-test-case, but I’d have to rethink the semantics.
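To make the wish concrete, such a per-group expectation might look like this (purely hypothetical syntax, rendered as a Python dict; no such schema exists in any spec):

```python
# Purely hypothetical per-group expectations (not part of any specification):
expectations = {
    "greedy.py": {
        "secret/group1": {"verdict": "AC", "score": "full"},
        "secret/group2": {"verdict": "WA"},   # must be WA, not TLE
    },
    "brute-force.py": {
        "secret/group2": {"verdict": "TLE"},  # must fail precisely with TLE
    },
}
```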

jsannemo commented 1 year ago

I want to be able to say that greedy.py gets full points on test group 1 but fails on test group 2 with WA rather than TLE (while brute-force.py fails precisely with TLE). And I would find that particularly useful for scoring problems. I can’t express that if “test groups in scoring problems don’t have verdicts”. [1]

I guess my point is that when you run your scoring problem on my judge, it doesn't mean anything "to fail on test group 2 with WA rather than TLE". I won't show any "verdict of test group 2" to a user or administrator: you get a score and a bunch of test case results -- take it or leave it. You would be assigning semantics in the problem to a concept that wouldn't exist when it's used. So I can't agree that it would strictly improve problem quality, since there would be no distinguishable result (in this case -- judges implementing some kind of "subtask verdict" are a different story, of course).

I'm a bit curious, though, what it is you would be trying to test. This rather sounds like you're looking for: "greedy.py must get WA on some case but never TLE on any case", which doesn't have anything to do with what the subtask verdict is. In fact, a subtask verdict wouldn't be useful to you, since you might fail the group because of TLE later on -- and that's what you would want to catch, right?

thorehusfeldt commented 1 year ago

TH: But certainly a judge for type:scoring is expected to judge every subgroup of data/secret?

JS: Yes, that is how the score of data/secret is computed.

I’m fine with this, but we have to say that somewhere.

What I have a hard time understanding right now is that the behaviour of the legacy default grader seems to be gone. Right? The following is wrong (right?):

The result of a test data group is computed by applying a grader to all of the sub-results (test cases and subgroups) in the group. See Graders for more details.

For instance, if in a pass-fail problem I have secret/huge and secret/edge-case then the judge is not required to compute sub-results for either. (right?)

Also, for scoring problems, behaviour is now even more undefined than for pass-fail, and our current specification does not (as far as I can tell) say anything like

in a scoring problem, every subgroup of secret/ is evaluated [but I think we think this is true and should be specified]

no more than it says that

in a pass-fail problem, the subgroup of secret/ is evaluated [and I think this is indeed no longer supposed to be true]

I’m simply at a loss right now. I honestly don’t understand. Maybe somebody can try to explain how grading is done (in the new spec) for scoring problems.

thorehusfeldt commented 1 year ago

Maybe you mean this:

for a problem of type scoring, for each subgroup g of data/secret, the test cases in g/** are judged in some order (subject to the same time limit). If all test cases in g/** are accepted then [complicated stuff, but Thore understands]. Otherwise the score is 0. The score of the submission is [and so on]

This I would understand (but in that case we don’t need the grading key in testdata.yaml at all; everything follows from type:scoring).
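Read literally, that proposal amounts to all-or-nothing scoring per subgroup, roughly as follows (hypothetical names, with the accepted-case scoring elided just as above):

```python
# All-or-nothing reading of the proposal above (names are hypothetical):
def subgroup_score(case_verdicts, accepted_score):
    if all(v == "AC" for v in case_verdicts):
        return accepted_score   # the "[complicated stuff]" elided above
    return 0
```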

jsannemo commented 1 year ago

I’m fine with this, but we have to say that somewhere.

I'm trying to claim that we are, but in a very unclear manner. As I wrote in an earlier comment, the key phrase in the spec is:

If sum, the score is the sum of the subresult scores. If min, the score is the minimum of the subresult scores.

This is intended to say: "the score of a test group is set to be the sum/min of the score of all 'subresults'", where 'subresult' is intended to be interpreted as "the scores of all its test cases and test groups". This is not at all prescriptive of "how grading is done" in the sense of exactly how a judge works internally: it is not an imperative description, but a declarative one.

It is not well-described: clearly hiding this away in the definition of a key in the grading settings is not enough. It should be moved up to a higher level and described in a better way.

For instance, if in a pass-fail problem I have secret/huge and secret/edge-case then the judge is not required to compute sub-results for either. (right?)

It is indeed not required to. It wasn't required in the ICPC version of the old spec either, and DOMJudge doesn't implement it at all (since it doesn't know about test groups).

thorehusfeldt commented 1 year ago

Def: subgroup = what the spec says, i.e., a subdirectory of data.

Honest question: For type:scoring, is the score of every subgroup computed? (Say, the subgroup secret/group1/foo?)

This question partially aims to understand what “test group” means (to you). I constantly worry that many conversations here are needlessly long because of terminological difficulties.

Maybe you have test group == subgroup, maybe you have test group == subtask. And then maybe it is implicit that subtasks == subgroups of data/secret (by which I mean children, not successors). But I can only guess.

jsannemo commented 1 year ago

Def: subgroup = what the spec says, i.e., a subdirectory of data.

Yes.

Honest question: For type:scoring, is the score of every subgroup computed? (Say, the subgroup secret/group1/foo?)

Yes; to compute the score of subgroup secret/group1, one needs to know the scores of everything (cases and further subgroups) in the subgroup (and secret/group1/foo is one of those things).

jsannemo commented 1 year ago

I constantly worry that many conversations here are needlessly long because of terminological difficulties.

That is probably a sign that a github issue might not be the best medium for this conversation, but rather e.g. discord or slack where the latency of round-trips is much smaller. :)

niemela commented 1 year ago

The point was to kind of move away from dictating specific verdicts for scoring problems.

Do you mean moving away from dictating how to present specific verdicts? The intent with verdicts and score is that a score only exists if the verdict is accepted. This is why the verdict is not so important for scoring. There can of course be a multitude of different verdicts further down in the test data group tree, and how (or if) exactly to present these verdicts is what should not be overly dictated.

jsannemo commented 1 year ago

Since this seems to have become a very complicated discussion about exactly what this text says, I will try to as carefully as possible explain what I think. :) If you don't want to read it, the gist is that I want scores to always exist for groups/submissions so we can verify against them, and don't want verdicts for groups/submissions to be defined at all.

In my mind, there are three levels to writing this specification.

  1. What the specification defines and requires from the problem in terms of being able to perform validation of itself.
  2. What a solver in the judge observes when submitting to the problem once installed.
  3. What a judge internally does when it receives a submission.

I will ignore 3.


Scores: for validation, you want a score to always exist for groups and submissions. If you, in Thore's expectation framework, require a.cpp: score: 0, you want that score to exist and be 0 if the submission fails all test cases.

What a solver sees in the judge here is completely up to the judge, I'd say. CMS clearly thinks such submissions have a score (0), shown in the submission details. Kattis is a bit more ambiguous; it will show you 0 on the scoreboard once you send in a submission that "does not receive a positive score". Is that having score 0? Is that having a non-existent score? Is tomato pronounced tomatoh?

TL;DR: As a judge, you do you, but for validation it's practical to always assign a score, no matter "accepted-ness".


Verdicts: The verdict of a test case is clear to everyone: it's WA if the validator said so, TLE if you took too long, and RTE if you crashed. This is a useful concept within validation, so it should clearly be defined to exist in the specification.

Does it make sense to define a verdict for a test group? There's nothing in the specification that refers to it. A judge doesn't have to assign one either. If a judge desperately wants to, it could for example use the legacy worst_error semantics. Or first_error. Some judges don't (either internally or externally) define "test group verdicts". So: let's not.
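For concreteness, the two conventions mentioned could look like this (the severity ordering is an assumption for illustration; the format mandates neither convention):

```python
# Two possible group-verdict conventions; neither is mandated by the format.
SEVERITY = {"AC": 0, "TLE": 1, "RTE": 2, "WA": 3}   # assumed ordering

def first_error(verdicts):
    # verdict of the first rejected case, in judging order
    return next((v for v in verdicts if v != "AC"), "AC")

def worst_error(verdicts):
    # most severe verdict among all cases
    return max(verdicts, key=SEVERITY.get)
```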

What's left is the most biting issue: what's the verdict of a submission in a scoring problem? For a judge and a user, there are plenty of times where this doesn't have any meaning. For example, neither CMS nor my judge shows any kind of judgment for a submission on a scoring problem, only its score. On the other hand, Kattis always presents a judgement to the user. So, let's turn to the specification and validation.

Here, we actually barely define or use the "submission verdict". The accepted ("accepted on all cases") / time_limit_exceeded ("too slow on some case") / wrong_answer ("wrong answer on some case") / run_time_error ("crashes on some case") folders certainly don't, even though the accepted folder is equivalent to getting the Accepted verdict. The time limit section is at best ambiguous, using the terms "accepted" and "time_limit_exceeded" -- probably referring to solutions in those two folders.

In fact, we only use the submission verdict for validation in a single place - for scoring problems, even! - in partially_accepted:

Overall verdict must be Accepted. Overall score must be less than max_score.

However, we never define what it means for a submission to be "overall verdict Accepted" for a scoring problem. A better definition of partially_accepted after the scoring simplification is that such submissions should get a non-zero score, removing the last reference to "overall verdicts". With Thore's expectation framework, the use case that actually prompted the PAC folder inclusion in the first place (excluding "too slow" submissions from timings) is also moot.
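Under that redefinition, validating a submission in partially_accepted reduces to a score-range check, roughly (a sketch; score and max_score are assumed to be available to the validator):

```python
# Sketch of the proposed check for partially_accepted submissions:
# non-zero score (the proposal) and below max_score (the existing rule).
def check_partially_accepted(score, max_score):
    return 0 < score < max_score
```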

TL;DR: In practice, the specification as-is almost completely disregards the existence of any kind of "verdicts" for submissions to scoring problems. This is good: judges should get to do what they prefer.

thorehusfeldt commented 1 year ago

I’ve tried to explain what (I think that) Johan means here:

https://problem-package-format-copy.readthedocs.io/en/latest/format.html#how-scoring-problems-are-judged

thorehusfeldt commented 1 year ago

Four use cases for scoring are now explained here: https://problem-package-format-copy.readthedocs.io/en/latest/format.html#use-cases

Also, some suggestions for terminology and syntax. In addition, the bash script now validates all the yaml included via the literalinclude directive. Pretty cool.