Kattis / problem-package-format

Kattis problem package format specification
https://www.kattis.com/problem-package-format

Should we undo the changes to defaults in scoring? #157

Closed Tagl closed 2 weeks ago

Tagl commented 5 months ago

We made changes to the defaults in each test data group based on the type of the problem, to reduce how much needs to be explicitly stated within testdata.yaml by giving natural defaults for ICPC- and IOI-style problems. This does, however, affect how inheritance must work, and to circumvent issues we made the scoring part of testdata.yaml not inherited, while output and input validator flags are inherited.
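For illustration, here is a minimal sketch of the asymmetry being described (the exact key names are assumptions about the current draft, and case_sensitive is just an example flag for the default output validator):

# data/testdata.yaml
scoring:
  aggregation: sum                      # the scoring part: NOT inherited by subgroups
output_validator_flags: case_sensitive  # inherited by subgroups

A subgroup such as data/secret/group1/ without its own testdata.yaml would pick up the validator flags, but none of the scoring settings.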

I personally find it a bit confusing to have some parts inherited and others not. Should we undo the changes to the defaults so that inheritance is simplified back to its previous state?

Tagl commented 4 months ago

Suggestion for changes

  1. Change aggregation modes:

    • pass-fail: sets the score of the test group to max_score if the verdicts of all of its children are accepted. The children's scores are inconsequential. This is the default setting. May short-circuit in case of non-accepted verdicts on children.
    • sum: sets the score of the test group to the sum of the scores of its children. Must not short-circuit in case of non-accepted verdicts on children.
    • min: sets the score of the test group to the minimum of the scores of its children. May short-circuit in case of non-accepted verdicts on children.
    • sum_no_errors: works the same as sum, but requires that all tests are accepted; otherwise it gives the first error. May short-circuit in case of non-accepted verdicts on children. This resembles sum without accept_if_any_accepted from the legacy version. (A worked example of these modes follows this list.)
  2. Rename score to score_multiplier

  3. Add accept_score as it used to be: the default score for a test case if score.txt is not written by the output validator. A better name for it probably exists. This change is less important than the others.

  4. Change the default testdata.yaml to:

    scoring:
      aggregation: pass-fail
      accept_score: 1
      score_multiplier: 1
      max_score: 1
    output_validator_flags: ""
    input_validator_flags: ""
  5. We need some execution policy, at least for samples, but it could just as well be an execution_policy key on any group. Possible values are the equivalent of "don't run", "run but don't grade", and "run and grade". "Run but don't grade" can then be used instead of the old grader flag ignore_sample, and "don't run" instead of run_samples: false. I can't really think of a case where you would use this for anything else, but maybe someone else has an idea. Not in contests, but maybe for some educational purpose.
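To make the suggested aggregation modes concrete, here is a hypothetical group tree for a 100-point problem (all names and scores are made up):

data            aggregation: sum            score = sample + secret
├── sample      pass-fail, max_score: 0     0 if all sample cases are accepted
└── secret      aggregation: sum            score = group1 + group2
    ├── group1  pass-fail, max_score: 30    30 if every case is accepted
    └── group2  pass-fail, max_score: 70    70 if every case is accepted

A fully accepted submission scores 0 + 30 + 70 = 100; a submission failing only a case in group1 would presumably score 70, with group1 contributing nothing.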

How will this affect what needs to be written?

For pass-fail, no testdata.yaml is required.

For common IOI-style subtask scoring, data needs aggregation changed to sum, sample may need max_score changed to 0 and its execution policy set to "run but don't grade", and secret inherits sum from data. Finally, each subgroup needs aggregation set to pass-fail and max_score set to the desired score of the group. In total this is 1 + 2 + 2n lines to write, where n is the number of subtasks; the current specification requires 2n lines as well. (A sketch of these files follows.)
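As a concrete sketch of that count for two subtasks worth 30 and 70 points (counting settings rather than physical YAML lines; the execution_policy key, its value, and its placement are guesses at syntax):

# data/testdata.yaml (1 line)
scoring:
  aggregation: sum

# data/sample/testdata.yaml (2 lines)
scoring:
  max_score: 0
execution_policy: run_but_dont_grade

# data/secret/group1/testdata.yaml (2 lines per subtask)
scoring:
  aggregation: pass-fail
  max_score: 30

# data/secret/group2/testdata.yaml
scoring:
  aggregation: pass-fail
  max_score: 70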

For heuristic tasks you will again configure data, and depending on the task you may need to configure secret or sample to use a different aggregation; there is also freedom to use score_multiplier together with the output validator as is convenient. (A sketch follows.)
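For instance, a purely illustrative heuristic setup where the output validator writes a raw per-case score and the secret group sums the scaled results (the multiplier value is made up):

# data/secret/testdata.yaml (hypothetical)
scoring:
  aggregation: sum
  score_multiplier: 0.01  # scale the validator's raw scores into the desired range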

jsannemo commented 4 months ago

Nooo, don't re-complicate scoring for IOI tasks :(

Tagl commented 4 months ago

@jsannemo We are discussing this very thoroughly, and some of the reasoning for the changes made is not present in the issues or pull requests. Can you elaborate on why we no longer need something like on_reject or verdict aggregation such as accept_if_any_accepted or first_error? For example: how would you set up an optimization problem where every test case within a subgroup must be "Accepted", meaning that your answer is well formed in all cases, and in that case your score is the sum of the scores of the test cases? If you do have a rejected result in some test case, then the overall result is rejected with a score of 0.
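Under the aggregation modes suggested earlier in this thread, this could presumably be expressed on the subgroup as (sketch):

scoring:
  aggregation: sum_no_errors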

For standard subtask problems, you changed it such that the score multiplier is set to the value of the test group, and the output validator then returns a score of 1. This functionally does what we want, but semantically it means something different. Fredrik suggests that semantically we want to be able to convey that the test group has a fixed score in some way. This is what my pass-fail aggregation handles.
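A sketch of the contrast, with key names taken from the suggestions above and a made-up value of 30:

# current approach: the group's value lives in the multiplier
scoring:
  score: 30  # acts as a multiplier; the output validator returns 1 per case

# proposed: the group's fixed value is stated directly
scoring:
  aggregation: pass-fail
  max_score: 30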

Nooo, don't re-complicate scoring for IOI tasks :(

We do not want to get rid of the pragmatism in the current specification if possible. If you remember all of the arguments for the changes, it would be good to have them written up somewhere. As it is, we do not like the inheritance differences in testdata.yaml, and we cannot seem to get rid of them without changing the defaults in some way and making them consistent between types of problems.

jsannemo commented 4 months ago

The discussion I recall from Lund was basically:

thorehusfeldt commented 4 months ago

Let me add my own suggestion to this.

As a syntactically minimal intrusion, change the type field in problem.yaml to

type!: "scoring" | "pass-fail" |  [...int]

This allows the following:

type: "pass-fail"
--
type: "scoring"
--
type: [10, 25, 50, 15]

The interesting feature is the third. This specifies a scoring problem with 4 testgroups and the given scores, with the obvious semantics. This is easy to understand, easy to edit/maintain, and easy to programmatically verify (count the child subgroups of data/*). If this value is present in problem.yaml, no score key shall be present in testdata.yaml.

On the other hand, for problems with type scoring, the score key must be present in every testdata.yaml and specifies the aggregation rules. No defaults are specified, and no inheritance takes place.

If the overloaded type key makes you reject this, consider the following idea instead. In problem.yaml:

type!: "scoring" | "pass-fail"
if type == scoring { points?: [...int] }

It allows

type: pass-fail
--
type: scoring
---
type: scoring
points: [20,10,50,30]

but disallows

type: pass-fail
points: [20,10,50,30]

RagnarGrootKoerkamp commented 4 months ago

This last suggestion sounds quite nice to me (as a non-user of scores, so my opinion has light weight).

thorehusfeldt commented 4 months ago

A light-weight way of addressing the implicit association between testgroup scores and lexicographic ordering is

scoring?: [...int] | { [#testgroup]: int }

This allows

scoring: [30,50,10,20]
---
scoring: # the same, just different YAML syntax
  - 30
  - 50
  - 10
  - 20
--
scoring: # specify by testgroup name
  group1: 30
  group2: 50
  group3: 10
  group4: 20
--
scoring: # names can be anything
  small: 30
  connected: 50
  greedy: 10
  full: 20

Even this richer syntax is a breeze to specify, read, edit, and verify. (There is a tiny issue with the auto-numbering of testgroups, which can be solved by implicit globbing patterns or postfix matches. I wouldn't go there, though.)

simonlindholm commented 4 months ago

Test groups having arbitrary names and then getting sorted in lexicographical order is something that has bitten us in the past. E.g., I've seen problem authors specify in the problem statement that there are two groups "n <= 3000" and "n <= 1e5", name them "quadratic" and "full", and then have them show up in the wrong order in the Kattis UI. Or have ten groups named group1, ..., group10. The "names can be anything" case in the comment above has this bug also.

This doesn't necessarily have to impact this feature, but it's worth keeping in mind. E.g., if one were to specify that groups had to be named numerically in an attempt to disarm that footgun, that would make the list syntax more attractive.

niemela commented 4 months ago

I've seen problem authors specify in the problem statement that there are two groups "n <= 3000" and "n <= 1e5", name them "quadratic" and "full", and then have them show up in the wrong order in the Kattis UI.

How did they imagine that the system would know the "correct" order, since it's not explicitly specified (in any way other than the lexicographical order)? I mean, as a human it's "obvious" that you want "full" last, but... surely that's not what they expected the system to realize?

I see two main ways around this "footgun":

  1. Have some other explicit (and mandatory) way to specify the order of test cases. A simple way to do this would be to make the scoring key in testdata.yaml that @thorehusfeldt suggested a sequence. Leaving a test case out of the sequence could be either an error or a warning, but if it's not listed it would not be included. An annoying detail is that this would now also be needed for pass-fail problems (so scoring wouldn't be a great name anymore).
  2. Make the (strongly?) suggested naming scheme mandatory. I.e., the name of every test file must be <N>-<name>, where <N> is a zero-padded number that is the same length for, and different between, all test cases in the same directory, and <name> is different for all test cases in the same directory. Then we can (strongly?) suggest that a UI actually omit the <N>- prefix from the test case name when showing it.

Either of these options would remove the footgun, because if you make incorrect assumptions about how it works, the validation will give you a warning or an error. Option 1 feels "cleaner", but option 2 would be less work. Basically, if you already follow the current recommendation (i.e. the "best practice") you don't have to do anything extra; it's just that you will no longer be allowed to name your test cases "quadratic" and "full", they must be (e.g.) "1-quadratic" and "2-full".
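For example (a purely hypothetical layout), option 2 would make a problem's groups and test files look like:

data/secret/1-quadratic/01-small.in
data/secret/1-quadratic/02-random.in
data/secret/2-full/01-huge.in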

This is what option 1 would look like (let's rename the key to "testcases", but feel free to improve on this):

# Scoring problems
testcases:
  - simple: 20
  - quadratic: 30
  - full: 50
---
# Pass-fail problems
testcases:
  - simple
  - quadratic
  - full

@simonlindholm What do you think?

thorehusfeldt commented 4 months ago

For completeness, let me also mention that I expected most authors to continue to use some kind of automated testcase generation framework, which will also set "subtask scores" (whether it is testdata_tools or the generators specification).

For instance, the file generators/generator.yaml would look like this

#...
data:
  secret:
    data:
      - small: # The name of the testgroup, which is a subtask in the IOI tradition; it will become auto-numbered to 1-small
          testdata.yaml:
            scoring:
              score: 20 # Here is where the subtask score will actually be set
          data:
            sm-tree: generate_tree -n 10
            sm-empty:
              in: 0 0
            # and so on

niemela commented 4 months ago

For completeness, let me also mention that I expected most authors to continue to use some kind of automated testcase generation framework

I'm fairly certain that "most authors" don't use such a framework. Maybe "most problems" are made with them, but I don't think even that's quite true.

In any case, I don't think we should make it too hard to use the spec without frameworks, and whatever we choose (within reason) will be easy with a framework.

simonlindholm commented 4 months ago

How did they imagine that the system would know the "correct" order, since it's not explicitly specified (in any way other than the lexicographical order)? I mean, as a human it's "obvious" that you want "full" last, but... surely that's not what they expected the system to realize?

I think they used testdata_tools, so in those cases it was definitely on us for not making it auto-number test groups to get rid of the footgun.

I don't really have an opinion on how to resolve this; I'm just bringing it up so we don't accidentally enshrine a footgun in the spec.

niemela commented 4 months ago

I think they used testdata_tools, so in those cases it was definitely on us for not making it auto-number test groups to get rid of the footgun.

Oh, ok, that actually makes a lot of sense. How do you specify the order of test cases in testdata_tools?

simonlindholm commented 4 months ago

For test cases, top to bottom in generator.sh. For test groups, conventionally also top to bottom and always named group1, group2, ...

niemela commented 2 weeks ago

I think this is moot now. We didn't undo it exactly, but we redid it in a way pretty much in line with the discussions in this thread.

Closing.