Kattis / problem-package-format

Kattis problem package format specification
https://www.kattis.com/problem-package-format

Expectations 0.6: Framework for specifying expectations in terms of timings and verdicts #135

Closed thorehusfeldt closed 9 months ago

thorehusfeldt commented 9 months ago

This is version 0.6 of the expectations proposal.

Goals

Primary Design Goals

  1. Precisely specify the meaning of “verdict” and “expectation”

  2. Define the semantics for setting time limit within “safety” margins from judge submissions

  3. Provide clear semantics for expectations of author submissions placed in subdirectories of /submissions/*

Secondary Design Goals

  1. Allow bespoke expectation profiles beyond "accepted", "wrong answer", etc.

  2. Expectations for judge messages (“on some testcase, the submission produces guess out of bounds”) so that we can specify full test coverage for custom output validators

  3. Expectations for scores (”this submission gets at most 65 points on all testcases”)

  4. Allow expectations to be specified for test groups like sample or secret/group1 or even secret/03-petersen_graph

  5. Provide a schema for expectations specified in a separate expectations.yaml file

Constraints

  1. Stay close to existing terminology and traditions

Out of scope:

Givens

Relevant metadata is given in problem.yaml specified as follows

#problem {
    type: "pass-fail" | "scoring"
    limits?: {
        time_limit: number
        time_multipliers: {
            ac_to_time_limit: number
            time_limit_to_tle: number
        }   
        ...
    }
    ...
}

For readability, the present document introduces aliases to the metadata specified in #problem:

let time_limit = problem.limits.time_limit
let ac_margin = problem.limits.time_multipliers.ac_to_time_limit
let tle_margin = problem.limits.time_multipliers.time_limit_to_tle
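
For concreteness, a minimal sketch of how a tool might read these values (Python with PyYAML is assumed here; the variable names merely mirror the aliases above and are not part of the spec):

```python
# Illustrative only; assumes problem.yaml matches the #problem shape above.
import yaml  # PyYAML

with open("problem.yaml") as f:
    problem = yaml.safe_load(f)

limits = problem["limits"]
time_limit = limits["time_limit"]
ac_margin = limits["time_multipliers"]["ac_to_time_limit"]
tle_margin = limits["time_multipliers"]["time_limit_to_tle"]
```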

Proposal: Verdicts and timings

Result

For precision, we define the result of running a submission on a testcase for a single-pass problem.

// The result of running a submission on a testcase
#result: {
    time!: >=0
    if time <= time_limit {
        // When the submission terminates within the time limit
        // its exit status is either a success or a failure:
        terminated_successfully!: bool
        if terminated_successfully {
            // If the submission terminated successfully, its output is
            // validated by the output validator; the resulting values are:
            output_validated!: bool
            message?:          string
            score?:            >=0
        }
    }
}
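
As a non-normative illustration, the same shape expressed as a Python dataclass (None stands in for absent optional fields; this is a sketch, not part of the proposal):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Result:
    """The result of running a submission on one testcase."""
    time: float                                      # always present, >= 0
    terminated_successfully: Optional[bool] = None   # only meaningful if time <= time_limit
    output_validated: Optional[bool] = None          # only set if terminated_successfully
    message: Optional[str] = None                    # judge message, if any
    score: Optional[float] = None                    # nonnegative, if present
```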

Decisions made here (this is RfC2):

  1. We use here the term “result” (not “outcome” or “verdict”).

  2. time! is not optional. Every result has a time, even those that are aborted or crashed. We promise that time is never compared to anything above tle_margin * time_limit, so any value above that is the same.

  3. Careful distinction between successful termination (of the process) and successful output validation; note the required field output_validated!: every submission that terminated_successfully has its output validated. Note that this means “the output validator accepted” (it returned 42).

  4. Scores are nonnegative

  5. The judge message is called message. Alternative: judge_message.

  6. Nothing else is part of the result. (Memory use, team message, programming language.) Maybe it should.

  7. Is this good enough for multipass problems as well, including interactive problems? (Honest question.) What needs to be understood is which execution times are cumulative, for instance. This is out of scope for me, but worth thinking about.

Verdict

The #verdict is a shorthand for the #result. There are four verdicts:

#verdict: "AC" |  "WA" | "RTE" | "TLE"

What verdicts mean

We can specify the semantics of each #verdict quite readably in CUE:

#verdict_for_result: {
    // The verdict for the _result
    _result: #result
    if _result.time >= time_limit {"TLE"}
    if _result.time < time_limit {
        if !_result.terminated_successfully {"RTE"}
        if _result.terminated_successfully {
            if !_result.output_validated {"WA"}
            if _result.output_validated {"AC"}
        }
    }
}
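
The same mapping as an executable sketch (Python, taking a Result as in the earlier sketch; illustrative only):

```python
def verdict_for_result(r, time_limit):
    """Discretise a result into one of the four verdicts, mirroring the CUE above."""
    if r.time >= time_limit:
        return "TLE"
    if not r.terminated_successfully:
        return "RTE"
    return "AC" if r.output_validated else "WA"
```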

Timing

Like #verdict, the #timing is a discrete summary of the result:

#timing: "fast enough with margin" | "fast enough" | "too slow" | "too slow with margin"

This discretises the value of result.time in terms of the three boundaries provided by problem.limits in the obvious fashion. We can specify this precisely in CUE to be clear about which inequalities are strict:

#timing_for_result: {
    // The timing for the _result
    _result: #result
    let t = _result.time
    if t >= time_limit {
        if t >= time_limit * tle_margin {"too slow with margin"}
        if t < time_limit * tle_margin {"too slow"}
    }
    if t < time_limit {
        if t >= time_limit / ac_margin {"fast enough"}
        if t < time_limit / ac_margin {"fast enough with margin"}
    }
}
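
And the corresponding sketch for timings (again Python and illustrative only; note which comparisons are strict):

```python
def timing_for_result(r, time_limit, ac_margin, tle_margin):
    """Discretise a result's running time, mirroring the CUE above."""
    t = r.time
    if t >= time_limit:
        return "too slow with margin" if t >= time_limit * tle_margin else "too slow"
    return "fast enough" if t >= time_limit / ac_margin else "fast enough with margin"
```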

Proposal: Expectations

The role of expectations is to

  1. ensure, during problem development, the expected behaviour of author submissions on testdata as the problem definition, testdata, and author submissions are changed or added

  2. define which author submissions are used to set the time limit

Expectations defined

An expectation (for a submission) is defined for a set of results, primarily in terms of verdicts, timings, messages, and score ranges:

#expectation: {
    permitted_verdicts?: [...#verdict] // only these verdicts may appear
    required_verdicts?: [...#verdict]  // at least one of these verdicts must appear
    permitted_timings?: [...#timing]   // only these timings may appear
    required_timings?: [...#timing]    // at least one of these timings must appear
    message?:                string // this judgemessage must appear
    score?:                  #range
}

#range: number | ordered_tuple
ordered_tuple: tuple=[number, number & >=tuple[0]]

A set of results satisfies an expectation if

  1. for each result, the result's verdict/timing is contained in permitted_verdicts/timings

  2. at least one of the verdicts/timings in required_verdicts/timings appears among the verdicts/timings of the set of results

  3. the message appears (as a substring) in result.message for at least one result in the set

  4. every result.score lies inside the range given by expectation.score (a sketch of this check in code follows below)
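
A sketch of this satisfaction check (Python; the expectation is taken as a dict with the optional keys above, verdict_of/timing_of are the discretisations sketched earlier, and how to treat results without a score is left open — assumed here to be "skip"):

```python
def satisfies(results, expectation, verdict_of, timing_of):
    """Does this set of results satisfy one expectation?"""
    verdicts = [verdict_of(r) for r in results]
    timings = [timing_of(r) for r in results]

    if "permitted_verdicts" in expectation and \
            not all(v in expectation["permitted_verdicts"] for v in verdicts):
        return False
    if "required_verdicts" in expectation and \
            not any(v in expectation["required_verdicts"] for v in verdicts):
        return False
    if "permitted_timings" in expectation and \
            not all(t in expectation["permitted_timings"] for t in timings):
        return False
    if "required_timings" in expectation and \
            not any(t in expectation["required_timings"] for t in timings):
        return False
    if "message" in expectation and \
            not any(r.message and expectation["message"] in r.message for r in results):
        return False
    if "score" in expectation:
        s = expectation["score"]
        low, high = (s, s) if isinstance(s, (int, float)) else s
        if not all(low <= r.score <= high for r in results if r.score is not None):
            return False
    return True
```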

Common abbreviations

Typically a problem author will not use the fine-grained #expectation struct defined above, but instead use a common abbreviation:

#abbreviation: "accepted" | "wrong answer" | "runtime exception" | "time limit exceeded" | "does not terminate" | "not accepted"

_expectation_for_abbreviation: {
    _abbreviation: #abbreviation
    if _abbreviation == "accepted" {
        permitted_verdicts: ["AC"]
        permitted_timings: ["fast enough with margin"]
    }
    if _abbreviation == "wrong answer" {
        permitted_verdicts: ["AC", "WA"]
        required_verdicts: ["WA"]
    }
    if _abbreviation == "runtime exception" {
        permitted_verdicts: ["AC", "RTE"]
        required_verdicts: ["RTE"]
    }
    if _abbreviation == "time limit exceeded" {
        permitted_verdicts: ["AC", "TLE"]
        required_timings: ["too slow with margin"]
    }
}

The above definitions are meant to define the expectations for submissions in /submissions/accepted etc. Here are two additional, useful abbreviations:

    if _abbreviation == "does not terminate" {
        permitted_verdicts: ["AC", "RTE", "TLE"] 
        required_verdicts: ["RTE", "TLE"]
    }
    if _abbreviation == "not accepted" {
        required_verdicts: ["RTE", "TLE", "WA"]
    }
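
Written out as plain data, the six abbreviations correspond to the following expectations (a direct, non-normative transcription of the CUE above):

```python
ABBREVIATIONS = {
    "accepted": {
        "permitted_verdicts": ["AC"],
        "permitted_timings": ["fast enough with margin"],
    },
    "wrong answer": {
        "permitted_verdicts": ["AC", "WA"],
        "required_verdicts": ["WA"],
    },
    "runtime exception": {
        "permitted_verdicts": ["AC", "RTE"],
        "required_verdicts": ["RTE"],
    },
    "time limit exceeded": {
        "permitted_verdicts": ["AC", "TLE"],
        "required_timings": ["too slow with margin"],
    },
    "does not terminate": {
        "permitted_verdicts": ["AC", "RTE", "TLE"],
        "required_verdicts": ["RTE", "TLE"],
    },
    "not accepted": {
        "required_verdicts": ["RTE", "TLE", "WA"],
    },
}
```

A checker could then, for instance, test satisfies(results, ABBREVIATIONS["wrong answer"], ...) for every submission under /submissions/wrong_answer.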
niemela commented 9 months ago

@thorehusfeldt Could you elaborate on and give examples of the syntax of the .yaml file where these are specified?

This seems to only talk about the "expectations" and not how, or (at least not very explicitly) to what, they are applied.

Specifically:

niemela commented 9 months ago
#expectation: {
  permitted_verdicts?: [...#verdict] // only these verdicts may appear
  required_verdicts?: [...#verdict]  // at least one of these verdicts must appear
  permitted_timings?: [...#timing]   // only these timings may appear
  required_timings?: [...#timing]    // at least one of these timings must appear

This is treating the time check exactly like the verdict. I.e. literally splitting out the old (v0.2?) verdicts into new (v0.3) verdict, timing pairs. E.g. AC- is now AC, fast enough and TLE is TLE, too slow with margin.

I don't think this is the right approach (and, I would claim that this is the same approach as the v0.2, since it's just a trivial split). I don't think we should be setting explicit requirements on the timing states in this fashion. Could you give some examples of where this would be either needed, or at least very useful?

RagnarGrootKoerkamp commented 9 months ago

We were also talking about this regarding DOMjudge specifically. @meisterT wrote his thoughts after our discussion at: https://github.com/DOMjudge/domjudge/issues/2266

Interestingly, he also comes up with this idea of new verdicts for when submissions are inside the safety_margin.

RagnarGrootKoerkamp commented 9 months ago

There are quite a few things going on here.

We want to specify:

  1. What do the existing submissions/{accepted,wrong_answer,time_limit_exceeded,run_time_error} directories mean exactly in the current spec? Do implementations agree?
  2. What do we want these directories to mean, if different? Are we OK breaking backwards compatibility?
  3. What new 'default' directories do we want to specify?
  4. What is the YAML syntax the end-user sees in submissions/expectations.yaml?

Let me try to go over this.

1. What do submissions/* mean in the current situation

**Spec** The spec writes on example submissions and problem timing.

| directory | permitted results | required results | used for timing |
| --- | --- | --- | --- |
| accepted | AC | AC | yes |
| wrong_answer | AC, WA | WA | no |
| time_limit_exceeded | AC, WA, TLE | TLE | yes |
| run_time_error | AC, WA, TLE, RTE | RTE | no |

Here is the same table using Thore's AC-/TLE- notation:

| directory | permitted | required |
| --- | --- | --- |
| accepted | AC, ~AC-~ | AC (n.b. not AC-) |
| wrong_answer | AC, AC-, WA | WA |
| time_limit_exceeded | AC, AC-, WA, TLE-, TLE | TLE (n.b. not TLE-) |
| run_time_error | AC, AC-, WA, TLE-, TLE, RTE | RTE |

BAPCtools

Problemtools TODO

DOMjudge

2. What do we want submissions/* to mean?

Observations:

Ragnar prefers being more strict:

Make time_limit_exceeded and run_time_error not allow other results, i.e.:

| directory | permitted results | required results | used for timing |
| --- | --- | --- | --- |
| time_limit_exceeded | AC, ~WA~, TLE | TLE | yes |
| run_time_error | AC, ~WA~, ~TLE~, RTE | RTE | no |

Should wrong_answer/ and run_time_error/ be used for time limits?

(Equivalent: Is it OK for wrong_answer and run_time_error submissions to be inside the safety margin?)

For simplicity Ragnar prefers the current status of not using them for time limits.

Let's say we would use wrong_answer/* and run_time_error/* for time limit determination; would we then take the maximum over passing testcases, or over all testcases? I would say we should include all cases. (Note that I think this would require additional WA- and RTE- verdicts to specify?)

3. What new default directories do we want?

Thore already suggested does_not_terminate and not_accepted (what we in NWERC call rejected so far). We also use mixed, which is for submissions that may or may not time out or crash (but do not WA).

| directory | permitted results | required results | used for timing |
| --- | --- | --- | --- |
| does_not_terminate | AC, TLE, RTE | TLE, RTE | no |
| not_accepted (rejected?) | AC, WA, TLE, RTE | WA, TLE, RTE | no |
| mixed | AC, TLE, RTE | AC, TLE, RTE | no |

4. YAML Syntax

(Q1) question: do we allow regex and/or glob in the submission path and testcase path? The below assumes glob only.

**used_for_timing** My proposal is to add a used_for_timing: bool field. Not sure whether true or false should be the default. I used false as the default for now.

(Q2) question: What do you think of used_for_timing: bool? Is it clear and intuitive enough? Is it expressive enough? E.g. it does not allow setting a lower multiplier for some unreasonably slow (but still should-be-accepted) submissions.

**Example YAML** This is only to show syntax. Specific values can be discussed.

The implied default submissions/expectations.yaml is something like:

# Or just `accepted`? or `accepted/`?
accepted/*:
  permitted: ["AC"]
  required: []
  used_for_timing: true
wrong_answer/*:
  permitted: ["AC", "WA"]
  required: ["WA"]
time_limit_exceeded/*:
  permitted: ["AC", "TLE"]
  required: ["TLE"]
  used_for_timing: true
run_time_error/*:
  permitted: ["AC", "RTE"]
  required: ["RTE"]
does_not_terminate/*:
  permitted: ["AC", "TLE", "RTE"]
  required: ["TLE", "RTE"]
rejected/*:
  permitted: ["AC", "WA", "TLE", "RTE"] # NOTE: This could be omitted.
  required: ["WA", "TLE", "RTE"]
mixed/*:
  permitted: ["AC", "TLE", "RTE"]
  required: ["AC", "TLE", "RTE"]

or using abbreviations the below would be equivalent:

accepted/*: accepted
wrong_answer/*: wrong answer
time_limit_exceeded/*: time limit exceeded
run_time_error/*: runtime exception # TODO: Naming of this
does_not_terminate/*: does not terminate
rejected/*: rejected
mixed/*: mixed

Further it is possible to specify verdicts for testgroups and make assertions on the score of a submission and a judgemessage that must appear for at least one testcase:

# AC that is not used for timing purposes and may be close to the timelimit.
sluggish_ac/*:
  permitted: ["AC"]
  required: []
  # used_for_timing: false # (default value)

wrong_answer/caught_by_samples.py:
  sample:
    required: ["WA", "TLE", "RTE"]
# Or equivalently:
wrong_answer/caught_by_samples.py:
  sample: rejected
# Or more precisely
wrong_answer/wa_on_samples.py:
  sample: wrong answer
# Note that this also still checks that the submission is WA (and not RTE/TLE) on all other cases because of its directory.

# Maybe we want to also have 'hidden' edge cases only exposed in secret data:
wrong_answer/wa_on_secret.py:
  sample: accepted

# TODO: Exclude a single submission from time limit determination:
# (How do we do overriding/inheriting when multiple globs/regexes match a submission? )
accepted/kinda_slow.kt:
  permitted: ["AC"]
  required: []
  used_for_timing: false
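
Regarding Q1, a minimal sketch of glob-only matching of the keys above against submission paths (Python's fnmatch; note that its * also matches /, unlike a shell glob, so the exact dialect would still need to be pinned down):

```python
from fnmatch import fnmatch

def matching_rules(expectations, submission):
    """All entries of expectations.yaml whose glob key matches a submission
    path given relative to /submissions, e.g. "wrong_answer/thore.cpp"."""
    return [rule for pattern, rule in expectations.items()
            if fnmatch(submission, pattern)]
```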
thorehusfeldt commented 9 months ago

YAML integration for @niemela, even though this is not the current focus. (We’re still at what to specify, not how to specify it.)

The simplest YAML file you could have would be one that merely specifies the expected behaviour of submissions under /submissions. It would look like this:

accepted: accepted
wrong_answer: wrong answer
time_limit_exceeded: time limit exceeded
runtime_exception: runtime exception

One could say that this expectations.yaml file

  1. has to be provided
  2. is assumed as default if no submissions/expectations.yaml file is present
  3. can be overridden/extended etc. using all kinds of rules to be clarified later. I have no opinion on what’s best in such situations.

In particular, a problem development system can completely ignore all YAML, as long as it treats the submissions in /submissions as specified above, because those expectations are part of the spec. Of course, an easy way to implement a problem development system that is consistent with the specification would just be to implement the logic of the whole expectations framework as specified above; this gives compatibility for free and full generality.

Aside from pinning down the spec, the definitions above also allow bespoke directories such as the following for “all the sluggish java submissions”

other_ac:
    permitted: ["AC"]

or more fine-grained specifications, such as Thore promising that his WA greedy submission does pass on the samples:

wrong_answer/thore.cpp:
    sample:
        permitted_verdicts: ["AC"]

You can go arbitrarily crazy with this and specify the behaviour of individual submissions on individual testcases (or groups):

other_ac/Thore.java:
   secret/huge/04-connected-random:
      score: [36]

The matching rules (do we prefix-match, glob, regex, wildcard) are currently in flux and downstream from us finally agreeing on how to specify #expectations.

thorehusfeldt commented 9 months ago

@RagnarGrootKoerkamp : thanks for the table. I think the second table should not have AC- under permitted. The meaning of accepted is that all verdicts are AC (fast enough with margin).

thorehusfeldt commented 9 months ago

@RagnarGrootKoerkamp : I played around with used_for_timing or something like that myself (apply_with_margins). I am worried about the non-monotone/asymmetric aspect, so to me it feels as if we invite a can of hurt downstream by identifying

  1. all timings must satisfy the timing constraint with a margin
  2. some timing must satisfy another timing constraint with a different margin

as a single concept. If you’re kinda confident that no terrible things will happen, I can live with the single boolean field.

niemela commented 9 months ago

2. What do we want these directories to mean, if different? Are we OK breaking backwards compatibility?

Yes. But we should only do it if it has a significant benefit.

  • I'm ignoring the edge case of zero testcases for accepted.

What is that edge case? Zero accepted is not allowed. (Or is that the edge case?)

  • DOMjudge does not work well with 'mixed' results (i.e. multiple non-AC results).
  • Because of this, BAPC/NWERC jury use submissions/rejected when more than one kind of non-AC result may occur.

Another reason to use this is that you just don't care how it fails, only that it fails. I think it makes sense to have such a directory.

Make time_limit_exceeded and run_time_error not allow other results

I think I agree. Especially if we also have rejected.

Thore already suggested does_not_terminate and not_accepted (what we in NWERC call rejected so far). We also use mixed, which is for submissions that may or may not time out or crash (but do not WA).

If I understand the canonical use of does_not_terminate correctly (i.e. "this is a brute force solution that is always correct, but often too slow") then I think the name is very bad. The defining property is not that it does not terminate, it's that if it finishes before we kill it or it crashes, it is correct.

The CLICS name for not_accepted is rejected. When we have established terminology I think we should use it.

I don't love the name mixed either. Why is it needed? Why would not rejected be enough?

(Q1) question: do we allow regex and/or glob in the submission path and testcase path? The below assumes glob only.

Glob only sounds good to me. Is there an exact definition/specification that we could use? I couldn't find one immediately.

(Q2) question: What do you think of used_for_timing: bool? Is it clear and intuitive enough? Is it expressive enough? E.g. it does not allow setting a lower multiplier for some unreasonably slow (but still should-be-accepted) submissions.

I think that used_for_timing covers all that we need in practice. In theory it would be better to instead be allowed to optionally override the ac_margin or tle_margin on a per-submission basis. I'm torn.

I'm not really sure if it makes sense to have used_for_timing on non-AC or TLE verdicts.

or using abbreviations the below would be equivalent:

I'm not convinced that the abbreviations are needed.

# TODO: Exclude a single submission from time limit determination:
# (How do we do overriding/inheriting when multiple globs/regexes match a submission? )
accepted/kinda_slow.kt:
  permitted: ["AC"]
  required: []
  used_for_timing: false

Up until we introduced used_for_timing my thinking was that every permitted/required/score creates a rule, and rules can't be removed. E.g. this would always fail:

accepted/*:
  permitted: ["WA"]

Because permitted: ["AC"] is already a rule for accepted, and now it's required that all verdicts must be in ["AC"] but also in ["WA"].

With used_for_timing that would not be useful. That setting would have to override, and I would suggest it does so in yaml-file order. I.e. rules are applied from top to bottom.

This means that the example a bit above could be written as:

accepted/kinda_slow.kt:
   used_for_timing: false

The simplest YAML file you could have would be one that merely specifies the expected behaviour of submission under /submissions. It would look like this:

The simplest file would be the empty file, which would then not have to exist.

I would suggest that the semantics should be that the default directory behavior is assumed to be added to the top of the file. I.e. the accepted/kinda_slow.kt example just above would be a complete file that has all the default behavior except that accepted/kinda_slow.kt is not used for timing.

niemela commented 9 months ago

I am worried about the non-monotone/asymmetric aspect, so to me it feels as if we invite a can of hurt downstream by identifying

1. _all_ timings must satisfy the timing constraint _with a margin_

2. _some_ timing must satisfy _another_ timing constraint with a _different_ margin

as a single concept. If you’re kinda confident that no terrible things will happen, I can live with the single boolean field.

@thorehusfeldt I don't understand what you're saying here, so I can't be confident that it won't happen. Could you explain more?

Tagl commented 9 months ago
  • I'm ignoring the edge case of zero testcases for accepted.

What is that edge case? Zero accepted is not allowed. (Or is that the edge case?)

As I understand it, the edge case where there are no testcases within a test group.

niemela commented 9 months ago
  • I'm ignoring the edge case of zero testcases for accepted.

What is that edge case? Zero accepted is not allowed. (Or is that the edge case?)

As I understand it, the edge case where there are no testcases within a test group.

Ah, I realize now that I read "submission" and not "testcases". This makes much more sense. Thanks :).

That said... is that an edge case? Every "permitted" rule will always succeed on an empty set (and every non-empty required rule will always fail). So any accepted submission will always succeed on the empty set of testcases. That feels right, right? (Useless of course, but right).

thorehusfeldt commented 9 months ago

Great progress. Let me try to pin down used_for_timing precisely.

First, let me change the name to with_margins. This is preferable because it indicates a constraint rather than a procedure. (It is also slightly less open to misunderstandings. All submissions have “something to do with timing”, so used for timing means optionally_used_for_determining_the_time_limit.)

In particular, there could be traditions where the time_limit is set explicitly in problem.yaml and then the role of with_margins is to check that all the AC-submissions pass their timing constraint with margin. (Instead of setting “the timing.”)

With that out of the way (recall that an expectation is something a submission has on a set of #results; call this set R):

The expectation #expectation.with_margin means that

  1. every result in R with verdict AC satisfies the stricter constraint result.time < time_limit / ac_margin.
  2. if R contains a result with verdict TLE then some result in R with verdict TLE satisfies the stricter constraint result.time >= time_limit * tle_margin.

This is the best I can do; pay particular attention to the if-part of item 2. (If we don’t have that, all hell breaks loose, as far as I can tell.) Please improve.
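
A sketch of that definition (Python; verdicts computed per result as in the earlier sketches, and note the asymmetry: a for-all condition on AC results, an exists condition on TLE results, guarded by the if-part):

```python
def satisfies_with_margin(results, verdict_of, time_limit, ac_margin, tle_margin):
    """Thore's proposed meaning of with_margin for a set of results R."""
    ac_ok = all(r.time < time_limit / ac_margin
                for r in results if verdict_of(r) == "AC")
    tle_results = [r for r in results if verdict_of(r) == "TLE"]
    tle_ok = not tle_results or any(r.time >= time_limit * tle_margin
                                    for r in tle_results)
    return ac_ok and tle_ok
```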

The alternative is to separate the concepts “all AC must run with margin” and “there is at least one TLE that is rejected with margin”, as in version 0.3 above (or simplifications.)

thorehusfeldt commented 9 months ago

(Useless of course, but right).

These situations happen all the time, for instance when you run the -d flag in problemtools. That allows you to restrict data to, say, “every in-file that contains the substring huge”, and such restrictions routinely lead to empty testgroups during development. (For instance, a -d-filtered sample is often empty.) problemtools crashes in such situations, as reported in this issue: https://github.com/Kattis/problemtools/issues/189 (and subsequently fixed).

niemela commented 9 months ago

The expectation #expectation.with_margin means that

  1. every result in R with verdict AC satisfies the stricter constraint result.time < time_limit / ac_margin.
  2. if R contains a result with verdict TLE then some result in R with verdict TLE satisfies the stricter constraint result.time >= time_limit * tle_margin.

I don't think the definition should be over whether R contains AC, but rather whether there is a rule containing AC. Specifically, a submission in time_limit_exceeded would probably have some AC verdicts (the samples if nothing else), and with this definition all such verdicts are required to satisfy the stricter lower constraint. That is not what we want.

niemela commented 9 months ago

(Useless of course, but right).

These situations happen all the time, for instance when you run the -d flag in problemtools. That allows you to restrict data to, say, “every in-file that contains the substring huge”, and such restrictions routinely lead to empty testgroups during development. (For instance, a -d-filtered sample is often empty.) problemtools crashes in such situations, as reported in this issue: Kattis/problemtools#189 (and subsequently fixed).

I think this means that you agree that it is correct?

thorehusfeldt commented 9 months ago

with this definition all such verdicts are required to satisfy the stricter lower constraint. That is not what we want.

You correctly observe that a symmetric concept like “we want to use this for time-limit-y stuff” is ill suited to express two intentions (“we want AC to mean this-and-that for all AC results” and “we want TLE to mean this-and-that for some TLE results”). Hence my squeamishness in adopting this idea.

However, I try to be constructive and help by at least attempting to write down what used_for_timing could mean. It would be more helpful if one of you (who reject my proposals for more fine-grained AC-verdicts, or for #timing) would clearly specify the intended semantics.

RagnarGrootKoerkamp commented 9 months ago

As I understand it, the edge case where there are no testcases within a test group.

For accepted submissions, we say permitted: ["AC"], but if you also put required: ["AC"] that fails when there are 0 testcases. So we should probably stick with required: [] for this case.

Fredrik: I don't love the name mixed either. Why is it needed? Why would not rejected be enough?

rejected is for submissions that must fail in some way.

mixed is for submissions that may or may not fail, e.g. depending on hardware speed. I think we also occasionally use this currently for submissions that AC in practice, but should not be used for timelimit determination.

Fredrik: Q1: Glob only sounds good to me.

That's also my weak preference.

Fredrik: Q2: I think that used_for_timing covers all what we need in practice.

Yes, this is also my feeling.

Fredrik: With used_for_timing that (appending rules) would not be useful. That setting would have to override, and I would suggest it does so in yaml-file order. I.e. rules are applied from top to bottom.

I agree this is nice, but sadly YAML dictionaries are explicitly unordered :/ We'll have to think about this more.


Thore: First, let me change the name to with_margins.

I think that works for me. Give me a day to get used to it :)

Thore: It would be more helpful if one of you would clearly specify the intended semantics.

Ok so the way I initially think about with_margins: true is that the maximum runtime over all current testcases (i.e. with any result) must not be in [time_limit / ac_to_time_limit, time_limit * time_limit_to_tle). (Current is explained later.)
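
In code, that reading is roughly the following (a sketch; the interval is half-open exactly as written above):

```python
def max_runtime_outside_margins(times, time_limit, ac_margin, tle_margin):
    """with_margins: true == the maximum runtime over the scoped testcases must
    not fall inside [time_limit / ac_margin, time_limit * tle_margin)."""
    t_max = max(times)
    return not (time_limit / ac_margin <= t_max < time_limit * tle_margin)
```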

Note:

Examples:

Let's think about how this interacts with:

  1. Specifying requirements for a subset of testcases (e.g. a single testgroup).
  2. Running only on a subset of testcases.

1. with_margins for testgroups:

Take for example:

partially_accepted/ragnar.py:
  sample:
    permitted: [`AC`]
    with_margins: true
  secret:
    required: [`TLE`]
    with_margins: true

This should (to me at least) mean that on the samples the max running time is below the safety window, and on the secret cases the max running time is above the safety window.

So what we really want is that wherever with_margins: true appears, the max running time over the testcases to which the with_margins: true is scoped is outside the safety margins.

2. Running a subset of cases:

(This doesn't have to be part of the spec, which only talks about running/validating the full set of testcases. But still it may be useful to think about how we would handle this case.)

Suppose we run only a subset T of testcases.

niemela commented 9 months ago

As I understand it, the edge case where there are no testcases within a test group.

For accepted submissions, we say permitted: ["AC"], but if you also put required: ["AC"] that fails when there are 0 testcases. So we should probably stick with required: [] for this case.

Agreed. (But also, leaving out required should mean required: [] just like leaving out permitted should mean permitted: [<all of them>])

mixed is for submissions that may or may not fail, e.g. depending on hardware speed. I think we also occasionally use this currently for submissions that AC in practice, but should not be used for timelimit determination.

Right, you defined it as:

mixed/*:
  permitted: ["AC", "TLE", "RTE"]
  required: ["AC", "TLE", "RTE"]

This seems to me to be the same use case (but ever so slightly different definition) as what @thorehusfeldt calls does_not_terminate, and what I would like to call something like correct_but_slow (name is certainly WIP).

I agree this is nice, but sadly YAML dictionaries are explicitly unordered :/ We'll have to think about this more.

Right, I knew that... :(. That's very annoying though.

  • This only makes a statement on the maximum runtime over all testcases. It says nothing on a per-testcase/per-result basis.

FTR, I think this is the right approach for used_for_timing.

  1. with_margins for testgroups:

Do we actually need to allow used_for_timing a.k.a. with_margins on testgroups as well? Do you have some concrete examples for when that granularity would be very useful?

RagnarGrootKoerkamp commented 9 months ago

This [mixed] seems to me to be the same use case as does_not_terminate.

They are quite similar indeed. The only distinction is that mixed has required: [AC,TLE,RTE], so the submission may also end up being AC everywhere, whereas does_not_terminate has required: [TLE,RTE], meaning it must fail.

But I agree that mixed isn't very specific and could possibly be specified by the user on a case-by-case basis.

does_not_terminate, and what I would like to call something like correct_but_slow

yes, I was thinking of something similar.

with_margins on testgroups

If you have 3 testgroups, A, B, and C, I can imagine some submission that should pass A (with margin), fail B (with margin), and fail C (also with margin).

The running time could explode on C, but you would still want that on B it also fails with margin.

I'm not sure if this 100% makes sense with scoring problems though. There you may want to ensure that the score for group B is 0, and that it times out for each testcase in group B? One way we could do that is to distinguish secret/group_b (which applies to the entire group) from secret/group_b/* (which applies individually to each testcase in the group). Surely this could work but I'm not sure it's a good idea.

In the AC-, TLE- proposal you could do

submission.py:
  secret/group_b:
      permitted: [TLE] # Note: not TLE-

whereas now this becomes

submission.py:
  secret/group_b/*:
      permitted: [TLE]
      required: [TLE]
      with_margins: true
thorehusfeldt commented 9 months ago

Here is a case where with_margins is useful for testgroups. Think of an IOI-style question, where data/secret/group1 is what-IOI-calls-a-subtask. Then I want to specify

mixed/brute-force.py:
  sample: accepted
  secret/group1: accepted
  secret/group2: time limit exceeded

Here it is crucial that time limit exceeded means (using proposal 0.3 for clarity):

required_verdicts: ["TLE"]
required_timings: ["too slow with margin"]
niemela commented 9 months ago

How about this?

# file glob specifying some set of files. (1)
<file glob>:
   # the calculated verdict for the submission (2)
   verdict: [<verdicts>]
   # score of the submission (3)
   score: <number or range>
   # permitted and required testcase verdicts (4)
   permitted: [<verdicts>]
   required: [<verdicts>]
   # these submissions are used for timing (5) (6)
   used_for_timing: <bool>
  # testcase glob specifying some set of testcases or groups. (7)
   <testcase glob>:
      # score on the testcase or group (8)
      score: <number or range>
      # permitted and required testcase verdicts (9)
      permitted: [<verdicts>]
      required: [<verdicts>]
      # the result on this testcase or group of this submission is used for timing (10)
      used_for_timing: <bool>
   # multiple sets of testcases or groups can be specified
   <testcase glob>:
      permitted: [<verdicts>]
      required: [<verdicts>]
      score: <number or range>
      used_for_timing: <bool>

Verdicts are "AC", "WA", "RTE", "TLE".

(1): Since a program is "always represented by a single file or directory" we can't (or shouldn't?) allow submissions that are not exactly one directory level down. I.e. accepted/sol1.cc is ok, but accepted/fast/sol1.cc and sol1.cc are not.

(2): Set of verdicts. The verdict of the submission must be one of them. Note that this does not directly say anything about verdicts of testcases, and allows judging to be short circuited if that's how the problem is specified.

(3): Score for the submission. If a number, it must be the exact score given; if a range, it must be within the range (i.e. >= the lower end and <= the upper end). Note that a non-0 score implies that verdict is "AC". How should matching on score behave when verdict is not "AC"? Should the score always be 0 in that case, should it be allowed to be 0, or should it be "null"?

(4): All verdicts of testcases must be in the permitted set; some verdict of a testcase must be in the required set. Default for permitted is "AC", "WA", "RTE", "TLE". Maybe "restricted" is a better name than "permitted"? Not sure. In earlier examples we have used required: [] to mean "nothing is required", but the definition we have used so far actually means that required: [] is an automatic fail. To say "nothing is required" we would have to say required: ["AC", "WA", "RTE", "TLE"]. Right?

(5): That a submission is used for timing means that it must be possible to set a time limit such that the time for the submission (i.e. the maximum of the times for testcases) is not within any of the margins from the time limit. Note that if the time limit is explicitly set, then that is the only time limit that can be set. Defaults to false. I think this is the same as what @RagnarGrootKoerkamp said?

(6): Alternatively we could have some kind of time_limit_margin that overrides the problem wide setting (in problem.limits) for these files. Logic would then be "the same" as under (5). Setting the time_limit_margin to 0 would have the same effect as setting used_for_timing to false. We need two margins, so time_limit_margin should be a tuple?

(7): If these are just any glob, then permitted or score would technically be allowed. That said, any useful glob would have to start with sample, secret, or *. Should we require that?

(8): Score for the testcase/group(s) specified by the glob.

(9): Similar to (4), but applies separately to every testcase/group specified by the glob.

(10): Similar to (5) and (6) but applied to a single testcase/group.

Note that (5), (6) vs (10) implies that the following are different:

accepted/sol1.cc:
   used_for_timing: true

accepted/sol1.cc:
   */*:
      used_for_timing: true
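
A sketch of that difference (Python; times is the list of per-testcase running times of the submission, time_limit is assumed fixed as in (5) when the limit is set explicitly, and the margin check is the half-open window from earlier in the thread):

```python
def outside_margins(t, time_limit, ac_margin, tle_margin):
    # the time must not land in [time_limit / ac_margin, time_limit * tle_margin)
    return not (time_limit / ac_margin <= t < time_limit * tle_margin)

def submission_level_ok(times, time_limit, ac_margin, tle_margin):
    # accepted/sol1.cc: used_for_timing: true  --  one check on the overall max
    return outside_margins(max(times), time_limit, ac_margin, tle_margin)

def per_testcase_ok(times, time_limit, ac_margin, tle_margin):
    # accepted/sol1.cc: */*: used_for_timing: true  --  checked per testcase
    return all(outside_margins(t, time_limit, ac_margin, tle_margin) for t in times)
```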
RagnarGrootKoerkamp commented 9 months ago

@niemela nice writeup!

(1): I'm OK with allowing none and arbitrary nesting.

(2): I think Thore explicitly steered away from assertions on the final verdict, and given permitted and required, I think we don't need additional assertions on verdict and we can just leave this out.

(3): I'd say non-AC should simply have score = 0.

(4): Good point re. required: [] being wrong. Absence of required: (and possibly required: null) should be used instead.

(5): yes sgtm

(6): My feeling is that for now it's better to not go with explicit overridable margins yet, since one can always adjust the problem-wide margin if things are wrong. One thing that we could do using more granular margins is e.g. setting a 5x ac_to_time_limit margin for C++ and a 1.1x ac_to_time_limit margin for python. But I kinda think just setting a fixed margin over the slowest AC is flexible enough. (Having 5x over C++ I would only do if I didn't have a python submission in the first place and the large buffer is just to compensate for the lack of slow AC submissions. It's better to actually write that slow submission and then set a lower margin. If python becomes faster in the future, then the time limit goes down, which is good.)

(7) Yes it should start with sample or secret. I suppose allowing it to start with a literal * also works.

(8,9,10) Hmmmm, the grouping/globbing is quite tricky:

I think this creates some extra complexity. We should think about if we really need this.

If we do want it, it's probably better to store this alongside the existing fields rather than overloading the glob field. Thore's original distinction between required_timings and permitted_timings makes sense here: we want a "for all: with_margins" and a "there exists: with_margins". So maybe we can have with_margins: true for the max runtime of the selected cases and an each_with_margins: true to apply to each selected testcase individually. Oh, or we can even do with_margins: "each" | true.

(11): Looking at Thore's last post above, I do think abbreviations are quite nice, especially when used with testgroups. That raises the question: How to specify an abbreviation for the full submission, in combination with additional constraints on testgroups? You can't do both submission.py: wrong answer and submission.py: sample: accepted at the same time.

Maybe we need an "abbreviation" field for this?

thorehusfeldt commented 9 months ago

(11): Looking at Thore's last post above, I do think abbreviations are quite nice, especially when used with testgroups. That raises the question: How to specify an abbreviation for the full submission, in combination with additional contraints on testgroups? You can't do both submission.py: wrong answer and submission.py: sample: accepted at the same time.

You can; my proposal is supposed to exactly make this possible. For a realistic example, consider

thore.*: wrong answer
*.py:
  sample: accepted

The submission thore.py satisfies both rules, so both rules apply. In particular, the set of all testcase results must include WA, and the set of sample testcase results can only include AC. This is well defined (because all “my” rules have obvious compositionalities).

thorehusfeldt commented 9 months ago

(8,9,10) Hmmmm, the grouping/globbing is quite tricky:

Only tricky for scoring problems, and then it gets really tricky indeed. But only some of us care about scoring problems, so I’ve decided to put very little emphasis on that.

thorehusfeldt commented 9 months ago

That a submission is used for timing means that it must be possible to set a time limit such that the time for the submission (i.e. the maximum of the times for testcases) is not within any of the margins from the time limit. Note that if the time limit is explicitly set, then that is the only time limit that can be set. Defaults to false. I think this is the same as what @RagnarGrootKoerkamp said?

Very cool. Does this really work (if so: great)? Let me try to rephrase. Again, R is a set of results.

If with_margins is true then max(r.time) over all r in R is either < timelimit/ac_margin or >= timelimit*tle_margin.

Is that what we mean?

RagnarGrootKoerkamp commented 9 months ago

For a realistic example, consider

thore.*: wrong answer
*.py:
  sample: accepted

What if we want to apply both of these rules only to thore.py? In that case some kind of workaround like

thore.py: wrong answer
thore.py*:
  sample: accepted

may be needed. (But I'm happy to accept this as a niche case that's not very pretty.)

If with_margins is true then max(r.time) over all r in R is either < timelimit/ac_margin or >= timelimit*tle_margin.

Is that what we mean?

Yes :)

niemela commented 9 months ago

We (@Tagl @RagnarGrootKoerkamp @thorehusfeldt and myself) had a fairly long and somewhat productive (IMO) call to discuss the details. Here are some things I noted from that meeting. Not in order of importance, only numbered for reference:

1) We agree that submissions must always be exactly one directory level down, since otherwise we can't really tell a directory full of submissions apart from a directory that is a submission. I.e. these are all acceptable submissions: accepted/sol1.py, accepted/sol2/{header.h, main.c}, and these are not: sol3.cc, accepted/very_fast/sol4.cc.

2) The rules for combining permitted and required are clear. Every one is a rule, and they all have to be fulfilled. (This is equivalent to saying that permitted combines to the intersection, but I don't think there is a similar set theory rule for required?)

3) The rules for combining settings of used_for_timing are as yet unclear. Combining them the same way as (2) would not be useful, since doing that with two instances of the same value does nothing, and different values would always fail. Clearly there needs to be some overriding. Doing it in file order seems reasonable but is against YAML's definition of maps (they are explicitly unordered). Toward the end of the call (so we have not thought this through yet) it was suggested that given (1) (i.e. that all submissions are of the form <directory>/<submission>) we could say that fileglobs that target submissions always override fileglobs that target directories, and that targeting either with contradicting values is an error.

Examples:

accepted:
   used_for_timing: true

accepted/sol1.cc:
   used_for_timing: false
# this is ok, everything in accepted except accepted/sol1.cc is used for timing.
---
*/*.py:
   used_for_timing: false

accepted/really_fast.py:
   used_for_timing: true
# not ok, accepted/really_fast.py is given contradicting values.

4) We agreed that setting a tuple of time_limit_margin is more powerful, kinda nice, but probably overkill. used_for_timing is simpler and more directly what one would want to do in most cases anyway.

5) permitted and required when set on files always refer to the leaves of the test data group tree structure, i.e. all of the testcases. This implies that permitted (when not set to all values) requires a validation to run submissions on all testcases; there can be no short circuit. required, OTOH, can be short circuited as soon as a required testcase verdict has been found. Using either permitted or required could force the running of testcases on a submission that would not be run during "normal" judging (if it is short circuited). A (pair of) longer, more correct, but... probably too long names would be permitted_testcase_verdicts and required_testcase_verdicts.

6) verdict when set on files refers to the top-level verdict of the submission. This implies it can be short circuited in the same way that "normal" judging can.

7) We disagreed quite a bit on what the "testcase globs" could refer to: just testcases, or also test groups. I.e. does secret/group1 refer to the test data group of that name or all the test cases in that group?

8) We disagreed quite a bit on what the semantics of rules on "testcase globs" would be. Applying the rules individually to all test cases (or groups?) would be consistent with how the filename globs work, but applying the rules once to the set of all test cases matched would be more powerful.

9) Taking 7 & 8 together, maybe we should have separate ways of targeting testcases and test groups? (I think that's more complication than it's worth).

niemela commented 9 months ago

I talked some more with @Tagl; he came up with this idea for making the testcase matching just different enough that we could use the more powerful semantics (discussed under 8 above).

Instead of (something like) this:

<file glob>:
   verdict: [<verdicts>]
   score: <number or range>
   permitted: [<verdicts>]
   required: [<verdicts>]
   used_for_timing: <bool>
   <testcase glob>:
      score: <number or range>
      permitted: [<verdicts>]
      required: [<verdicts>]
      used_for_timing: <bool>
   <testcase glob>:
      permitted: [<verdicts>]
      required: [<verdicts>]
      score: <number or range>
      used_for_timing: <bool>

we could do:

<file glob>:
   verdict: [<verdicts>]
   score: <number or range>
   permitted: [<verdicts>]
   required: [<verdicts>]
   used_for_timing: <bool>
   testcases:
   - matches: <testcase glob>
     score: <number or range>
     permitted: [<verdicts>]
     required: [<verdicts>]
     used_for_timing: <bool>
   - matches: <testcase glob>
     permitted: [<verdicts>]
     required: [<verdicts>]
     score: <number or range>
     used_for_timing: <bool>
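
A sketch of how a checker might consume this shape (Python, fnmatch-based glob matching; the helper name is made up for illustration):

```python
from fnmatch import fnmatch

def testcase_rules_for(rule, testcase):
    """Collect the entries under "testcases:" whose "matches" glob matches a
    testcase path such as "secret/group1/03-petersen_graph"."""
    return [entry for entry in rule.get("testcases", [])
            if fnmatch(testcase, entry["matches"])]
```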

Thoughts?

eldering commented 9 months ago

We (@Tagl @RagnarGrootKoerkamp @thorehusfeldt and myself) had a fairly long and somewhat productive (IMO) call to discuss the details. Here are some things I noted from that meeting. Not in order of importance, only numbered for reference:

  1. Shouldn't accepted/sol2/{header.h, main.c} always be written as accepted/sol2/ instead (with the last / optional)?
  2. For required it's the union, right? required = S with a set of testcase verdicts T means: ∀ s in S: ∃ t in T: s == t, so if you have two required sets S_1 and S_2, then the logical and of the two expressions is equivalent to ∀ s in (S_1 ∪ S_2): ∃ t in T: s == t, I think.
  3. The proposal to override at deeper level of nesting sounds fine to me. I haven't thought it through either though.
  4. SGTM
  5. This should say that permitted can not be short circuited, but I guess that's just a typo.
  6. Not sure the verdict on submission level belongs here. Did we make verdict aggregation to submission level part of the problem format specification?
  7. (also point 8, I guess) I'm not sure I understand, but isn't it easy to let secret/group1 refer to the whole group and secret/group1/* to each test case individually?

A lot of progress was made!

niemela commented 9 months ago
  1. Shouldn't accepted/sol2/{header.h, main.c} always be written as accepted/sol2/ instead (with the last / optional)?

Yes, it should absolutely be written (in the YAML file) as accepted/sol2. I listed them as accepted/sol2/{header.h, main.c} to make it clear that I was referring to a submission consisting of multiple files in a directory.

2. For required it's the union, right? required = S with a set of testcase verdicts T means: ∀ s in S: ∃ t in T: s == t, so if you have two required sets S_1 and S_2, then the logical and of the two expressions is equivalent to ∀ s in (S_1 ∪ S_2): ∃ t in T: s == t, I think.

No, I don't think so? required means that at least one of the verdicts given needs to be the result of some test case, not that all of the verdicts given need to be. At least that's what Thore's specification (see the original message in this thread) says. @thorehusfeldt can you clarify?

5. This should say for permitted can not be short circuited, but I guess that's just a typo.

Yes, thanks for catching that. I have fixed the typo in the comment above.

6. Not sure the verdict on submission level belongs here. Did we make verdict aggregation to submission level part of the problem format specification?

Well, this was apparently a point of a lot of divergence. I would argue that "verdict aggregation to submission level" was always part of the format, and furthermore that it absolutely, positively must be. If we don't even agree on what the correct judgement is, it's not much of a standard, is it? OTOH, there is a weaker form of agreement we could aim for where we must agree on accepted vs rejected, but are allowed to possibly reject differently (because that matters a lot less).

7. (also point 8, I guess) I'm not sure I understand, but isn't it easy to let secret/group1 refer to the whole group and secret/group1/* to each test case individually?

I think so, yes. If I understand the counterpoint correctly, the testcase rules are supposed to match on sets of testcases, and never groups. With that view, then secret/group1 becomes "the testcases of group1" which is the same set of testcases as secret/group1/*.

A lot of progress was made!

Yes. :+1:

thorehusfeldt commented 9 months ago

The most dramatic change that came out of the meeting yesterday – nullifying three months of exchanges in various channels, the consensus of the August meeting in Lund, and all my concrete, thorough, and complete proposals since August – is that we discovered that there is no consensus about the following claim:

expectations are constraints on sets of testcase-verdicts, judgemessages and testcase scores.

Instead, expectations are now something much richer and also express constraints on aggregated verdicts. (These are themselves defined in terms of aggregation rules for sequences of testcase-verdicts.)

Thus, a testcase-result now also needs to include the name of the testcase (or some other total order of the test-case verdicts) in order to define #expectations. This would make it much easier to connect it with testdata.yaml and allow expectations.score to be clearly defined. So I’m moderately optimistic.

However, these changes were suggested 38 minutes before the meeting yesterday, and I haven’t been able to put as much time (compared to the three months we’ve discussed proposals 0.1, 0.2, and 0.3) into thinking them through.

I will try to pin down what I think we mean with this and re-emerge with expectations 0.4.

thorehusfeldt commented 9 months ago

required means that at least one of the verdicts given needs to be the result of some test case,

How can that be hard to find out? I quote myself from expectations 0.3 at the top of this page:

#expectation: {
    permitted_verdicts?: [...#verdict] // only these verdicts may appear
    required_verdicts?: [...#verdict]  // at least one of these verdicts must appear
    ....
}

BTW, these fields will now probably be renamed to required_testcase_verdicts and permitted_testcase_verdicts, in order to distinguish them from the new permitted_aggregated_verdict, which has very different semantics.

meisterT commented 9 months ago

How does this proposal relate to https://github.com/Kattis/problem-package-format/issues/112?

thorehusfeldt commented 9 months ago

It is the third or fourth iteration.

meisterT commented 9 months ago

Does this mean the other issue is obsolete and can be closed? Or does this add on top of the other version?

niemela commented 9 months ago

expectations are constraints on sets of testcase-verdicts, judgemessages and testcase scores.

Instead, expectations are now something much richer and also express constraints on aggregated verdicts. (These are themselves defined in terms of aggregation rules for sequences of testcase-verdicts.)

This claim confuses me. Aggregated verdicts are themselves "constraints on sets of testcase-verdicts", so constraints on testcase-verdicts and aggregated verdicts are still constraints on sets of testcase-verdicts?

Or is the key insight here the difference between sets of testcase-verdicts versus sequences of testcase-verdicts?

niemela commented 9 months ago

@thorehusfeldt:

How can that be hard to find out? I quote myself from expecations 0.3 at the top of this page: [...]

Yes, the section you quote is what I based my understanding on.

The source of the confusion is the suggestion that there is some set theoretic combination of required. Is there?

You say:

This is well defined (because all “my” rules have obvious compositionalities.)

How do multiple required compose? I can't find the description of that.

niemela commented 9 months ago

Does this mean the other issue is obsolete and can be closed?

Yes, I have closed it.

Or does this add on top of the other version?

In the sense that all our discussions build on top of all our earlier discussions and insights, it kinda does "add on top". That said, closing a ticket does not delete it, we can still read and refer to it. (And even reopen if we feel the need).

thorehusfeldt commented 9 months ago

Aggregated verdicts are themselves "constraints on sets of testcase-verdicts"

They are not; the important word here is “set”.

Aggregated verdicts are constraints on totally ordered sequences of testcase-verdicts. (In particular, they are defined in terms of which non-AC verdict comes first.)

And aggregate scores are something even richer: they are constraints on trees of verdicts, modified by the aggregation rules of testdata.yaml.

Sets are not totally ordered sequences, and totally ordered sequences are not partial orders. Note for instance the very first paragraph of my own introduction from yesterday:

We want to specify the expected behaviour of submissions on collections of testcases. (Collections could be singletons, sets, lists (so they’re ordered), or trees, optionally with testdata.yaml applying to internal nodes. Not important here.) [https://github.com/thorehusfeldt/problem-package-format/wiki]

This was literally my opening sentence yesterday, and I highlighted the choice of the word “collection” (to encompass all three cases) verbally. You may be underestimating my ambition for precision. :-)
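
For concreteness, the kind of aggregation rule alluded to above (the first non-AC testcase verdict, in testcase order, determines the submission verdict) can be sketched like this; illustrative only, not normative:

```python
def aggregate_verdict(verdicts):
    """First-error aggregation over verdicts given in testcase order."""
    for v in verdicts:
        if v != "AC":
            return v
    return "AC"
```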

niemela commented 9 months ago

They are not; the important word here is “set”.

Aggregated verdicts are constraints on totally ordered sequences of testcase-verdicts. (In particular, they are defined in terms of which non-AC verdict comes first.)

I would maybe say that aggregated verdicts are constraints on sets of (testcase, verdict) pairs. Testcases themselves are, of course, ordered.

niemela commented 9 months ago

This was literally my opening sentence yesterday, and I highlighted the choice of the word “collection” (to encompass all three cases) verbally. You may be underestimating my ambition for precision. :-)

So you did say that you wanted to "specify the expected behaviour on [...] lists [and] trees", but now that we do, that is "nullifying three months of exchanges..."? I'm actually very confused.

thorehusfeldt commented 9 months ago

How do multiple required compose? I can't find the description of that.

A submission must satisfy all the expectations that apply to it. (For instance, all the expectations that match – by whatever rule we can agree on – its filename. Or are specified in the source code. Or hardcoded into the specification. Or… it’s not important.)

So (by whatever rule) assume submission S is expected to satisfy expectation E and F. Then your expectation-checker could, for instance, validate S against E and then validate S against F and only accept if both validations were successful.

This would be equivalent to collecting all required fields in E and all required fields in F and computing their union; as well as collecting all permitted fields in E and all permitted fields in F and computing their intersection. It’s not important which of these two strategies the checker uses. They give the same result (which, I argue, is also the intuitive semantics).

niemela commented 9 months ago

This would be equivalent to collecting all required fields in E and all required fields in F and computing their union; as well as collecting all permitted fields in E and all permitted fields in F and computing their intersection. It’s not important which of these to strategies the checker uses. They give the same result (which, I argue, is also the intuitive semantics.)

No, I believe that's incorrect.

Let's say expectation E is required(WA, AC) and F is required(TLE, RTE). Fulfilling both of these means that "at least one of [WA or AC] must appear" and "at least one of [TLE or RTE] must appear". In this example this requires a minimum of two different things to appear. The union of WA, AC and TLE, RTE is WA, AC, TLE, RTE, and fulfilling required(WA, AC, TLE, RTE) means "at least one of [WA, AC, TLE, or RTE] must appear". In this example this requires only one thing to appear, so clearly these two things are not the same.

What am I missing?
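
The counterexample as a quick check (Python; required_ok is just the "at least one of these verdicts must appear" rule from the proposal):

```python
def required_ok(verdicts, required):
    # "at least one of these verdicts must appear"
    return any(v in required for v in verdicts)

verdicts = ["AC"]          # e.g. a submission that happens to be AC everywhere
E = ["WA", "AC"]
F = ["TLE", "RTE"]

both = required_ok(verdicts, E) and required_ok(verdicts, F)   # False
union = required_ok(verdicts, E + F)                           # True
assert both != union   # composing required by union is not equivalent
```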

thorehusfeldt commented 9 months ago

You're right, I'm wrong. Thank you.

My first semantics is the intention: A submission must satisfy all expectations that apply to it.

niemela commented 9 months ago

Closing this. Discussion continues in #137 .