Kattis / problem-package-format

Kattis problem package format specification
https://www.kattis.com/problem-package-format

When must default answer files exist? #98

Closed thorehusfeldt closed 5 months ago

thorehusfeldt commented 1 year ago

Honest question. The specification is silent about when a test case must provide .ans.

  1. I’m reasonably sure that for validation: default it is true that if <base_name>.in exists then <base_name>.ans must also exist. We should say that.
  2. I think this is true for validation: custom. This is highly implicit, but since judge_answer in the specification needs a value, the adventurous reader can guess that this value was taken to be <base_name>.ans. I’m not sure, though.
  3. The answer is not “always”, because as of now, for validation: interactive: True, there explicitly need be no such file in data/sample

Is the rule “whenever an input file .in exists then a default answer file .ans must also exist”? (I think so.)
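
For concreteness, here is a minimal sketch of the check that this rule would imply (a hypothetical helper, not part of problemtools or the specification), assuming the usual data/ layout with sibling <base_name>.in / <base_name>.ans files:

```python
# Sketch only (hypothetical helper, not part of problemtools): check that every
# <base_name>.in under data/ has a sibling <base_name>.ans.
from pathlib import Path
import sys

def missing_answer_files(problem_dir: str) -> list[Path]:
    missing = []
    for in_file in sorted(Path(problem_dir, "data").rglob("*.in")):
        if not in_file.is_file():        # ignore directories and broken symlinks
            continue
        if not in_file.with_suffix(".ans").is_file():
            missing.append(in_file)
    return missing

if __name__ == "__main__":
    for in_file in missing_answer_files(sys.argv[1]):
        print(f"missing answer file for {in_file}")
```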

RagnarGrootKoerkamp commented 1 year ago

I like the simple requirement that a .ans must always exist, but indeed, for interactive problems we tend to just create empty files there, so I'm also OK with 'a .ans must exist unless the problem is interactive, in which case it may exist'.

(Whether the file is 'default'/empty or not is independent of this. I don't think we should ever specify anything about its contents, only about its existence.)

Tagl commented 1 year ago

I agree with Ragnar about the simple requirement, simple is good. But if there is strong desire to make it optional for interactive problems then that's fine by me too.

jsannemo commented 1 year ago

While interactive validators never strictly "need" the ans in the same way as a general custom validator, it certainly happens. For example, in https://open.kattis.com/problems/bitgame the .ans contains the answer (YES/NO) that is correct.

Thus I definitely prefer requiring .ans to exist; it feels a bit too risky to leave it to be empty by default instead (since we can't know if the author signals that intent or forgot a particular .ans file).

thorehusfeldt commented 1 year ago

So, let me try:

  • a test case tc is defined by the contents of its input file (tc.in) (and not the filename or even the pathname)
  • if /path/to/tc.in exists then /path/to/tc.ans must also exist

RagnarGrootKoerkamp commented 1 year ago
  • a test case tc is defined by the contents of its input file (tc.in) (and not the filename or even the pathname)

How/where is this relevant? (It seems unrelated to this specific issue.)

I would say the presence of a file/symlink (but maybe not broken symlink, and definitely not directory) /path/to/tc.in makes the testcase 'exist'.

  • if /path/to/tc.in exists then /path/to/tc.ans must also exist.

agreed

niemela commented 1 year ago
  • a test case tc is defined by the contents of its input file (tc.in) (and not the filename or even the pathname)

How/where is this relevant? (It seems unrelated to this specific issue.)

I was thinking the same.

thorehusfeldt commented 1 year ago

I’m merely trying to define, maybe only to myself, what “a testcase” is. I don’t find it easy.

(For instance are data/sample/1 and data/secret/023-edgecase-08 the same testcase when they have identical contents? By now I think yes.)

jsannemo commented 1 year ago

I don't know in what context "same testcase"-iness is relevant, but at least for me they would also have to match testdata.yaml settings.

niemela commented 1 year ago

(For instance are data/sample/1 and data/secret/023-edgecase-08 the same testcase when they have identical contents? By now I think yes.)

If they are the same, what is the name of that testcase?

Tagl commented 1 year ago

I’m merely trying to define, maybe only to myself, what “a testcase” is. I don’t find it easy.

(For instance are data/sample/1 and data/secret/023-edgecase-08 the same testcase when they have identical contents? By now I think yes.)

They are not necessarily the same, but the contents are equivalent and therefore the testcase is equivalent. A judge could choose to run the equivalent testcase or reuse the result. Randomized solutions may succeed once but fail the second time.

RagnarGrootKoerkamp commented 1 year ago

Please correct me if I'm wrong:

  • symlinks for included testcases reuse the verdict from the pointed-to testcase. (output_validator_flags should change this when they differ, but Kattis doesn't implement that. Not sure what the spec says.)
  • Problemtools does not allow identical testcases otherwise (I think)

So currently, having two equal .in files (not symlinked) could be treated as either

  • reuse the result, as if one of them (the lexicographically larger one) were a symlink
  • it's a new testcase, that just happens to have the same contents.

I'm leaning towards:

  • symlinks: reuse if possible, ie output_validator_flags the same
  • identical .in files: new independent testcase. Maybe warn for this but otherwise OK.

Tagl commented 1 year ago

Problemtools warns for identical testcases. It is not an error.


jsannemo commented 1 year ago

  • symlinks: reuse if possible, ie output_validator_flags the same
  • identical .in files: new independent testcase. Maybe warn for this but otherwise OK.

I think I'm strongly against symlinkness having a semantic value in the spec. That way madness lies.

RagnarGrootKoerkamp commented 1 year ago

I think I'm strongly against symlinkness having a semantic value in the spec. That way madness lies.

Sounds fair. But that would then indeed mean that identical contents implies it's 'the same' testcase and is not rerun unless output_validator_flags are different. (Assuming we want to keep the existing behavior of not rerunning included cases.)

thorehusfeldt commented 1 year ago

I note that there is no consensus about what a test case is. I encourage you all to try to define it.

jsannemo commented 1 year ago

and is not rerun unless output_validator_flags are different.

This is in no way mandated by the spec, and wasn't even Kattis' default behavior a decade ago. It became the default behavior mostly to support testgroup inclusion for IOI problems. I'm not convinced there's sufficient value in mandating this judging behavior in the spec.

jsannemo commented 1 year ago

I note that there is no consensus about what a test case is. I encourage you all to try to define it.

A test case is a name such as data/secret/023-edgecase-08 for which there is an .in and an .ans file; that is at least my definition.

Two test cases can of course be indistinguishable, equivalent, "reused" or whatever we want to say, but I'm pretty sure the above definition is what most tools, judges and systems define to be a test case.

@thorehusfeldt

RagnarGrootKoerkamp commented 1 year ago

and is not rerun unless output_validator_flags are different.

This is in no way mandated by the spec, and wasn't even Kattis' default behavior a decade ago. It became the default behavior mostly to support testgroup inclusion for IOI problems. I'm not convinced there's sufficient value in mandating this judging behavior in the spec.

Fine with me, but then we should probably explicitly mention that this is not supported, and that either

jsannemo commented 1 year ago

Maybe I misunderstood you. My stance is:

  • Test cases that have the same .in and .ans but different output validator flags must be run once each with their correct flags
  • We shouldn't mandate that test cases for which the tuple (in, ans, validator flags) is the same are only run once: that is an optimization a judge may or may not apply depending on its configuration.

RagnarGrootKoerkamp commented 1 year ago

Maybe I misunderstood you. My stance is:

Ah no I misread your reply at some point.

  • Test cases that have the same .in and .ans but different output validator flags must be run once each with their correct flags

SGTM

  • We shouldn't mandate that test cases for which the tuple (in, ans, validator flags) is the same are only run once: that is an optimization a judge may or may not apply depending on its configuration.

OK for me. Then we should explicitly allow it though. I think we should only check (in, ~~input_validator_flags,~~ output_validator_flags) for this though? Technically a custom validator can read the .ans to be more/less lenient with the .in but that sounds weird. (But I could see some possible use cases for this, so including .ans is also OK.)

jsannemo commented 1 year ago

Then we should explicitly allow it though

Yes, that makes sense.

I think we should only check (in, input_validator_flags, output_validator_flags) for this though?

Input flags aren't used for judging, so I don't think they should apply. In particular, including them breaks subtask reuse (since subtasks have different input flags).

Technically a custom validator can read the .ans to be more/less lenient with the .in but that sounds weird.

If an author uses the .ans in this way (we often put e.g. strategy descriptors for interactive problems in .ans, and the case itself in .in) we clearly should still rerun the case, since there's an observable difference. However, an author who won't use it this way isn't hurt by this anyway, since they can have identical .ans for identical .in. => it's a bit better but not worse to also check the .ans

niemela commented 1 year ago

My take from all of this is that we want to clarify two things:

  1. When a test case must have a .ans file (the original question of this issue).
  2. Whether a judging system may reuse results between equivalent test cases.

For the latter, how about saying something like "judging systems are allowed to assume that all programs are deterministic"? This would imply that results from equivalent test cases (same test data, same output validator flags) may be reused, and also that the output from the submission may be reused for non-interactive problems when the test data is the same, but the output validator flags differ.
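
For concreteness, a rough sketch of what such optional reuse could look like on a judging system (hypothetical code, nothing here is mandated by the spec; the equivalence key covers the .in and .ans contents plus the output validator flags, per the discussion above):

```python
# Sketch of *optional* result reuse, assuming submissions are deterministic.
# A judge MAY reuse a verdict when the (in, ans, output_validator_flags)
# tuple is identical; it is equally free to rerun the submission.
import hashlib
from pathlib import Path

def case_key(in_file: Path, output_validator_flags: str) -> tuple[str, str, str]:
    def digest(path: Path) -> str:
        return hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else ""
    return (digest(in_file), digest(in_file.with_suffix(".ans")), output_validator_flags)

verdict_cache: dict[tuple[str, str, str], str] = {}

def judge(in_file: Path, output_validator_flags: str, run_case) -> str:
    key = case_key(in_file, output_validator_flags)
    if key not in verdict_cache:          # reuse is allowed, never required
        verdict_cache[key] = run_case(in_file)
    return verdict_cache[key]
```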

eldering commented 9 months ago

Just to add a counterpoint to Fredrik's second point:

This would make creating problems that should be solved with a non-deterministic algorithm more difficult. Are we ok with that? (That's an honest question, I've never written or solved such a problem.)

niemela commented 9 months ago

This would make creating problems that should be solved with a non-deterministic algorithm more difficult. Are we ok with that? (That's an honest question, I've never written or solved such a problem.)

I think we are, but that is a very good question that we need to decide on.

@Tagl @jsannemo @RagnarGrootKoerkamp @ghamerly @pehrsoderman Thoughts?

ghamerly commented 9 months ago

This would make creating problems that should be solved with a non-deterministic algorithm more difficult. Are we ok with that? (That's an honest question, I've never written or solved such a problem.)

I think we are, but that is a very good question that we need to decide on.

If I understand correctly, I think this is a bit down the rabbit hole... The context is a problem for which a non-deterministic solution is likely to be submitted, and which also uses an identical test case (in/ans/output validator flags). In this case a judge may have the option of reusing a judging result, meaning that the non-deterministic solution might not be run an additional time and thus might not fail (or pass) appropriately on the second (or any subsequent) run. This is "easy" to work around (for the problem creator) by using superficially different test cases (including output validator flags). If I'm missing the point please let me know.

pehrsoderman commented 9 months ago

This would make creating problems that should be solved with a non-deterministic algorithm more difficult. Are we ok with that? (That's an honest question, I've never written or solved such a problem.)

I feel that the whole concept of reusing results of testcases in case they are identical is a can of worms. It seems much more straightforward to just execute all the testcases as described. And then, if you really need a scoring that includes the same testcase in multiple groups, extend the language for how scoring is calculated to enable this.

Tagl commented 9 months ago

I agree with Greg; this is not a problem in practice, since in the very rare cases where you use the same input file but want it run differently, you can add a no-op output validator flag to make the cases different.


simonlindholm commented 9 months ago

I feel that the whole concept of reusing results of testcases in case they are identical is a can of worms. It seems much more straightforward to just execute all the testcases as described. And then, if you really need a scoring that includes the same testcase in multiple groups, extend the language for how scoring is calculated to enable this.

It has some pitfalls, but I haven't ever encountered those in practice over the years I've been problemsetting, while test case reuse is basically a requirement for IOI-style problems (regularly cuts judging times by a factor of 3, say). We used to do it via graders, but that doesn't work when you want the UI to show test group scores for submissions. Testcase result reuse based on identical inputs felt like the simplest solution, with symlinks being required in problemtools to make the reuse explicit. If we come up with some kind of format for test case metadata I could see some flag there filling the same function, but that change seems to me like a lot of effort for little gain.

RagnarGrootKoerkamp commented 9 months ago

I agree with @simonlindholm : The status quo of requiring symlinks for identical cases seems to work well enough.

I indeed prefer to not run submissions multiple times for identical input. Inconsistent results on equal testcases could easily cause confusion.

Tagl commented 9 months ago

The status quo of requiring symlinks for identical cases seems to work well enough.

I don't think a copy of a file should be treated any differently from a symlink, as Johan said above. What matters is that the contents of the input files (and the flags) match; if they do, it should be fine to reuse results, regardless of whether one of them is a symlink to the other.

RagnarGrootKoerkamp commented 9 months ago

Right sure, whether or not it's a symlink/hardlink/identical file shouldn't matter.

Then we should consider though whether we allow or require reusing results for identical cases. If we require this, that means systems must actively identify and deduplicate identical inputs, which is maybe too strong?

niemela commented 9 months ago

I don't think a copy of a file should be treated any differently from a symlink, as Johan said above.

There should be no difference in meaning, but I absolutely think it would be ok for tooling to use symlinks as a signal that files are identical on purpose. I.e. it should be ok to warn for identical copies, but not for symlinked files.

Then we should consider though whether we allow or require reusing results for identical cases. If we require this, that means systems must actively identify and deduplicate identical inputs, which is maybe too strong?

I strongly think we should allow but not require reuse. The way I like to say that is that a system is allowed to assume that submissions are deterministic for the purpose of determining results. Note that in addition to allowing reuse this also allows rerunning.

niemela commented 9 months ago

I think the consensus is:

  • If <base_name>.in exists then <base_name>.ans must also exist (for interactive problems it may be an empty file).
  • Judging systems are allowed, but not required, to reuse results between equivalent test cases (same test data, same output validator flags).

(Note that the latter item is the same as "judging systems are allowed to assume that all programs are deterministic")

thorehusfeldt commented 9 months ago

Since the specification contains no definition of “what a test case is” (in particular, when two test cases are “identical”), the above rules are not as clear as we may think.

My own, private definition is in my attempt at writing a more precise problem package format, which you can inspect at https://github.com/thorehusfeldt/problem_package_format_copy/blob/develop/docs/source/format.rst . There I write:

A :term:`test case` has a unique test case name such as secret/043-no-edges and is determined by its input file, such as secret/043-no-edges.in. TODO: no consensus about this. In most problems, every test case has a :term:`default answer file` with the extension .ans. Several other files with the same base name and other extensions than .in may exist;

In summary, my hunch is that a testcase is defined by

  1. a name such as secret/043-no-edges. This name is problem-unique. (There are no two testcases with the same name in the same problem.)
  2. the contents of the corresponding .in-file, such as secret/043-no-edges.in. These contents are non-unique. (There can be several different testcases with the same contents.)

I think that’s what you all mean, but I’ve never seen a definition from any of you.

(In particular, if secret/foo.in is a symbolic link to sample2.in, so that the contents of the former are identical to the latter, then these are two different testcases. I think that’s how it should be. (Because different testdata.yaml apply to them, which can set different validator flags.) But again, I’m trying to read minds and shouldn’t define this; you all understand this much better.)

I believe that .files and .args complicate this (and obsolete my definition). Better people than me should try to define “what a testcase is” (well enough for us to make sense of the phrase “two identical testcases”).

RagnarGrootKoerkamp commented 9 months ago

@thorehusfeldt Yes your definitions seem correct to me.

When .files/.args are present those should also be equal for testcase content to be equal. Rephrasing as 'judging systems may assume programs are deterministic' probably ends up being more concise.

niemela commented 9 months ago

Since the specification contains no definition of “what a test case is” (in particular, when two test cases are “identical”), the above rules are not as clear as we may think.

The line in the spec that is supposed to define this says:

All files and folders associated to a single test case have the same base name with varying extensions.

I.e. a test case is defined by its name (or by the set of files with the same path and base name, same thing really). I agree that this could and should be written better.

My own, private definition is in my attempt at writing a more precise problem package format, which you can inspect at https://github.com/thorehusfeldt/problem_package_format_copy/blob/develop/docs/source/format.rst . There I write:

A :term:`test case` has a unique test case name such as secret/043-no-edges and is determined by its input file, such as secret/043-no-edges.in. TODO: no consensus about this. In most problems, every test case has a :term:`default answer file` with the extension .ans. Several other files with the same base name and other extensions than .in may exist;

In summary, my hunch is that a testcase is defined by

1. a name such as `secret/043-no-edges`. This name is problem-unique. (There are no two testcases with the same name in the same problem.)

2. the _contents_ of the corresponding `.in`-file, such as `secret/043-no-edges.in`. These contents are non-unique. (There can be several _different_ testcases with the _same_ contents.)

Your definition is the same as mine, but then you also added item 2. Why is item 2 needed? If two test cases have the same "name" then they must have the same "contents", and if the "name" is different then they are considered different by both definitions. In either case item 2 does not make any difference.

I think that’s what you all mean, but I’ve never seen a definition from any of you.

(In particular, if secret/foo.in is a symbolic link to sample2.in, so that the contents of the former are identical to the latter, then these are two different testcases. I think that’s how it should be. (Because different testdata.yaml apply to them, which can set different validator flags.) But again, I’m trying to read minds and shouldn’t define this; you all understand this much better.)

I believe that .files and .args complicate this (and obsolete my definition). Better people than me should try to define “what a testcase is” (well enough for us to make sense of the phrase “two identical testcases”).

I don't think it complicates it much. I consider all the files for a test case to be part of the test case, including e.g. .hint files.
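
For concreteness, a rough sketch of enumerating test cases under this definition (a hypothetical helper, not taken from problemtools or any other tool): a test case is the set of files under data/ sharing a path and base name, and it exists only if its .in file is present.

```python
# Sketch: enumerate test cases as "all files sharing a base name".
# The name is the path relative to data/ without the final extension;
# .ans, .hint and similar files are siblings of the .in file that defines existence.
from collections import defaultdict
from pathlib import Path

def enumerate_test_cases(problem_dir: str) -> dict[str, list[Path]]:
    data = Path(problem_dir, "data")
    by_name: dict[str, list[Path]] = defaultdict(list)
    for f in data.rglob("*"):
        if f.is_file():
            by_name[str(f.relative_to(data).with_suffix(""))].append(f)
    # A test case exists only if its .in file is present
    # (so e.g. stray testdata.yaml entries are filtered out).
    return {name: files for name, files in by_name.items()
            if any(f.suffix == ".in" for f in files)}
```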