Kattis / problem-package-format

Kattis problem package format specification
https://www.kattis.com/problem-package-format
9 stars 13 forks source link

Add (problem) version #146

Open thorehusfeldt opened 6 months ago

thorehusfeldt commented 6 months ago

I suggest to introduce

#problem: {
    ...
    version!: #version_number | *"1.0.0"
    ...
}
#version_number: =~"\d+\.\d+\.\d+"

for instance

name: Hello World
version: 1.2.3

Use cases

  1. Thore Teacher uses a problem about balanced parentheses in his Algorithms and Data Structures course: https://itu.kattis.com/courses/KSALDAS/2020-Spring/assignments/fcd6pc/problems/itu.balance that (as most teaching material) evolves every year. This year, he wants to move a testcase from secret to sample, because many students make the same mistake (about forgetting to check if the final stack is empty). He thinks of the new problem as version 1.1. He uses this version number to
    • tag the relevant commit in his own repository
    • communicate TAs that “the problem is now 1.1; your printout of the teaching guide clearly says 1.0, so you should download the new PDFs. They shouldn’t make this mistake anymore.”
    • explain to Sonny Student (how is retaking the course) that they need to resubmit itu.balance because they solved 1.0 instead of 1.1. It would be great if the user interface showed this information to Sonny under Metadata.
    • would like the “course data export” to include information about which version of itu.balance was solved in the JSON output Note that both 1.0 and 1.1 are (or have been) installed on itu.kattis at various times (with the same name), and that this made sense for the course.
  2. During development of BOI 2023 problem boi23.tycho, Arnar finally contributes the Icelandic translation, bumping the version number from 0.5.12 to 0.5.22. Various Kanban boards and github tags are updated, much rejoicing ensues. (Nothing is installed on any judge server at this time, BOI is still weeks away.)
  3. Johan Sannemo lets Thore use unoreverse on Will Code for Drinks 2023, including the problem package. The problem was already used on Shot and Submit 2023, but Thore makes lots of changes, including changes to the input format. (In particular, accepted submissions to the first version no longer pass the second version.) The repo is still the same, including lots of testdata and submissions from Johan, Simon, and others, and everybody involved likes the fact that lines of development can be traced. The problem version eventually installed on https://wcfd23.kattis.com/contests/wcfd23/problems/unoreverse is thought of as version 2 of the problem. (This is the first time the problem appears on a Kattis server, it existed on a different judge server in version 1.)
  4. After running https://dpop21.kattis.com/contests/dpop21/problems/bakkesnagvendt, the problem was much improved, and moved closer to the children’s song that inspired it. The version that finally made it to https://open.kattis.com/problems/bakkesnagvendt is incompatible with the original but morally the same problem. Consequently, this is version 2, and 2.0.1 after it got the English translation.
  5. Problem https://open.kattis.com/problems/intervalscheduling started as itu.intervalscheduling. Greg asked Thore whether it could be namespaced to open, at the same time adding some extra testcases from Johan Sannemo. This (in Thore’s mind) bumps the version number from 1.0 (which I have in my own repo) to 1.1 (to which I don’t have the source files).
  6. [update: another example] Marc McAuthorson submits version 0.1.0 of problem idea goodbye to the Very Important Programming Contest, but it is not selected by the jury (there was already another string DP problem). He tweaks it a bit, makes it scoring, and sends 0.2.0 to the Syldavian Informatics Olympiad Qualifier (Alice Author is on both juries and remembers 0.1.0 of the problem), where it is accepted with some suggestions from Alice. The first version Andy shares on the the syldavian IO github is 0.3.0, from which development continues. Version 1.0.0 is the first one actually installed on sioq24.domjudge.com, and 1.0.2 (two character encoding issues arose during deployment) is finally used at the contest. A year, and some C++ library issues later, 1.1.0 is at the SIO local training server, and version 1.1.1 (with English translation) makes it to open.kattis.

These are just cases of the type I myself have been involved with, all of which used the concept of a “version of the problem”, as the problem develops or migrates between use cases, course instances, contests, or servers.

Suggested semantics

A version number is x.y.z. When x>0 then

This loosely follows the philosophy of Semantic Versioning, but I don’t think this needs to be overly specified; the problem author should be able to use the version number for whatever purpose they see fit.

Added: The field always exists; if not specified in problem.yaml its value is `"1.0.0".

niemela commented 6 months ago

It's unclear to me what guarantees are intended here.

Some questions from the point of view of a system that uses problems:

Assuming I have a problem installed with version x.y.z, and am asked to install the same problem with version a.b.c, what may/should/could I assume/do in these cases (where x > 0):

  1. a.b.c < x.y.z
  2. a.b.c = x.y.z
  3. a.b = x.y. c > z
  4. a = x. b > y
  5. a > x

At this point I would think that it would be more useful to have more strongly defined semantics. IMO, that's the only point in requiring the x.y.z format for versioning. Of course, that allows for a compromise in that we could say something like "if version is on the x.y.z format then \<strong semantics>, if not then \<weak semantics>", where the weak semantics could be as weak as "different is different, same is same".

eldering commented 6 months ago

I see a few cases to discern:

  1. Two versions of a problem have a similar story, but different input/output formats. This means that accepted solutions to one problem are not accepted to the other and vice versa. I think such problems should be considered completely different and not different versions of the same problem, since you can't say that one version is better or newer than the other.
  2. What remains is problem versions where the in/ouput format is "essentially" unchanged, let's assume for now that it is really unchanged. Then it could be that some submissions change verdict, for example since hard testcases were added or removed. Thus the whole problem might need to be rejudged after loading the new version.
  3. What remains are changes that won't require rejudging, like small typo fixes or clarifications in the problem text, adding an extra sample test case (assuming it was already covered by the secret test cases).

Looking then at Thore's list of examples:

  1. This is my case 3.
  2. I guess this would also fall under my case 3, but I'm not sure versions make sure during problem development. I think that using the version control system makes more sense there.
  3. This is my case 1 and should be considered a totally different problem. Of course it could somehow reference the original problem as origin.
  4. Same as previous one.
  5. This is my case 2.

Are there any other cases that are not covered by this?

thorehusfeldt commented 6 months ago

Thank you both for interacting with this topic!

First, a general observation

Stakeholder in the Problem Package Format

There are (at least) three or four stakeholders in the probolem package format.

  1. Maintainers of judge systems such as Kattis or Domjudge.
  2. Mainters of problem development systems such as BAPCtools or Problemtools.
  3. Maintainers of problem archives such as Open Kattis or various National Olympics reservoirs of “previous years’ problems”
  4. Problem authors (typically as a team invested in preparing a programming contest, but also individuals who prepare teaching materials)
  5. Teachers (who may or may also be authors; teachers use or develop/maintain problems for courses, not contests, sometimes reusing them with variations over many years). Or Thore gives a course in Algorithmic Problem Solving where problem development of a problem in the problem package format is a learning goal in the curriculum.

(There are probably other stakeholders I am blind to. We are all unaware of the perspectives of others.)

What a Contest Judge-System Needs the Version Tag For

Nothing.

A contest judge has exactly one version of “the problem” installed. Whether that version is 1.0.0 or 4.2.1 plays no role. Problems are probably identified by server-unique ID, which may or may not be the uuid, problem name (in English), problem id, or whatever. It’s not important. There will never be “two versions of the same problem” on the same server.

This leads to FN’s question:

It's unclear to me what guarantees are intended here.

None. When I (as a teacher), ask itu.kattis to install some problem then I expect that to be done, it’s covered by the license fee. Whether I (as the problem author) bump the version number, or change the title, scoring, etc. is up to me.

(Of course, Kattis and I could establish a convention that I will bump the version number. I (Thore, as teacher and license owner) would like that: Kattis’ quality assurance asks me to bump the version number according so some scheme; much like what FN is partially sketching above. Great! But I (Thore, as contributor to the problem package format) at this point have no interest in establishing consensus about specifying this.)

All I want from a judge system is that it displays the version number under problem metadata, in particular when the version number differs from 1.0.0, in particular I would like course-platform Kattis to do this.

Other Versioning Schemes

I have realised that Semantic Versioning often comes in the way of establishing consensus.

So let me be clear here: I have nothing against other versioning schemes. In particular, as a course teacher, Calendar Versioning make a lot of sense. I think of last year’s version of itu.balance as “last years version”, so it would make complete sense to set it to version: 2023.1. Suits me fine. If I get myself well enough organised, I’ll update by repo with some changes, as Kattis to install it. I’d love to call it version: 2024.1 instead of version: 2.0.1 or whatever.

So, I hereby amend my opening message with something like:

#version_number: #calver | #semver
#semver: =~"\d+\.\d+\.\d+"
#calver: =~"\d+\.\d+"

Summary

In summary, I want a version key, and it is mainly as a problem author (in order to communicate with myself and other authors during development, problem evolution, and maintenance) and as a teacher (as a minimum such that I and my students and TAs known which problem we’re talking about). These interests are different than the perspective of an online judge. Some of my wishes are addressed by a uuid, but I’d like something that communicates intent and intellectual heritage to humans.

On the other hand, I am very flexible on the details of which versioning scheme is adopted, and what the semantics should be.

(Alas, I have to do other things now – grading 126 written Algorithms Design exams. I just wanted the the above two points off my chest. I’ll return to this issue at a later point.)

thorehusfeldt commented 6 months ago

Semantics of versioning

Let me try to address @niemela ’s (good) questions about the semantics of versioning.

First, my own position is:

the semantics of versioning does not need to be defined in the specification.

(For an analogy, we don’t define what author means. If Thore wants to add Ragnar as a problem author then that’s up to Thore, the specification does not need to specify the extent of Ragnar’s contribution. Instead, the author field is mainly meant to communicate problem metadata among humans. Of course, an online problem archive like Open Kattis may choose to make the author field searchable, and a contest system like BAPCtools may choose to publish the author name for the solution walkthrough. It can also ignore the field.)

Second, I am open to much more lenient version strings than #version: =~"\d+\.\d+\.\d+". Here are three proposals:

#version: #calver | #semver // A
#semver: =~"\d+\.\d+\.\d+"
#calver: =~"\d+\.\d+"
---
#version: string  // B
---
#version: =~"\d+\.\d+\.\d+" // C

I’m fine with any of these. With B, any requirement for semantics is void. The rest of this message is about #semver just to be explicit. Let me repeat that my answer is conditional on the assumption that semantics need to be defined, and that am not sure I agree with that assumption.

With these caveats, let me answer the questions of @niemala :

Assuming I [niemela] have a problem installed [on a judge server] with version x.y.z, and am asked to install the same problem with version a.b.c, what may/should/could I assume/do in these cases (where x > 0): […]

I would assume that depends on the server. open.kattis should be curated by “the Kattis people” (who can and should do what they want), whereas the rules for boi23.kattis.com or itu.kattis.com would be different (maybe even depending on a license.). I don’t think such conventions should be part of the specification.

However, here’s a plausible set of expectations:

  1. If a.b.c <= x.y.z then niemala should send a polite (possibly auto-generated) message back of the form “please change the version number, for instance to f"{x}.{y}.{z+1}”.
  2. If the new version differs in data/* or output_validators, the message might suggest to change to f"{x}.{y+1}.{0} f"{x+1}.{0}.{0} (up to the author.)
  3. When a>x or b>y (either in the version field of the metadata, or as a result of the previous point), I expect Kattis admins to ask me “I can see that the aggregate verdict of some previous submissions may have chanced. Should we re-judge those on the new version?” No matter the answer, I would then expect Kattis to retain the problem-versioned verdicts for old submissions. (So that I, as a teacher, can see that S. Student received AC on itu.balance version 1.13.1 in 2022-02-12 12:56, modulo GDPR-compliance.)
  4. Same answer for a (hypothetical) hotfix during a (say, week-long) contest aimed at high schoolers. As a head of jury I would like to know that Penelope Pupil got 37 points on version 1.0.0 of the problem (even though the problem version running since Wednesday, 1.3.1` actually breaks Penny’s submission). This would be highly irregular, and shouldn’t happen in a high-profile contest, but I can see this happening in a local high school contest, say, on student-submitted problems on a home-made judge systems that use the problem package format. (Think of the judge system as a student project in a devops course.)

With all the above caveats, here’s finally the concrete answer to a concrete question:

  1. a.b.c < x.y.z: Reject request, ask for higher version number.
  2. a.b.c = x.y.z: Reject request, ask for higher version number.
  3. a.b = x.y but c > z: Reinstall
  4. a = x but b > y: Reinstall, ask author for re-judging policy
  5. a > x: Reinstall, ask author for re-judging policy
niemela commented 2 months ago

I have re-read, and re-thought this, so let me try to re-state my current thinking:

Regarding having some (any) kind of versioning:

This is clearly useful, it allows us to ask and answer questions like "what version is installed" and "did you install version X". This is possible with any kind of versioning scheme, and it's not (really) possible right now.

Regarding what kind of versioning we should use:

I can see three different kind of schemes, from weakest and simplest, to strongest and most complicated:

My argument against the use of semantic versioning is simply that if we don't specify this additional level of semantics. (i.e. what can I assume when the minor but not major version has changed?), then we actually don't gain anythign for taking the extra step. Furthermore, if something looks like semantic versioning, it's reasonable to assume that it is. So, if we don't want to define the exact semantics, I don't think we should have a versioning scheme that "looks like" semantic versioning.

My fear with versioning, is that it adds a false sense of safety. All the nice things above only applies if authors use versions correctly, and we have no checks for verifying if a package has changed or not.

One thing we could do is to define some kind of hash of a package, but if we did, that could pretty much be used instead of a version for at least some of the benefit.

In summary, I'm leaning towards the simplest version. I.e. a version field with no further semantics than "different is different". This allows different systems and people to communicate the name of the version they have. It does allow individual users to specify stricter semantics on their use, but would not make this universal.

Other thoughts?

niemela commented 2 months ago

One more question regarding this... As usual naming is hard :). Currently we call the version of the problem package format that is targeted problem_format_version and the original suggestion for the problem version is version.

I think this make sense. I would argue that there is an implied "of the problem" on everything in problem.yaml. I.e. "name" means "name of the problem", so "version" is "version of the problem". That's why the problem version can use the short form and the problem package format version must use the long.

Would somebody like to make a good argument for some alternative?

RagnarGrootKoerkamp commented 2 months ago

problem_format_version sgtm.

version could be mistunderstood for problem format version when problem_format_version is not present, but since both will only available in the new version, problem_format_version will be required to use the version field in the first place, so that's probably fine?

Alternatively problem_version also works, but I agree with your reasoning that problem_ is implicit.

Tagl commented 2 months ago

By the same argument, would format_version not be sufficient? I don't mind which one is used.

I do believe there is value in having an ordered version. In which cases would an ordered version not work? Internal changes within archives that aren't from the original source/author may require some special handling.

I agree that if you change the input/output format entirely then it is essentially a different problem.

From my experience this is what can happen and how I view the severity of the changes.

niemela commented 2 months ago

By the same argument, would format_version not be sufficient?

Maybe to some extend, but I would argue at least slightly less so. It does make sense to say that this is the "problem format version of the problem".

I don't mind which one is used.

I agree that either would work.

I do believe there is value in having an ordered version. In which cases would an ordered version not work?

I think only when problems branch. I.e. when you shouldn't IMO use version to distinguish them anyway, but rather treat it as a new problem. I don't think that "variants" of problems should be considered "versions" of the "same problem".

Internal changes within archives that aren't from the original source/author may require some special handling.

Yes, but that's an issue with all of the options.

I agree that if you change the input/output format entirely then it is essentially a different problem.

From my experience this is what can happen and how I view the severity of the changes.

  • Fixing a bug in the solutions under submissions/accepted is the most breaking change you can make without changing the problem. Rejudging is necessary and old results are for sure not reliable.

  • Changing test data, validation or the grading method (aggregation or scores) is also a breaking change but it is a bit more nuanced than the above. Here old results may or may not be reliable, old results could be used but their result should be linked to the old version. Here the intent of the change is important.

  • Altering the statement, fixing metadata or providing an attachment is not a breaking change. Old results are reliable.

This would only be relevant if we choose the "(Something like) Semantic versioning" option, but I think that there are three natural levels of "consequences" of an update:

It seems to me that it would be reasonable to map these three (in reverse order) to a major, minor, and patch changes in a semantic versioning system.

Tagl commented 2 months ago

So non-accepted here means not full score? Or simply not an accepted verdict?

niemela commented 2 months ago

So non-accepted here means not full score? Or simply not an accepted verdict?

That's a great question. I think the answer is yes (but it's not perfect in all cases).

Tagl commented 2 months ago

Then I agree with the semantics you described and I think it would be nice to include this. Now is rejudging a recommendation or a requirement? I'd go with recommendation. So two options I see in that case: Option 1:

Option 2:

niemela commented 2 months ago

@RagnarGrootKoerkamp @mzuenni Do you have any thoughts in this?

RagnarGrootKoerkamp commented 2 months ago

No strong opinions on this from my side. I can see that it's useful in general and for semantic clarity but just for contests there's not that much technical benefit.

Actually being able to mark a problem version 2 to indicate IO changed during development may have some value and we can move outdated submissions to submissions_v1 or so. (But in the end that's not really different from just calling it something less structured like submissions_old.)

Generally, having a 3 digit version number sounds somewhat overkill. Lowest increment should be no rejudging, but do we really need the distinction between 'rejudge all' and 'rejudge non accepted'? One could argue any change to testdata makes it into a new problem since the set of accepted solutions may change.

niemela commented 2 months ago

Generally, having a 3 digit version number sounds somewhat overkill. Lowest increment should be no rejudging, but do we really need the distinction between 'rejudge all' and 'rejudge non accepted'?

"Need" is (always) a strong word, but it would certainly be really useful if it is actually set correctly. I worry that it wouldn't be a lot of the time, and then it would either give systems false security causing errors, and/or annoy users that did set it correctly and systems still didn't automatically do the right thing.

RagnarGrootKoerkamp commented 2 months ago

So does that mean a) we should keep things simple to make it easier to use correctly, or b) provide the 3-level version so that those who want can use it correctly, even though most won't?

Generally, we should be clear that this field has a very specific meaning when re-importing, and is not just something users can just do whatever they feel like.

niemela commented 2 months ago

So does that mean a) we should keep things simple to make it easier to use correctly, or b) provide the 3-level version so that those who want can use it correctly, even though most won't?

The problem is that if we actually think that most use it correctly, then tooling can't (shouldn't?) trust the information anyway, so it becomes useless (or at least much less useful).

Generally, we should be clear that this field has a very specific meaning when re-importing, and is not just something users can just do whatever they feel like.

Yes, if we apply very specific semantics, we must make that clear.

niemela commented 2 months ago

After thinking more about this and having some offline discussions I'm more and more leaning towards the simplest version of this:

If someone wants to argue for one of the stringer versions of version, please do so now. Otherwise I'll try to write a PR for this shortly.