alan-turing-institute / climate-informatics-2024-ae

Artefact evaluation (AE) of research objects submitted to Climate Informatics 2024
https://alan-turing-institute.github.io/climate-informatics-2024-ae

Artefact Evaluation badges & criteria #3

Open rolyp opened 6 months ago

rolyp commented 6 months ago

Summary Sentence

Badges that may be awarded to authors who participate in the Artefact Evaluation process.

Subtasks:

See also:

rolyp commented 6 months ago

@dorchard @MarionBWeinzierl @acocac @cassgvp I’ve created a separate issue for this since I imagine these being refined quite a bit as we move forward. Added a link to the current ACM process.

I feel we should probably stick to something broadly equivalent to the first 3 ACM criteria, i.e. Available, Functional, and Reusable, where “Functional” means (broadly) “Reproducible” (the same outputs can be obtained independently using the author’s artifacts). (There may be a case for preferring the term “Reproducible” to “Functional” to make this explicit.)

The ACM has 2 additional criteria/badges, “Results Reproduced” and “Results Replicated”. I think we can ignore both of these: the latter is clearly out of scope and the former is, for our purposes at least, mostly subsumed by Functional.

I think the key point for us is that “Functional” should mean functional with respect to the (computational) results presented in the paper, and “Reusable” should imply Functional.

rolyp commented 6 months ago

Further question:

* Does Available say anything above and beyond author-declared Open Data and Open Materials? If not, will Available be useful/meaningful to have as a separate badge awarded in the addendum?

MarionBWeinzierl commented 6 months ago

> Further question:
>
> * Does Available say anything above and beyond author-declared Open Data and Open Materials? If not, will Available be useful/meaningful to have as a separate badge awarded in the addendum?

I think Available is equivalent to the Open Data badge that CUP is awarding, except that it also extends to the software. Or is software already included in that badge, too?

MarionBWeinzierl commented 6 months ago

I started the checklist by copying over the description from the ACM page. We should think about slight rewording or adding examples where necessary (e.g., I added a reference to FAIR and FAIR4RS).

cassgvp commented 6 months ago

Just adding a link to the hackmd for some context on the discussions: https://hackmd.io/@turing-es/By7jk3eIp

dorchard commented 5 months ago

I think this looks good to me. Should 'Available' also include mention of data, e.g., that relevant data sets are available where possible?

dorchard commented 5 months ago

I added a part to the 'Available' badge about tagging the version / doing a release.

MarionBWeinzierl commented 5 months ago

> I think this looks good to me. Should 'Available' also include mention of data, e.g., that relevant data sets are available where possible?

You are right, we should probably add data explicitly. Although I'd think it's all covered under the first bullet point, so maybe that's enough?

dorchard commented 5 months ago

I updated the Reusable points, which previously repeated points from Functional (Documented, Consistent, Complete, Exercisable). I removed the latter three and added a point about being packaged to enable reuse.

dorchard commented 5 months ago

Moved the text in this issue to a file: https://github.com/alan-turing-institute/climate-informatics-2024-ae/blob/main/badges.md

dorchard commented 5 months ago

Are we happy with these now?

MarionBWeinzierl commented 5 months ago

Just one question about the "exercisable" point: When we talk about obtaining results, do we want to explicitly talk about generating figures, too, or are we happy with a dump of numbers?

rolyp commented 5 months ago

@MarionBWeinzierl I think it’s reasonable to expect figures to be reproducible (via some kind of script or manual process), with some room for reviewer discretion.
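
For example, a minimal figure-regeneration script along these lines would be enough. (This is just a Python sketch for illustration; the file names and column names below are hypothetical, not taken from any submission.)

```python
# Hypothetical sketch: regenerate one of the paper's figures from the
# results file written by the artefact's experiment pipeline.
# "results.csv", its columns and "figure2.png" are illustrative names.
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("results.csv")   # assumed output of the experiment scripts

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(results["year"], results["temperature_anomaly"])
ax.set_xlabel("Year")
ax.set_ylabel("Temperature anomaly (K)")
ax.set_title("Figure 2 (reproduced)")

fig.savefig("figure2.png", dpi=300)    # regenerates the figure as it appears in the paper
```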

MarionBWeinzierl commented 5 months ago

OK, I added that under "exercisable".

rolyp commented 4 months ago

Added a TODO to look at the POPL 2022 AE reviewer guidelines, as it might be useful to add a bit more structure to the review format.

For example, they suggest organising reviews around specific content in the paper:

* Q1: What is the central contribution of the paper?
* Q2: What claims do the authors make of the artifact, and how does it connect to Q1 above?
* Q3: Can you list the specific, significant experimental claims made in the paper (such as figures, tables, etc.)?
* Q4: What do you expect as a reasonable range of deviations for the experimental results?

* Q9: Does the artifact provide evidence for all the claims you noted in Q3? This corresponds to the completeness criterion of your evaluation.
* Q10: Do the results of running / examining the artifact meet your expectations after having read the paper? This corresponds to the criterion of consistency between the paper and the artifact.
* Q11: Is the artifact well-documented, to the extent that answering questions Q5–Q10 is straightforward?

I think the idea of focusing reviews on specific claims made in the paper (in the form of particular figures or tables) is a good one. It might help reviewers make their reviews more evidence-based, and encourage authors to think of their artefacts in terms of how they support specific claims.
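
For Q4 in particular, one way to make the "reasonable range of deviations" concrete is a small tolerance check along the following lines (a Python sketch only; the file name, metric names and 5% tolerance are purely illustrative):

```python
# Hypothetical sketch: compare reproduced metrics against the values
# reported in the paper, within a stated relative tolerance (Q4).
import numpy as np
import pandas as pd

reported = {"rmse": 0.42, "bias": -0.03}         # values quoted in the paper (illustrative)
reproduced = pd.read_csv("metrics.csv").iloc[0]  # assumed output of the artefact

for name, expected in reported.items():
    ok = np.isclose(reproduced[name], expected, rtol=0.05)  # 5% relative tolerance
    print(f"{name}: reported={expected}, reproduced={reproduced[name]:.3f}, within tolerance: {ok}")
```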