Rework the feedback table

bmesuere commented 1 year ago

The existing design of the Dodona feedback table has remained unchanged since its initial release. However, several challenges have emerged over time that the current layout does not adequately address:

The debugger is hidden and there is no easy way to add a button in the current design.
In cases where there are numerous tests with only a few failures, it becomes difficult to quickly identify the errors.
There is currently no system for numbering the tests; numbers would make it easier to reference them individually. (also see #357)
Sometimes, the indication of correctness relies solely on color, which can be problematic for users with color vision deficiencies.

The main obstacle in redesigning the feedback table is its inherent flexibility. The feedback is structured across several layers: tab, group, testcase, and test. A practical first step would be to focus on improving the group level and then evaluate whether this addresses the majority of the identified issues.

Currently, the group level, which includes the debugger, has only one UI element: a colored border on the left that also functions as a tutor button.

My suggestion is to transform each group into a card style element (no elevation, rounded corners, and a subtle hairline border). The card would have a header with a lightly colored background and a content area containing the current content (excluding the colored left border). These cards would be collapsible, showing only the header when collapsed.

The header would contain the number of the test group (linkable), the status (correct, wrong), a button to start debugging and a button to expand/collapse the group.

By default, I would collapse all correct groups, but also add a toggle to expand all (similar to the current diff option).

Above the current tab, a visual overview of all the groups could be added, for example as colored circles, to make failed tests easier to spot. Clicking a circle would bring the corresponding group into focus, possibly with a changed border color to accentuate it further. This could be achieved using the :target CSS selector. To stay visible while scrolling, the tabs and the group overview would use position: sticky.
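To make the default collapse rule and the linkable group numbers concrete, here is a minimal sketch in Python. It is not Dodona code: the `Group` record, the `group-N` anchor format and both helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical stand-in for one test group in a feedback tab."""
    number: int     # 1-based position, shown in the header and used for linking
    accepted: bool  # True when every test in the group passed


def anchor_id(group: Group) -> str:
    """Fragment identifier that a header link or overview circle could point at."""
    return f"group-{group.number}"


def collapsed_by_default(group: Group, expand_all: bool = False) -> bool:
    """Correct groups start collapsed unless the expand-all toggle is on."""
    return group.accepted and not expand_all


groups = [
    Group(number=i, accepted=ok)
    for i, ok in enumerate([True, False, True], start=1)
]
# Maps each anchor to its initial collapsed state.
initial_state = {anchor_id(g): collapsed_by_default(g) for g in groups}
```

With this rule only the failed group (`group-2`) starts expanded, and a link ending in `#group-2` is the kind of fragment the :target selector would match.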

pdawyndt commented 1 year ago

If valuable for rethinking the design: individual testcases and groups currently only have a Boolean status, but for students it might be more informative to distinguish why a testcase/group failed (runtime error, time limit exceeded, memory limit exceeded). For that purpose, we're also missing a status for testcases/groups that have not been executed due to an earlier fatal issue (e.g. time limit exceeded for an earlier group).

https://github.com/dodona-edu/universal-judge/issues/342
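As a rough illustration of what a richer status could look like, the sketch below replaces the Boolean with an enum and marks everything after a fatal status as not executed. The enum members, the `FATAL` set and `mark_not_executed` are hypothetical; in particular, which statuses count as fatal is an assumption, not something decided in this thread.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical status values; the names are illustrative only."""
    CORRECT = "correct"
    WRONG = "wrong"
    RUNTIME_ERROR = "runtime error"
    TIME_LIMIT_EXCEEDED = "time limit exceeded"
    MEMORY_LIMIT_EXCEEDED = "memory limit exceeded"
    NOT_EXECUTED = "not executed"


# Assumption: after one of these, later groups are no longer run.
FATAL = {Status.TIME_LIMIT_EXCEEDED, Status.MEMORY_LIMIT_EXCEEDED}


def mark_not_executed(statuses: list[Status]) -> list[Status]:
    """Replace every status after the first fatal one with NOT_EXECUTED."""
    result = []
    fatal_seen = False
    for status in statuses:
        result.append(Status.NOT_EXECUTED if fatal_seen else status)
        if status in FATAL:
            fatal_seen = True
    return result
```

A group overview could then use a distinct marker for "not executed" instead of showing those groups as wrong.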

pdawyndt commented 1 year ago

When drafting the TESTed 2.0 paper, we also renamed the tab-context-testcase-test grouping to unit-case-test-output, to align it better with unittest terminology.

The test suites have a hierarchical structure: a test suite may have multiple units. Each unit is tested by multiple test cases. Each test case has multiple tests (Figure 3) [...].

TESTed automatically reports errors that occur during generation and compilation of language-specific test harnesses. This behavior is hardcoded in its language-specific modules and needs no further configuration in test suites. However, a test case might specify input data that is made available when TESTed runs its language-specific test harness. The input data is made available as files on the file system, passed as arguments, streamed through standard input (stdin), or any combination thereof. The main call is automatically scheduled if needed and treated as a single test, as are the language-specific representations of all statements and expressions of the setup, script and teardown. When TESTed runs a test, it catches any runtime exception and output sent through the standard output streams (stdout and stderr). It also catches the return value when expressions are evaluated and the exit status when the process running the language-specific test harness terminates.
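The suite/unit/case/test hierarchy from the quoted text can be pictured as nested records, which also suggests one possible numbering scheme for referencing individual tests. This is only a sketch: the class names and the `unit.case.test` label format are assumptions and do not reflect the actual TESTed or Dodona data model.

```python
from dataclasses import dataclass, field


@dataclass
class SingleTest:
    description: str
    accepted: bool


@dataclass
class Case:
    tests: list[SingleTest] = field(default_factory=list)


@dataclass
class Unit:
    name: str
    cases: list[Case] = field(default_factory=list)


@dataclass
class Suite:
    units: list[Unit] = field(default_factory=list)


def failed_paths(suite: Suite) -> list[str]:
    """Return 1-based 'unit.case.test' labels for every failed test."""
    paths = []
    for u, unit in enumerate(suite.units, start=1):
        for c, case in enumerate(unit.cases, start=1):
            for t, test in enumerate(case.tests, start=1):
                if not test.accepted:
                    paths.append(f"{u}.{c}.{t}")
    return paths
```

Labels of this kind stay stable as long as the order of the test suite does not change, which is what makes them usable in links and in discussions with students.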

pdawyndt commented 1 year ago

Lately, with the Python judge, we see students struggle to spot the difference between the expected and the generated runtime error, especially when the mistake is in the exception message. The main difference between the two is that the generated error also includes the traceback leading to the offending code, whereas the expected error only contains the type and the message of the exception raised. The diff in the feedback table fails to align the lines containing the exception type and message, so the highlighting ends up misleading students more than helping them.
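One possible direction, sketched below, is to reduce both sides to the "Type: message" line before diffing. This is not how the Python judge currently works; `exception_summary` and `summary_from_traceback` are hypothetical helpers, and taking the last non-empty line assumes a single-line exception message.

```python
import traceback


def exception_summary(exc: BaseException) -> str:
    """Return only the 'Type: message' part of an exception, without the stack."""
    return "".join(traceback.format_exception_only(type(exc), exc)).strip()


def summary_from_traceback(text: str) -> str:
    """Take the last non-empty line of a printed traceback.

    Assumes the exception message fits on a single line.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    return lines[-1].strip() if lines else ""
```

Comparing `summary_from_traceback(generated)` against the expected "Type: message" string would let the diff highlight only the differing part of the message, while the full traceback could still be shown separately.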

dodona-edu / dodona

Rework the feedback table #5101