malcolmsharpe commented 4 months ago

I manually solved the entire public eval set in the ARCreate Playground. While doing so, I noticed errors and ambiguities in some tasks.

Here is the list of all the errors and ambiguities that I recorded. Please note that the versions of the tasks are the ones that were live in the ARCreate Playground during the dates I was solving them (2024-06-17 - 2024-07-03), so they may not reflect the latest commits to the repo. For each task, I give its number in the ARCreate Playground, as well as its filename.

(My original motivation was to determine human baseline performance, which I discuss in the last section.)

Errors (recoverable)

These tasks had errors, but I was able to produce an accepted test output anyway.

5 (05a7bcf2) Output 1: The right-most red shape has two extra red pixels.
96 (3ee1011a) Input 1: The green bar should be length 3, not length 4.
110 (4852f2fa) Output 1: The second copy of the shape has one pixel in the wrong position.
245 (9def23fe) Output 1: In the upper-left, the two red pixels should be black pixels.
321 (d2acf2cb) Demonstration 2: Input and output are swapped.
356 (e6de6e8f) Outputs 3 and 4: The red pixel path is wrong in both of these outputs.
296 (c35c1b4c) Output 3: Missing a red pixel.
203 (85fa5666) Output 2: The orange line heading down-left stops too early.
330 (d931c21c) Output 4: Bottom rectangle has an extra red pixel in its upper-left corner.
336 (dd2401ed) Output 4: The red pixels left of the grey divider should be blue.
336 (dd2401ed) Output 3: The grey bar should have gone to the right of the first group of red pixels, I think.
260 (ac0c5833) Demonstration 3: Red pixel patterns in input and output don't match.
254 (a8610ef7) Output 4: The top and bottom pixels of the 2nd column should be red. (This is an exception where my test output wasn't accepted.)
306 (c92b942c) Output 2: Has 4 erroneous green pixels in the bottom row and right column.
342 (e1d2900e) Output 1: The blue pixel at the bottom should have moved rightwards.
231 (97239e3d) Test Input: Two donuts incorrectly have red pixels inside.
288 (bd14c3bf) Test input: Unlike every demo, the input top-left shape isn't red.
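
Reports like the ones above are easiest to double-check by diffing a published grid against a hand-corrected copy. The following is a minimal sketch, assuming grids are lists of lists of integer color codes (the layout used by the repo's task JSON files). `grid_diff` is a hypothetical helper written for illustration, not part of any ARC tooling, and the grids shown are toy data rather than real task content.

```python
def grid_diff(actual, expected):
    """Return (row, col, actual_value, expected_value) for each differing cell."""
    # Refuse to compare grids whose shapes differ, since a cell-by-cell
    # diff would be meaningless in that case.
    if len(actual) != len(expected) or any(
        len(a) != len(e) for a, e in zip(actual, expected)
    ):
        raise ValueError("grids have different shapes")
    return [
        (r, c, a, e)
        for r, (row_a, row_e) in enumerate(zip(actual, expected))
        for c, (a, e) in enumerate(zip(row_a, row_e))
        if a != e
    ]


# Toy example (not real task data): one stray 2 in the top-right cell.
published = [[0, 0, 2], [0, 2, 0]]
corrected = [[0, 0, 0], [0, 2, 0]]
print(grid_diff(published, corrected))
```

An empty result means the two grids agree everywhere; each tuple otherwise pinpoints one cell to mention in a report.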

Ambiguities

These tasks have more than one test output that is consistent with the demonstrations, in my opinion. In some cases, I needed multiple tries to produce an accepted test output.

19 (0d87d2a6) Test Input: There's a rectangle that is just grazed, not pierced. Should it change color?
77 (310f3251): What should happen if there's a colored dot on the left or top edges such that going up-1 left-1 arrives on a black spot?
186 (79fb03f4): In the test middle path, when the dark blue is blocked but can't wrap around below, should it go partway?
381 (f3b10344): Test input contains non-center-aligned same-color rectangles. Demos don't. Should they connect?
277 (b7cb93ac) Test input: Green and pink are equivalent pieces when rotated. Two valid solutions.
271 (b15fca0b) Test input: This is the only time that multiple locally minimal paths are available.
215 (8fbca751): Both rectangle-filling and 4x4-square-filling fit the demonstrations, but they give different results on test.
264 (ad7e01d0): Is the key color always grey or the most frequent color? It matters for test input but not the demos.
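
The notion of ambiguity used in this section can be stated as a check: two candidate rules both reproduce every demonstration, yet they disagree on the test input. Below is a minimal sketch under that definition. The two rules are toy stand-ins loosely inspired by the "always grey vs. most frequent color" question, not the real transformation of any task, and `is_ambiguous` is a hypothetical helper name.

```python
from collections import Counter

GREY = 5  # assumption: grey is color code 5 in the task files


def recolor_grey(grid):
    """Toy rule A: paint every cell grey."""
    return [[GREY for _ in row] for row in grid]


def recolor_most_frequent(grid):
    """Toy rule B: paint every cell with the grid's most frequent color."""
    counts = Counter(cell for row in grid for cell in row)
    color = counts.most_common(1)[0][0]
    return [[color for _ in row] for row in grid]


def is_ambiguous(rule_a, rule_b, demos, test_input):
    """True if both rules fit all demos but give different test outputs."""
    fits_a = all(rule_a(x) == y for x, y in demos)
    fits_b = all(rule_b(x) == y for x, y in demos)
    return fits_a and fits_b and rule_a(test_input) != rule_b(test_input)


# Toy data: grey happens to be the most frequent color in the only demo,
# so the demo cannot separate the two rules, but the test input can.
demos = [([[5, 5, 1]], [[5, 5, 5]])]
test_input = [[1, 1, 5]]
print(is_ambiguous(recolor_grey, recolor_most_frequent, demos, test_input))
```

Adding a demonstration on which the two rules differ is what removes this kind of ambiguity.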

Errors (unrecoverable)

For these tasks, I was not able to produce an accepted test output, so I suspect errors exist. I have not checked the files.

275 (b4a43f3b): This seems easy but it isn't accepting my output.
254 (a8610ef7): I have an idea that explains the demonstrations aside from the tiny error in Output 4 (mentioned previously), but my test output isn't accepted.
125 (50f325b5): Output 2's lower-right corner shows that mirroring the shape isn't allowed, but the reference test output mirrors the shape. (This is an exception where I was able to produce an accepted test output.)
138 (58e15b12): The idea is easy but my output isn't accepted.
100 (423a55dc): The idea is easy but my output isn't accepted.

Human baseline performance

I was able to find a working idea for every task, including the ones with ambiguities and unrecoverable errors, so, if these are fixed, I would consider human performance on the public eval set to be 100%. Without the fixes:

1.3% of tasks have unrecoverable errors, reducing solve rate to 98.8%.
7.0% of tasks have any error or ambiguity, so 93.0% of tasks are entirely fair.

Thus, human performance is somewhere between 93.0% and 98.8%, depending on the impact of recoverable errors and ambiguities.
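
These percentages can be reproduced from the task numbers listed in the sections above. The sketch below assumes the public eval set contains 400 tasks (that total is not stated in this post) and counts a task once even if it appears in more than one list, as task 254 does.

```python
from fractions import Fraction

TOTAL_TASKS = 400  # assumption: size of the public eval set

# Playground task numbers copied from the three lists above.
recoverable = {5, 96, 110, 245, 321, 356, 296, 203, 330, 336,
               260, 254, 306, 342, 231, 288}
ambiguous = {19, 77, 186, 381, 277, 271, 215, 264}
unrecoverable = {275, 254, 125, 138, 100}

# Set union removes the duplicate entry for task 254.
flagged = recoverable | ambiguous | unrecoverable

unrecoverable_share = Fraction(len(unrecoverable), TOTAL_TASKS)
flagged_share = Fraction(len(flagged), TOTAL_TASKS)

print(len(unrecoverable), float(unrecoverable_share))
print(len(flagged), float(flagged_share))
```

Under that assumption, 5 of 400 tasks is 1.25% (quoted as 1.3%, leaving 98.75%, quoted as 98.8%), and 28 of 400 tasks is exactly 7.0%.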

However, practical human performance will be negatively impacted by the difficulty of producing the test output. Some tasks are unergonomic (particularly the numerous fill-in-the-pattern tasks), where it is challenging and time-consuming to produce the test output even if you have the right idea. Ergonomics could be improved with a more powerful solver tool: in particular, copy-paste with mirroring and rotation would simplify fill-in-the-pattern tasks, among others.
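
To make the copy-paste suggestion concrete, here is a minimal sketch of the grid operations such a tool would need, assuming grids are lists of lists of integer color codes. All function names are hypothetical and do not correspond to any existing Playground feature.

```python
def mirror_horizontal(grid):
    """Flip a grid left-to-right."""
    return [row[::-1] for row in grid]


def mirror_vertical(grid):
    """Flip a grid top-to-bottom."""
    return [row[:] for row in grid[::-1]]


def rotate_clockwise(grid):
    """Rotate a grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def paste(canvas, patch, top, left):
    """Return a copy of canvas with patch written at (top, left).

    Assumes the patch fits entirely inside the canvas.
    """
    out = [row[:] for row in canvas]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            out[top + r][left + c] = value
    return out


# Toy example: complete a left-right symmetric pattern from its left half.
patch = [[1, 2], [3, 4]]
canvas = [[0, 0, 0, 0], [0, 0, 0, 0]]
canvas = paste(canvas, patch, 0, 0)
canvas = paste(canvas, mirror_horizontal(patch), 0, 2)
print(canvas)
```

With these primitives, a fill-in-the-pattern task reduces to selecting one intact region and pasting transformed copies of it, instead of painting each cell by hand.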

lukegustafson commented 4 months ago

FYI I think some of these have been fixed already, e.g. see https://github.com/fchollet/ARC-AGI/commit/b7fd42c53f0c26a807ba0b00e42f858d2c11d125

nblaxall commented 4 months ago

Thank you for doing this. BTW, for the "Errors (unrecoverable)" section, I tried the following three (as of 2024-07-19) and got the correct solution on my first attempt for each of them: 275 (b4a43f3b), 138 (58e15b12), and 100 (423a55dc). So I didn't see an error with these three (at least on my attempt). Maybe they have been fixed, or maybe your attempt just went wrong (which is understandable when doing a lot of them and getting tired, etc.).

Update: oh I think they may have been fixed actually!

neoneye commented 3 months ago

I think you may have been using an editor that is out of sync with the latest ARC-AGI dataset.

There have been lots of fixes to this ARC-AGI repo around May/June 2024, leading up to the launch of ARC-Prize.

I checked some of the issues you reported, and they seemed fixed.

This is a huge list of issues that is hard to work with. I recommend that @malcolmsharpe split this into smaller separate issues and close this huge GitHub issue.

Examples of well-formatted GitHub issues regarding specific tasks.

gth001 commented 3 months ago

We have an active Discord chat with a #tasks channel where we discuss individual mistakes, if you'd like to talk about these. For example, 50f325b5 is discussed at https://discord.com/channels/1237180803335720961/1263536157728182353 . Each task gets its own discussion thread with a picture, so it's easy to discuss.

fchollet / ARC-AGI

Public eval set: errors and ambiguities detected during manual solve #123
