fchollet / ARC-AGI

The Abstraction and Reasoning Corpus
Apache License 2.0

Public eval set: errors and ambiguities detected during manual solve #123

Open malcolmsharpe opened 2 months ago

malcolmsharpe commented 2 months ago

I manually solved the entire public eval set in the ARCreate Playground. While doing so, I noticed errors and ambiguities in some tasks.

Here is the list of all the errors and ambiguities that I recorded. Please note that the versions of the tasks are the ones that were live in the ARCreate Playground during the dates I was solving them (2024-06-17 - 2024-07-03), so they may not reflect the latest commits to the repo. For each task, I give its number in the ARCreate Playground, as well as its filename.

(My original motivation was to determine human baseline performance, which I discuss in the last section.)

Errors (recoverable)

These tasks had errors, but I was able to produce an accepted test output anyway.

Ambiguities

These tasks have more than one test output that is consistent with the demonstrations, in my opinion. In some cases, I needed multiple tries to produce an accepted test output.

Errors (unrecoverable)

For these tasks, I was not able to produce an accepted test output, so I suspect errors exist. I have not checked the files.

Human baseline performance

I was able to find a working idea for every task, including the ones with ambiguities and unrecoverable errors, so, if these are fixed, I would consider human performance on the public eval set to be 100%. Without the fixes:

Thus, human performance is somewhere between 93.0% and 98.8%, depending on the impact of recoverable errors and ambiguities.
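The bounds above read as simple fractions of the task count. Below is a minimal sketch of that arithmetic, assuming the public eval set has 400 tasks. The counts used (5 tasks with unrecoverable errors, 23 with recoverable errors or ambiguities) are hypothetical: they are back-calculated from the quoted percentages, not taken from the lists in this issue, and `baseline_bounds` is an illustrative name.

```python
def baseline_bounds(total, unrecoverable, uncertain):
    """Return (lower, upper) human-baseline accuracy in percent.

    upper: only tasks with unrecoverable errors count as failures.
    lower: tasks with recoverable errors or ambiguities also count as failures.
    """
    upper = 100 * (total - unrecoverable) / total
    lower = 100 * (total - unrecoverable - uncertain) / total
    return lower, upper


# Hypothetical counts, chosen only because they are consistent with the
# quoted 93.0% to 98.8% range; they are assumptions, not reported figures.
lower, upper = baseline_bounds(total=400, unrecoverable=5, uncertain=23)
print(f"{lower:.1f}% to {upper:.1f}%")
```

If the actual counts differ, only the three arguments need to change.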

However, practical human performance will be negatively impacted by the difficulty of producing the test output. Some tasks are unergonomic (particularly the numerous fill-in-the-pattern tasks): producing the test output is challenging and time-consuming even if you have the right idea. Ergonomics could be improved with a more powerful solver tool. In particular, copy-paste with mirroring and rotation would simplify fill-in-the-pattern tasks, among others.
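To make the copy-paste suggestion concrete, here is a rough sketch of the grid operations such a tool would need, assuming grids are stored as lists of rows of integer color codes (the layout used by the ARC task JSON files). The function names are hypothetical and do not correspond to any existing playground API.

```python
def rotate_cw(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def mirror_lr(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]


def mirror_ud(grid):
    """Mirror a grid top-to-bottom."""
    return [list(row) for row in grid[::-1]]


def paste(target, patch, top, left):
    """Return a copy of target with patch written at row top, column left."""
    out = [list(row) for row in target]
    for r, row in enumerate(patch):
        for c, value in enumerate(row):
            out[top + r][left + c] = value
    return out


# Fill a symmetric pattern: copy a 2x2 patch, then paste its mirror image beside it.
patch = [[1, 2],
         [3, 4]]
canvas = [[0] * 4 for _ in range(2)]
canvas = paste(canvas, patch, 0, 0)
canvas = paste(canvas, mirror_lr(patch), 0, 2)
```

With these primitives, one hand-drawn quadrant is enough to fill a pattern that is symmetric under reflection or rotation.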

lukegustafson commented 2 months ago

FYI I think some of these have been fixed already, e.g. see https://github.com/fchollet/ARC-AGI/commit/b7fd42c53f0c26a807ba0b00e42f858d2c11d125

nblaxall commented 1 month ago

Thank you for doing this. BTW, for the "Errors (unrecoverable)" section, I tried the following three (as of 2024-07-19) and got the correct solution on my first attempt for each of them: 275 (b4a43f3b), 138 (58e15b12), and 100 (423a55dc). So I didn't see an error with these three (at least on my attempt). Maybe they have been fixed, or maybe your attempt just went wrong (which is understandable when doing a lot of them and getting tired, etc.).

Update: oh I think they may have been fixed actually!

neoneye commented 1 month ago

I think you may have been using an editor that is out of sync with the latest ARC-AGI dataset.

There have been lots of fixes to this ARC-AGI repo around May/June 2024, leading up to the launch of ARC-Prize.

I checked some of the issues you reported, and they seemed fixed.

This is a huge list of issues that is hard to work with. I recommend that @malcolmsharpe split this into smaller, separate issues and close this huge GitHub issue.

Examples of well formatted github issues regarding specific tasks.

gth001 commented 1 month ago

We have an active Discord chat with a #tasks channel where we discuss individual mistakes, if you'd like to talk about these. For example, 50f325b5 is discussed at https://discord.com/channels/1237180803335720961/1263536157728182353 . Each task gets its own discussion thread with a picture, so it's easy to discuss.