fchollet / ARC-AGI

The Abstraction and Reasoning Corpus
Apache License 2.0
3.31k stars 548 forks

Some OCD concerns about the guides #127

Open basilkorompilias opened 2 months ago

basilkorompilias commented 2 months ago

Hey there! I just want to add some input about the way you describe the data, since it is critically important for us to understand it right away. There are three descriptions: one on Kaggle, one on the website, and one here on the README page, which I only just found. I believe the README page of this repo has the best and clearest presentation of the hierarchy and logic of the files (too bad for me that I did not open every link first and spent hours trying to figure out the basics).

My concern is with the way the actual datasets are structured. More specifically, in "arc-agi_training-challenges.json" the tests are placed first, which causes some confusion when reading the file. This might sound trivial to many, and computationally it might not be a problem at all for well-structured models, but logically they should come after the training data, in the order the guides mention them.

Before finding the clear and very direct explanation in this repo, I made the following overview, which I am sharing in case you wish to adapt it and place it on Kaggle and your website to improve your guides.

I also have a concern about the term "Train" when we discuss AGI, but I will open a separate thread about this.

Why is this important?

P.S. If I am the only one who sees this as unorthodox, please excuse me: I am not an engineer, but a designer and information architect first. I just hope my input can help you become more consistent and specific, which is important when outlining tasks.

Cheers, Basil.


Dataset Structure Overview

Each dataset is a collection of tasks, uniquely identified by an ID. Each task includes training data to develop models and test data to evaluate their performance.

Tasks Collection:

ID Object

Here is the structured representation with emphasis:

Example structure from the JSON file with the correct hierarchy:

{
  "ID": {
    "train": [
      {
        "input": [
          [/* grid data */]
        ],
        "output": [
          [/* grid data */]
        ]
      },
      {
        "input": [
          [/* grid data */]
        ],
        "output": [
          [/* grid data */]
        ]
      }
      // more train entries
    ],
    "test": [
      {
        "input": [
          [/* grid data */]
        ]
      }
      // typically one test input per ID, though some tasks have more
    ]
  }
  // more IDs
}
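
For anyone who wants to read the file in the order shown above, here is a minimal Python sketch, not an official loader. It assumes only the layout described in this overview: a top-level object keyed by task ID, each holding a "train" list and a "test" list. The task ID `task_0001`, the grids, and the helper names `reorder_task` and `summarize` are made up for illustration. Since JSON object keys carry no meaningful order, moving "train" before "test" changes only how the data reads, not what it contains.

```python
import json


def reorder_task(task):
    """Return a copy of one task with "train" listed before "test"."""
    ordered = {"train": task["train"], "test": task["test"]}
    # Keep any other keys (none are assumed to exist) after the two known ones.
    for key, value in task.items():
        if key not in ordered:
            ordered[key] = value
    return ordered


def summarize(tasks):
    """Map each task ID to (number of train pairs, number of test inputs)."""
    return {
        task_id: (len(task["train"]), len(task["test"]))
        for task_id, task in tasks.items()
    }


# Tiny made-up sample with "test" first; placeholder ID and grids,
# not real ARC data.
sample_text = """
{
  "task_0001": {
    "test": [{"input": [[0, 1], [1, 0]]}],
    "train": [
      {"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]},
      {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
    ]
  }
}
"""

tasks = json.loads(sample_text)
reordered = {task_id: reorder_task(task) for task_id, task in tasks.items()}

print(list(reordered["task_0001"].keys()))
print(summarize(reordered))
```

To apply this to a real file, `json.loads(sample_text)` could be swapped for `json.load` on an opened file, and `json.dump(reordered, f, indent=2)` would write the tasks back out with "train" first.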