JeniT / linked-csv

A souped-up CSV-based data format

Support Deep-Hierarchic Structures in a Single Document #4

Open craig552uk opened 11 years ago

craig552uk commented 11 years ago

The Linked CSV standard allows multiple rows to be joined on the $id field. However, this can only represent list structures. Extending this by introducing a new prolog line, join, would allow recursive structures of any depth to be represented in tabular format.

The examples below illustrate how this might work with University course data represented as a 3-layered hierarchical data structure: Department > Course > Requirement

courses.csv

 #    ,$id0 ,department       ,$id1     ,course_name         ,location          ,level         ,grades
 join ,     ,                 ,#courses ,#courses            ,#courses          ,#courses      ,#courses
 join ,     ,                 ,         ,                    ,#location         ,#requirements ,#requirements 
      ,#AMS ,American Studies ,         ,                    ,                  ,              ,
      ,#AMS ,                 ,#T700    ,American Studies BA ,                  ,              ,
      ,#AMS ,                 ,#T700    ,                    ,On Campus         ,A             ,BBB
      ,#AMS ,                 ,#T700    ,                    ,Distance Learning ,IB            ,Pass diploma with 30 points
      ,#AMS ,                 ,#T700    ,                    ,                  ,Access        ,Pass diploma with 30 level 3 credits
      ,#AMS ,                 ,#T700    ,                    ,                  ,BTEC          ,Pass diploma with DDM
      ,#AMS ,                 ,#T701    ,American Studies MA ,                  ,              ,
      ,#AMS ,                 ,#T701    ,                    ,On Campus         ,A             ,ABB
      ,#AMS ,                 ,#T701    ,                    ,Distance Learning ,IB            ,Pass diploma with 32 points
      ,#AMS ,                 ,#T701    ,                    ,                  ,Access        ,Pass diploma with 30 level 3 credits
      ,#AMS ,                 ,#T701    ,                    ,                  ,BTEC          ,Pass diploma with DDM

courses.json

[{
  "@id": "#AMS",
  "department": "American Studies",
  "courses":[{
    "@id": "#T700",
    "course_name": "American Studies BA",
    "location": ["On Campus", "Distance Learning"],
    "requirements": [
      {"level": "A",      "grades": "BBB"},
      {"level": "IB",     "grades": "Pass diploma with 30 points"},
      {"level": "Access", "grades": "Pass diploma with 30 level 3 credits"},
      {"level": "BTEC",   "grades": "Pass diploma with DDM"}
    ]
  },{
    "@id": "#T701",
    "course_name": "American Studies MA",
    "location": ["On Campus", "Distance Learning"],
    "requirements": [
      {"level": "A",      "grades": "ABB"},
      {"level": "IB",     "grades": "Pass diploma with 32 points"},
      {"level": "Access", "grades": "Pass diploma with 30 level 3 credits"},
      {"level": "BTEC",   "grades": "Pass diploma with DDM"}
    ]
  }]
}]

The $id0 field is used to join table rows as a single record in much the same way as $id is used in Linked CSV. Subsequent $id* fields are used to join rows at lower levels of the data structure.

$id* fields must be numbered with consecutive integers that specify the level of the structure they apply to. The one exception is $id, which is an alias for $id0.
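As a rough sketch of the numbering rule above, the check could look like this in Python. The helper name id_levels is hypothetical and not part of any Linked CSV tooling; it also assumes "incremental" means the levels run 0, 1, 2, ... with no gaps or repeats.

```python
import re


def id_levels(header):
    """Map each $id* column name to its level, treating $id as $id0."""
    levels = {}
    for name in header:
        match = re.fullmatch(r"\$id(\d*)", name)
        if match:
            # A bare "$id" has no digits and is read as level 0.
            levels[name] = int(match.group(1) or 0)
    found = sorted(levels.values())
    # Assumption: levels must be exactly 0, 1, 2, ... (no gaps, no repeats).
    if found != list(range(len(found))):
        raise ValueError(f"$id levels must run 0, 1, 2, ... but got {found}")
    return levels


header = ["#", "$id0", "department", "$id1", "course_name",
          "location", "level", "grades"]
levels = id_levels(header)
```

With the courses.csv header, this maps $id0 to level 0 and $id1 to level 1; a header containing both $id and $id0 would be rejected as a repeated level.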

The scope (across fields) of the joins at each level is specified by the join prolog lines. join statements must be listed in order of increasing specificity, from the outermost level of the structure to the innermost. In the example above, all fields that join under the #courses identifier are joined into an object on the second tier of the structure.

Every $id* field below the top level ($id1 and higher) must have an associated join statement. No join statement is needed to specify the scope of the top level ($id0), as it is assumed to be composed of all fields.

If a join statement is provided across multiple fields (e.g. #requirements), those fields are joined into an object. In this case, if no $id* field is provided, each row of the table is considered to be a separate object.

If a join statement is provided across a single field (e.g. #location), that field is joined into a list. No $id* field can be used in this case.

Multiple join scopes can be specified in a single statement, but all scopes in a statement must be contained within the scope of the preceding join statement (except the first).
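The scope rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (join_scopes, check_nesting, kind) are made up for the example, the leading "#"/"join" cell is assumed to be stripped from each row already, and a scope is counted as a single field when its label appears in exactly one column.

```python
def join_scopes(header, join_row):
    """Group column names by the scope label used in one join line."""
    scopes = {}
    for name, label in zip(header, join_row):
        if label:
            scopes.setdefault(label, set()).add(name)
    return scopes


def check_nesting(header, join_rows):
    """Parse every join line and check each scope sits inside a scope
    of the preceding join line (the first line is unconstrained)."""
    parsed = [join_scopes(header, row) for row in join_rows]
    for outer, inner in zip(parsed, parsed[1:]):
        for label, cols in inner.items():
            if not any(cols <= outer_cols for outer_cols in outer.values()):
                raise ValueError(f"scope {label} is not nested in the preceding join")
    return parsed


def kind(cols):
    """A single-field scope becomes a list; a multi-field scope an object."""
    return "list" if len(cols) == 1 else "object"


header = ["$id0", "department", "$id1", "course_name", "location", "level", "grades"]
join_rows = [
    ["", "", "#courses", "#courses", "#courses", "#courses", "#courses"],
    ["", "", "", "", "#location", "#requirements", "#requirements"],
]
parsed = check_nesting(header, join_rows)
kinds = {label: kind(cols) for label, cols in parsed[1].items()}
```

For the courses.csv example, the second join line yields #location as a list and #requirements as an object, and both fall inside the #courses scope of the first line.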

I don't know if you think this enhancement is necessary in the Linked CSV spec, as deep hierarchic structures can be created by linking together multiple documents. But I think it would be nice to be able to accommodate this in a single file. What do you think?

JeniT commented 11 years ago

I really like the idea of incorporating nested data in a single file. I'm concerned that the mechanism above is a bit hard to use. I wonder about something like this:

#    ,$id ,department      ,courses,course_name        ,location         ,requirements,level       ,grades
about,    ,                ,       ,courses            ,courses          ,courses     ,requirements,requirements
     ,#AMS,American Studies,#T700  ,American Studies BA,On Campus        ,            ,A           ,BBB
     ,#AMS,                ,#T700  ,                   ,Distance Learning,            ,IB          ,Pass diploma with 30 points
     ,#AMS,                ,#T700  ,                   ,                 ,            ,Access      ,Pass diploma with 30 level 3 credits
     ,#AMS,                ,#T700  ,                   ,                 ,            ,BTEC        ,Pass diploma with DDM
     ,#AMS,                ,#T701  ,American Studies MA,On Campus        ,            ,A           ,ABB
     ,#AMS,                ,#T701  ,                   ,Distance Learning,            ,IB          ,Pass diploma with 32 points
     ,#AMS,                ,#T701  ,                   ,                 ,            ,Access      ,Pass diploma with 30 level 3 credits
     ,#AMS,                ,#T701  ,                   ,                 ,            ,BTEC        ,Pass diploma with DDM

This keeps the properties in the header line, and then references the properties from the (proposed) about line. It means adding a new column for the (identifiers for the) requirements, which is blank.

We'll have to assume that any column that is mentioned in the about line holds URLs, and that a blank value means a blank node (as with the $id column).
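To make the idea concrete, here is a minimal Python sketch of how such a file might be read into nested objects. It is not a reference implementation: the function names are hypothetical, the sample is a shortened, unpadded version of the table above, and it assumes that a blank cell in the about line means "about the $id row", that an identifier column appears to the left of the columns that are about it, that a blank identifier yields a new blank node only when one of its own columns has a value in that row, and that a repeated literal property collapses into a list.

```python
import csv
import io

SAMPLE = """\
#,$id,department,courses,course_name,location,requirements,level,grades
about,,,,courses,courses,courses,requirements,requirements
,#AMS,American Studies,#T700,American Studies BA,On Campus,,A,BBB
,#AMS,,#T700,,Distance Learning,,IB,Pass diploma with 30 points
,#AMS,,#T701,American Studies MA,On Campus,,A,ABB
"""


def add_value(obj, prop, value):
    """Store a literal; a repeated property becomes a list of values."""
    if prop not in obj:
        obj[prop] = value
    elif isinstance(obj[prop], list):
        obj[prop].append(value)
    else:
        obj[prop] = [obj[prop], value]


def parse_about_csv(text):
    rows = [[c.strip() for c in r] for r in csv.reader(io.StringIO(text)) if r]
    names = rows[0][1:]
    about = next(r for r in rows[1:] if r[0] == "about")[1:]
    # Each column's parent is the column named in the about line, else $id.
    parent = {n: (a or "$id") for n, a in zip(names, about) if n != "$id"}
    id_cols = {"$id"} | set(parent.values())
    known, top = {}, []
    for r in rows[1:]:
        if r[0]:  # skip prolog lines such as "about"
            continue
        cells = dict(zip(names, r[1:]))
        current = {}  # the object each identifier column refers to in this row
        for name in names:
            value = cells[name]
            if name not in id_cols:
                owner = current.get(parent[name])
                if value and owner is not None:
                    add_value(owner, name, value)
                continue
            has_children = any(cells[c] for c, p in parent.items() if p == name)
            if not value and not has_children:
                continue
            if value and (name, value) in known:
                current[name] = known[(name, value)]
                continue
            obj = {"@id": value} if value else {}  # blank value: blank node
            if value:
                known[(name, value)] = obj
            current[name] = obj
            if name == "$id":
                top.append(obj)
            elif current.get(parent[name]) is not None:
                current[parent[name]].setdefault(name, []).append(obj)
    return top


result = parse_about_csv(SAMPLE)
```

Run on the shortened sample, this produces one #AMS department with two courses, where each blank requirements cell becomes a separate nested object without an @id, broadly matching the shape of courses.json.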

craig552uk commented 11 years ago

This is much clearer and less open to mistakes - I like it.

I'd like to suggest "in" as an alternative preposition to "about", so the fields could be read like...

"course_name in courses"
"level in requirements"

Also, for readability, could the specification support an optional $ character prefixed to any field that is to be treated as an ID?